[v4] mm: oom: introduce cpuset oom

Message ID 20230411065816.9798-1-ligang.bdlg@bytedance.com (mailing list archive)

Commit Message

Gang Li April 11, 2023, 6:58 a.m. UTC
Cpusets constrain the CPU and memory placement of tasks. The
`CONSTRAINT_CPUSET` type in the OOM killer has existed for a long
time, but has never been utilized.

When a process in a cpuset that constrains memory placement triggers
an OOM, the OOM killer may kill a completely irrelevant process on
other NUMA nodes, which will not release any memory for this cpuset.

We can easily achieve node-aware OOM killing by using
`CONSTRAINT_CPUSET` and selecting the victim from cpusets with the
same mems_allowed as the current one.

Example:

Create two processes named mem_on_node0 and mem_on_node1, each
constrained to its own node by a cpuset. Both processes allocate
memory on their own node. When node0 runs out of memory, the OOM
killer is invoked by mem_on_node0.

Before this patch:

Since `CONSTRAINT_CPUSET` does nothing, the victim is selected from
the entire system. The OOM killer is therefore likely to kill
mem_on_node1, which frees no memory for mem_on_node0. This is a
useless kill.

```
[ 2786.519080] mem_on_node0 invoked oom-killer
[ 2786.885738] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 2787.181724] [  13432]     0 13432   787016   786745  6344704        0             0 mem_on_node1
[ 2787.189115] [  13457]     0 13457   787002   785504  6340608        0             0 mem_on_node0
[ 2787.216534] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
[ 2787.229991] Out of memory: Killed process 13432 (mem_on_node1)
```

After this patch:

The victim is selected only from cpusets that have the same
mems_allowed as the cpuset that invoked the OOM. This prevents
useless kills and protects innocent victims.

```
[  395.922444] mem_on_node0 invoked oom-killer
[  396.239777] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[  396.246128] [   2614]     0  2614  1311294  1144192  9224192        0             0 mem_on_node0
[  396.252655] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
[  396.264068] Out of memory: Killed process 2614 (mem_on_node0)
```

Suggested-by: Michal Hocko <mhocko@suse.com>
Cc: <cgroups@vger.kernel.org>
Cc: <linux-mm@kvack.org>
Cc: <rientjes@google.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
---
Changes in v4:
- Modify comments and documentation.

Changes in v3:
- https://lore.kernel.org/all/20230410025056.22103-1-ligang.bdlg@bytedance.com/
- Provide more details about the use case, testing, implementation.
- Document the userspace visible change in Documentation.
- Rename cpuset_cgroup_scan_tasks() to cpuset_scan_tasks() and add
  a doctext comment about its purpose and how it should be used.
- Take cpuset_rwsem to ensure that cpusets are stable.

Changes in v2:
- https://lore.kernel.org/all/20230404115509.14299-1-ligang.bdlg@bytedance.com/
- Select victim from all cpusets with the same mems_allowed as the current cpuset.

v1:
- https://lore.kernel.org/all/20220921064710.89663-1-ligang.bdlg@bytedance.com/
- Introduce cpuset oom.
---
 .../admin-guide/cgroup-v1/cpusets.rst         | 16 ++++++-
 Documentation/admin-guide/cgroup-v2.rst       |  4 ++
 include/linux/cpuset.h                        |  6 +++
 kernel/cgroup/cpuset.c                        | 43 +++++++++++++++++++
 mm/oom_kill.c                                 |  4 ++
 5 files changed, 71 insertions(+), 2 deletions(-)

Comments

Michal Koutný April 11, 2023, 12:23 p.m. UTC | #1
Hello.

On Tue, Apr 11, 2023 at 02:58:15PM +0800, Gang Li <ligang.bdlg@bytedance.com> wrote:
> +int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
> +{
> +	int ret = 0;
> +	struct css_task_iter it;
> +	struct task_struct *task;
> +	struct cpuset *cs;
> +	struct cgroup_subsys_state *pos_css;
> +
> +	/*
> +	 * Situation gets complex with overlapping nodemasks in different cpusets.
> +	 * TODO: Maybe we should calculate the "distance" between different mems_allowed.
> +	 *
> +	 * But for now, let's make it simple. Just iterate through all cpusets
> +	 * with the same mems_allowed as the current cpuset.
> +	 */
> +	cpuset_read_lock();
> +	rcu_read_lock();
> +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
> +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
> +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
> +			while (!ret && (task = css_task_iter_next(&it)))
> +				ret = fn(task, arg);
> +			css_task_iter_end(&it);
> +		}
> +	}
> +	rcu_read_unlock();
> +	cpuset_read_unlock();
> +	return ret;
> +}

I see this traverses all cpusets without the hierarchy actually
mattering that much. Wouldn't CONSTRAINT_CPUSET be better achieved by
globally (or per-memcg) scanning all processes and filtering with:
	nodes_intersect(current->mems_allowed, p->mems_allowed)
(`current` triggers the OOM, `p` is the iterated task)
?
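
For illustration, a minimal sketch of this alternative (hypothetical
code, not the patch's implementation; note the kernel's predicate is
actually spelled nodes_intersects()):

```c
/*
 * Hypothetical global scan for CONSTRAINT_CPUSET: walk every process
 * and skip tasks whose allowed nodes do not overlap with ours.
 * Locking and OOM bookkeeping as in select_bad_process().
 */
static void cpuset_constrained_scan(struct oom_control *oc)
{
	struct task_struct *p;

	rcu_read_lock();
	for_each_process(p) {
		/* 'current' triggers the OOM, 'p' is the iterated task. */
		if (!nodes_intersects(current->mems_allowed, p->mems_allowed))
			continue;
		if (oom_evaluate_task(p, oc))
			break;
	}
	rcu_read_unlock();
}
```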

Thanks,
Michal
Gang Li April 11, 2023, 1:04 p.m. UTC | #2
On 2023/4/11 20:23, Michal Koutný wrote:
> Hello.
> 
> On Tue, Apr 11, 2023 at 02:58:15PM +0800, Gang Li <ligang.bdlg@bytedance.com> wrote:
>> +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
>> +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
>> +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
>> +			while (!ret && (task = css_task_iter_next(&it)))
>> +				ret = fn(task, arg);
>> +			css_task_iter_end(&it);
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +	cpuset_read_unlock();
>> +	return ret;
>> +}
> 
> I see this traverses all cpusets without the hierarchy actually
> mattering that much. Wouldn't CONSTRAINT_CPUSET be better achieved by
> globally (or per-memcg) scanning all processes and filtering with:

Oh I see, you mean scanning all processes in all cpusets and scanning
all processes globally are equivalent.

> 	nodes_intersect(current->mems_allowed, p->mems_allowed)

Perhaps it would be better to use nodes_equal first, and if no suitable
victim is found, then downgrade to nodes_intersect? (A sketch follows
the example below.)

The NUMA balancing mechanism tends to keep memory on the same NUMA node, and
if the selected victim's memory happens to be on a node that does not
intersect with the current process's node, we still won't be able to
free up any memory.

In this example:

A->mems_allowed: 0,1
B->mems_allowed: 1,2
nodes_intersect(A->mems_allowed, B->mems_allowed) == true

Memory Distribution:
+=======+=======+=======+
| Node0 | Node1 | Node2 |
+=======+=======+=======+
| A     |       |       |
+-------+-------+-------+
|       |       |B      |
+-------+-------+-------+

Process A invokes the OOM killer, which then kills B. But A still
can't get any free memory on nodes 0 and 1.
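
A minimal sketch of that two-pass idea (hypothetical, not part of the
patch; for simplicity the fallback pass revisits the equal cpusets,
since equality implies intersection):

```c
/*
 * Hypothetical two-pass variant of cpuset_scan_tasks(): prefer cpusets
 * whose mems_allowed equals ours; only if no victim was found there,
 * downgrade to cpusets whose mems_allowed merely intersects ours.
 */
static int cpuset_scan_tasks_two_pass(int (*fn)(struct task_struct *, void *),
				      void *arg)
{
	struct cgroup_subsys_state *pos_css;
	struct css_task_iter it;
	struct task_struct *task;
	struct cpuset *cs;
	int pass, ret = 0;

	cpuset_read_lock();
	rcu_read_lock();
	for (pass = 0; pass < 2 && !ret; pass++) {
		cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
			nodemask_t *mine = &task_cs(current)->mems_allowed;
			bool match = pass ?
				nodes_intersects(cs->mems_allowed, *mine) :
				nodes_equal(cs->mems_allowed, *mine);

			if (!match)
				continue;
			css_task_iter_start(&cs->css, CSS_TASK_ITER_PROCS, &it);
			while (!ret && (task = css_task_iter_next(&it)))
				ret = fn(task, arg);
			css_task_iter_end(&it);
		}
	}
	rcu_read_unlock();
	cpuset_read_unlock();
	return ret;
}
```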

> (`current` triggers the OOM, `p` is the iterated task)
> ?
> 
> Thanks,
> Michal
Michal Hocko April 11, 2023, 1:12 p.m. UTC | #3
On Tue 11-04-23 21:04:18, Gang Li wrote:
> 
> 
> On 2023/4/11 20:23, Michal Koutný wrote:
> > Hello.
> > 
> > On Tue, Apr 11, 2023 at 02:58:15PM +0800, Gang Li <ligang.bdlg@bytedance.com> wrote:
> > > +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
> > > +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
> > > +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
> > > +			while (!ret && (task = css_task_iter_next(&it)))
> > > +				ret = fn(task, arg);
> > > +			css_task_iter_end(&it);
> > > +		}
> > > +	}
> > > +	rcu_read_unlock();
> > > +	cpuset_read_unlock();
> > > +	return ret;
> > > +}
> > 
> > I see this traverses all cpusets without the hierarchy actually
> > mattering that much. Wouldn't CONSTRAINT_CPUSET be better achieved by
> > globally (or per-memcg) scanning all processes and filtering with:
> 
> Oh I see, you mean scanning all processes in all cpusets and scanning
> all processes globally are equivalent.

Why can't you simply select a process from the cpuset the allocating
process belongs to? I thought the whole idea was to handle
well-partitioned workloads.

> > 	nodes_intersect(current->mems_allowed, p->mems_allowed)
> 
> Perhaps it would be better to use nodes_equal first, and if no suitable
> victim is found, then downgrade to nodes_intersect?

How can this happen?

> The NUMA balancing mechanism tends to keep memory on the same NUMA node, and
> if the selected victim's memory happens to be on a node that does not
> intersect with the current process's node, we still won't be able to
> free up any memory.

AFAIR NUMA balancing doesn't touch processes with memory policies.
Gang Li April 11, 2023, 1:17 p.m. UTC | #4
On 2023/4/11 21:12, Michal Hocko wrote:
> On Tue 11-04-23 21:04:18, Gang Li wrote:
>>
>>
>> On 2023/4/11 20:23, Michal Koutný wrote:
>>> Hello.
>>>
>>> On Tue, Apr 11, 2023 at 02:58:15PM +0800, Gang Li <ligang.bdlg@bytedance.com> wrote:
>>>> +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
>>>> +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
>>>> +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
>>>> +			while (!ret && (task = css_task_iter_next(&it)))
>>>> +				ret = fn(task, arg);
>>>> +			css_task_iter_end(&it);
>>>> +		}
>>>> +	}
>>>> +	rcu_read_unlock();
>>>> +	cpuset_read_unlock();
>>>> +	return ret;
>>>> +}
>>>
>>> I see this traverses all cpusets without the hierarchy actually
>>> mattering that much. Wouldn't CONSTRAINT_CPUSET be better achieved by
>>> globally (or per-memcg) scanning all processes and filtering with:
>>
>> Oh I see, you mean scanning all processes in all cpusets and scanning
>> all processes globally are equivalent.
> 
> Why can't you simply select a process from the cpuset the allocating
> process belongs to? I thought the whole idea was to handle
> well-partitioned workloads.
>

Yes I can :) It's much easier.

>>> 	nodes_intersect(current->mems_allowed, p->mems_allowed)
>>
>> Perhaps it would be better to use nodes_equal first, and if no suitable
>> victim is found, then downgrade to nodes_intersect?
> 
> How can this happen?
> 
>> The NUMA balancing mechanism tends to keep memory on the same NUMA node, and
>> if the selected victim's memory happens to be on a node that does not
>> intersect with the current process's node, we still won't be able to
>> free up any memory.
> 
> AFAIR NUMA balancing doesn't touch processes with memory policies.
Michal Hocko April 11, 2023, 2:36 p.m. UTC | #5
On Tue 11-04-23 14:58:15, Gang Li wrote:
> Cpusets constrain the CPU and memory placement of tasks. The
> `CONSTRAINT_CPUSET` type in the OOM killer has existed for a long
> time, but has never been utilized.
> 
> When a process in a cpuset that constrains memory placement triggers
> an OOM, the OOM killer may kill a completely irrelevant process on
> other NUMA nodes, which will not release any memory for this cpuset.
> 
> We can easily achieve node-aware OOM killing by using
> `CONSTRAINT_CPUSET` and selecting the victim from cpusets with the
> same mems_allowed as the current one.

I believe it still wouldn't hurt to be more specific here.
CONSTRAINT_CPUSET is rather obscure. Looking at this just makes my head
spin.
        /* Check this allocation failure is caused by cpuset's wall function */
        for_each_zone_zonelist_nodemask(zone, z, oc->zonelist, 
                        highest_zoneidx, oc->nodemask)
                if (!cpuset_zone_allowed(zone, oc->gfp_mask))
                        cpuset_limited = true;

Does this even work properly and why? prepare_alloc_pages sets
oc->nodemask to current->mems_allowed but the above gives us
cpuset_limited only if there is at least one zone/node that is not
oc->nodemask compatible. So it seems like this wouldn't ever get set
unless oc->nodemask got reset somewhere. This is a maze indeed. Is there
any reason why we cannot rely on __GFP_HARDWALL here? Or should we
instead rely on the fact that the nodemask should be the same as
current->mems_allowed?
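
For reference, this is the fast-path nodemask handling in
prepare_alloc_pages() that the above refers to (abridged from
mm/page_alloc.c of this era; comments mine):

```c
	if (cpusets_enabled()) {
		*alloc_gfp |= __GFP_HARDWALL;
		/*
		 * In task context with no explicit nodemask, fall back to
		 * the cpuset's mems for the fast-path attempt; otherwise
		 * enforce the cpuset via ALLOC_CPUSET.
		 */
		if (in_task() && !ac->nodemask)
			ac->nodemask = &cpuset_current_mems_allowed;
		else
			*alloc_flags |= ALLOC_CPUSET;
	}
```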

I do realize that this is not directly related to your patch, but
considering this has been mostly doing nothing, maybe we want to
document it better or even rework it on this occasion.

> Example:
> 
> Create two processes named mem_on_node0 and mem_on_node1, each
> constrained to its own node by a cpuset. Both processes allocate
> memory on their own node. When node0 runs out of memory, the OOM
> killer is invoked by mem_on_node0.

Don't you have an actual real-life example with a properly partitioned
system that clearly misbehaves and that this patch addresses?
Michal Koutný April 11, 2023, 3:08 p.m. UTC | #6
On Tue, Apr 11, 2023 at 03:12:34PM +0200, Michal Hocko <mhocko@suse.com> wrote:
> > Oh I see, you mean scanning all processes in all cpusets and scanning
> > all processes globally are equivalent.
> 
> Why cannot you simple select a process from the cpuset the allocating
> process belongs to? I thought the whole idea was to handle well
> partitioned workloads.

Ah, I was confused by the top_cpuset implementation.

The iteration should then start at
	nearest_hardwall_ancestor(task_cs(current))
(in the first approximation).

The nodes_intersect/nodes_subset/nodes_equal/whatnot heuristic is
secondary.
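
Concretely, a hypothetical variant of the patch's cpuset_scan_tasks()
rooted at the nearest hardwalled ancestor (nearest_hardwall_ancestor()
is an existing static helper in kernel/cgroup/cpuset.c; the mems
predicate is left out, per the first approximation):

```c
static int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
{
	struct cgroup_subsys_state *pos_css;
	struct css_task_iter it;
	struct task_struct *task;
	struct cpuset *cs, *root;
	int ret = 0;

	cpuset_read_lock();
	rcu_read_lock();
	/* Bound the scan by the hierarchy instead of walking top_cpuset. */
	root = nearest_hardwall_ancestor(task_cs(current));
	cpuset_for_each_descendant_pre(cs, pos_css, root) {
		css_task_iter_start(&cs->css, CSS_TASK_ITER_PROCS, &it);
		while (!ret && (task = css_task_iter_next(&it)))
			ret = fn(task, arg);
		css_task_iter_end(&it);
	}
	rcu_read_unlock();
	cpuset_read_unlock();
	return ret;
}
```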

HTH,
Michal
Gang Li Aug. 17, 2023, 8:40 a.m. UTC | #7
Apologies for the extremely delayed response. I was previously occupied
with work unrelated to the Linux kernel.

On 2023/4/11 22:36, Michal Hocko wrote:
> I believe it still wouldn't hurt to be more specific here.
> CONSTRAINT_CPUSET is rather obscure. Looking at this just makes my head
> spin.
>          /* Check this allocation failure is caused by cpuset's wall function */
>          for_each_zone_zonelist_nodemask(zone, z, oc->zonelist,
>                          highest_zoneidx, oc->nodemask)
>                  if (!cpuset_zone_allowed(zone, oc->gfp_mask))
>                          cpuset_limited = true;
> 
> Does this even work properly and why? prepare_alloc_pages sets
> oc->nodemask to current->mems_allowed but the above gives us
> cpuset_limited only if there is at least one zone/node that is not
> oc->nodemask compatible. So it seems like this wouldn't ever get set
> unless oc->nodemask got reset somewhere. This is a maze indeed. Is there

In __alloc_pages:
```
/*
  * Restore the original nodemask if it was potentially replaced with
  * &cpuset_current_mems_allowed to optimize the fast-path attempt.
  */
ac.nodemask = nodemask;
page = __alloc_pages_slowpath(alloc_gfp, order, &ac);

```

__alloc_pages sets ac.nodemask back to the original mempolicy nodemask
before calling __alloc_pages_slowpath.


> any reason why we cannot rely on __GFP_HARDWALL here? Or should we

In prepare_alloc_pages:
```
if (cpusets_enabled()) {
	*alloc_gfp |= __GFP_HARDWALL;
	...
}
```

Since __GFP_HARDWALL is set as long as cpusets are enabled, I think we can
use it to determine whether we are under a cpuset constraint.

But I have a question: why do we always set __GFP_HARDWALL when cpusets
are enabled, regardless of the value of cpuset.mem_hardwall?


Thanks,
Gang Li
Waiman Long Aug. 17, 2023, 4:45 p.m. UTC | #8
On 8/17/23 04:40, Gang Li wrote:
>
> Since __GFP_HARDWALL is set as long as cpusets are enabled, I think we can
> use it to determine whether we are under a cpuset constraint.
>
> But I have a question: why do we always set __GFP_HARDWALL when cpusets
> are enabled, regardless of the value of cpuset.mem_hardwall?

There is no direct dependency between cpuset.mem_hardwall and the
__GFP_HARDWALL flag. When cpusets are enabled, all user memory
allocations should be subject to the cpuset memory constraint. In the
case of non-user memory allocations, the search can fall back to the
node mask of an ancestor, up to the root cgroup, i.e. all memory nodes.
cpuset.mem_hardwall puts up a barrier to this upward search.

Note that cpuset.mem_hardwall is a v1 feature that is not available in 
cgroup v2.
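
A simplified sketch of that upward search (cf. __cpuset_node_allowed();
the interrupt, OOM-victim, and locking details are omitted here):

```c
/*
 * User allocations carry __GFP_HARDWALL and are confined to
 * current->mems_allowed. Kernel allocations may fall back to the
 * nearest hardwalled ancestor's mems, which is the root cpuset
 * (all memory nodes) unless cpuset.mem_hardwall intervenes.
 */
static bool node_allowed_sketch(int node, gfp_t gfp_mask)
{
	struct cpuset *cs;

	if (node_isset(node, current->mems_allowed))
		return true;
	if (gfp_mask & __GFP_HARDWALL)	/* hardwall request: stop here */
		return false;
	cs = nearest_hardwall_ancestor(task_cs(current));
	return node_isset(node, cs->mems_allowed);
}
```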

Cheers,
Longman
Gang Li Aug. 22, 2023, 6:31 a.m. UTC | #9
Hi,

On 2023/8/17 16:40, Gang Li wrote:
> On 2023/4/11 22:36, Michal Hocko wrote:
>> I believe it still wouldn't hurt to be more specific here.
>> CONSTRAINT_CPUSET is rather obscure. Looking at this just makes my head
>> spin.
>>          /* Check this allocation failure is caused by cpuset's wall 
>> function */
>>          for_each_zone_zonelist_nodemask(zone, z, oc->zonelist,
>>                          highest_zoneidx, oc->nodemask)
>>                  if (!cpuset_zone_allowed(zone, oc->gfp_mask))
>>                          cpuset_limited = true;
>> 
>> Does this even work properly and why? prepare_alloc_pages sets
>> oc->nodemask to current->mems_allowed but the above gives us
>> cpuset_limited only if there is at least one zone/node that is not
>> oc->nodemask compatible. So it seems like this wouldn't ever get set
>> unless oc->nodemask got reset somewhere. This is a maze indeed.Is there
> 
> In __alloc_pages:
> ```
> /*
>   * Restore the original nodemask if it was potentially replaced with
>   * &cpuset_current_mems_allowed to optimize the fast-path attempt.
>   */
> ac.nodemask = nodemask;
> page = __alloc_pages_slowpath(alloc_gfp, order, &ac);
> 
> ```
> 
> __alloc_pages sets ac.nodemask back to the original mempolicy nodemask
> before calling __alloc_pages_slowpath.
> 
> 
>> any reason why we cannot rely on __GFP_HARDWALL here? Or should we
> 
> In prepare_alloc_pages:
> ```
> if (cpusets_enabled()) {
>      *alloc_gfp |= __GFP_HARDWALL;
>      ...
> }
> ```
> 
> Since __GFP_HARDWALL is set as long as cpusets are enabled, I think we can
> use it to determine whether we are under a cpuset constraint.
> 

We have two nodemasks: one from the parameters of __alloc_pages and
another from the cpuset. If a node allowed by the __alloc_pages
nodemask is not allowed by the cpuset, this page allocation is
constrained by the cpuset, and thus CONSTRAINT_CPUSET can be returned.

I guess this piece of code is reasonable and we can keep it as it is.
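
For completeness, the full check in constrained_alloc() (lightly
abridged from mm/oom_kill.c):

```c
	/* Check if this allocation failure is caused by cpuset's wall function. */
	for_each_zone_zonelist_nodemask(zone, z, oc->zonelist,
			highest_zoneidx, oc->nodemask)
		if (!cpuset_zone_allowed(zone, oc->gfp_mask))
			cpuset_limited = true;

	if (cpuset_limited) {
		oc->totalpages = total_swap_pages;
		for_each_node_mask(nid, cpuset_current_mems_allowed)
			oc->totalpages += node_present_pages(nid);
		return CONSTRAINT_CPUSET;
	}
```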

Thanks,
Gang Li.

Patch

diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
index 5d844ed4df69..51ffdc0eb167 100644
--- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -25,7 +25,8 @@  Written by Simon.Derr@bull.net
      1.6 What is memory spread ?
      1.7 What is sched_load_balance ?
      1.8 What is sched_relax_domain_level ?
-     1.9 How do I use cpusets ?
+     1.9 What is cpuset oom ?
+     1.10 How do I use cpusets ?
    2. Usage Examples and Syntax
      2.1 Basic Usage
      2.2 Adding/removing cpus
@@ -607,8 +608,19 @@  If your situation is:
  - The latency is required even it sacrifices cache hit rate etc.
    then increasing 'sched_relax_domain_level' would benefit you.
 
+1.9 What is cpuset oom ?
+--------------------------
+If there is no available memory to allocate on the nodes specified by
+cpuset.mems, the OOM (Out-Of-Memory) killer will be invoked.
+
+Since the victim selection is a heuristic algorithm, we cannot select
+the "perfect" victim. Therefore, currently, the victim will be selected
+from all the cpusets that have the same mems_allowed as the cpuset
+which invoked OOM.
+
+Cpuset oom works in both cgroup v1 and v2.
 
-1.9 How do I use cpusets ?
+1.10 How do I use cpusets ?
 --------------------------
 
 In order to minimize the impact of cpusets on critical kernel
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index f67c0829350b..594aa71cf441 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2199,6 +2199,10 @@  Cpuset Interface Files
 	a need to change "cpuset.mems" with active tasks, it shouldn't
 	be done frequently.
 
+	When a process invokes oom due to the constraint of cpuset.mems,
+	the victim will be selected from cpusets with the same
+	mems_allowed as the current one.
+
   cpuset.mems.effective
 	A read-only multiple values file which exists on all
 	cpuset-enabled cgroups.
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 980b76a1237e..75465bf58f74 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -171,6 +171,8 @@  static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
+int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -287,6 +289,10 @@  static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
+static inline int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	return 0;
+}
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index bc4dcfd7bee5..cb6b49245e18 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4013,6 +4013,49 @@  void cpuset_print_current_mems_allowed(void)
 	rcu_read_unlock();
 }
 
+/**
+ * cpuset_scan_tasks - specify the oom scan range
+ * @fn: callback function to select oom victim
+ * @arg: argument for callback function, usually a pointer to struct oom_control
+ *
+ * Description: This function is used to specify the oom scan range. Return 0 if
+ * no task is selected, otherwise return 1. The selected task will be stored in
+ * arg->chosen. This function can only be called in cpuset oom context.
+ *
+ * The selection algorithm is heuristic, therefore requires constant iteration
+ * based on user feedback. Currently, we just iterate through all cpusets with
+ * the same mems_allowed as the current cpuset.
+ */
+int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	int ret = 0;
+	struct css_task_iter it;
+	struct task_struct *task;
+	struct cpuset *cs;
+	struct cgroup_subsys_state *pos_css;
+
+	/*
+	 * Situation gets complex with overlapping nodemasks in different cpusets.
+	 * TODO: Maybe we should calculate the "distance" between different mems_allowed.
+	 *
+	 * But for now, let's make it simple. Just iterate through all cpusets
+	 * with the same mems_allowed as the current cpuset.
+	 */
+	cpuset_read_lock();
+	rcu_read_lock();
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
+			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
+			while (!ret && (task = css_task_iter_next(&it)))
+				ret = fn(task, arg);
+			css_task_iter_end(&it);
+		}
+	}
+	rcu_read_unlock();
+	cpuset_read_unlock();
+	return ret;
+}
+
 /*
  * Collection of memory_pressure is suppressed unless
  * this flag is enabled by writing "1" to the special
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 044e1eed720e..228257788d9e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -367,6 +367,8 @@  static void select_bad_process(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_scan_tasks(oom_evaluate_task, oc);
 	else {
 		struct task_struct *p;
 
@@ -427,6 +429,8 @@  static void dump_tasks(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_scan_tasks(dump_task, oc);
 	else {
 		struct task_struct *p;