[v3] mm/page_alloc: detect allocation forbidden by cpuset and bail out early

Message ID 1631590828-25565-1-git-send-email-feng.tang@intel.com (mailing list archive)
State New

Commit Message

Feng Tang Sept. 14, 2021, 3:40 a.m. UTC

There was a report that starting an Ubuntu container in docker while using
cpuset to bind it to movable nodes (a node that only has a movable zone,
like a node for hotplug or a Persistent Memory node in normal usage) fails
due to memory allocation failure, after which the OOM killer gets involved
and many other innocent processes are killed. It can be reproduced with
the command (node 4 is a movable node):

  $ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c \
      "grep Mems_allowed /proc/self/status"

  runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
  CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
  Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
  Call Trace:
   dump_stack+0x6b/0x88
   dump_header+0x4a/0x1e2
   oom_kill_process.cold+0xb/0x10
   out_of_memory.part.0+0xaf/0x230
   out_of_memory+0x3d/0x80
   __alloc_pages_slowpath.constprop.0+0x954/0xa20
   __alloc_pages_nodemask+0x2d3/0x300
   pipe_write+0x322/0x590
   new_sync_write+0x196/0x1b0
   vfs_write+0x1c3/0x1f0
   ksys_write+0xa7/0xe0
   do_syscall_64+0x52/0xd0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Mem-Info:
  active_anon:392832 inactive_anon:182 isolated_anon:0
   active_file:68130 inactive_file:151527 isolated_file:0
   unevictable:2701 dirty:0 writeback:7
   slab_reclaimable:51418 slab_unreclaimable:116300
   mapped:45825 shmem:735 pagetables:2540 bounce:0
   free:159849484 free_pcp:73 free_cma:0
  Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
  Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB
  lowmem_reserve[]: 0 0 0 0 0
  Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB

  oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0
  Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0
  oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB
  oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The reason is that, in this case, the target cpuset nodes only have a
movable zone, while the creation of an OS in docker sometimes needs to
allocate memory from non-movable zones (dma/dma32/normal), e.g. with
GFP_HIGHUSER. The cpuset limit forbids such an allocation, so
out-of-memory killing gets involved even though both the normal nodes
and the movable nodes have plenty of free memory.

The OOM killer cannot help to resolve the situation, as there is no
usable memory for the request within the cpuset scope. The only
reasonable measure to take is to fail the allocation right away and let
the caller deal with it.

So add a check for cases like this in the slowpath of allocation, and
bail out early returning NULL for the allocation.

As page allocation is one of the hottest paths in the kernel, an
unconditional check would hurt all users with a sane cpuset
configuration. So add a static branch check, and detect the abnormal
config at cpuset memory binding setup, so that the extra check in page
allocation is not paid by everyone.
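
The two-stage idea above can be sketched in user space as follows. This
is only an illustration under stated assumptions, not kernel code: all
model_* names and the ZK_* zone kinds are hypothetical, a plain bool
stands in for the static key, and each node is reduced to "has normal
memory" and "has movable memory".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
enum zone_kind { ZK_NORMAL, ZK_MOVABLE, ZK_COUNT };

#define MODEL_MAX_NODES 8

struct model_node {
	bool has_zone[ZK_COUNT];	/* which zone kinds are populated */
};

/* Example topology: node 0 has normal memory, node 4 is movable-only. */
static const struct model_node model_nodes[MODEL_MAX_NODES] = {
	[0] = { .has_zone = { [ZK_NORMAL] = true } },
	[4] = { .has_zone = { [ZK_MOVABLE] = true } },
};

/* Stand-in for the static key: flipped once, at configuration time. */
static bool model_insane_config;

/* True if some node in 'mask' has a populated zone at or below 'highest'. */
static bool model_has_usable_zone(const struct model_node *nodes,
				  unsigned int mask, enum zone_kind highest)
{
	assert(highest < ZK_COUNT);
	for (int nid = 0; nid < MODEL_MAX_NODES; nid++) {
		if (!(mask & (1u << nid)))
			continue;
		for (int k = 0; k <= (int)highest; k++)
			if (nodes[nid].has_zone[k])
				return true;
	}
	return false;
}

/* Configuration time (rare): flag a non-empty, movable-only binding. */
static void model_bind_mems(const struct model_node *nodes, unsigned int mask)
{
	if (!model_insane_config && mask &&
	    !model_has_usable_zone(nodes, mask, ZK_NORMAL))
		model_insane_config = true;
}

/*
 * Allocation time (hot): the zone lookup only runs once the flag is set.
 * Returns false when the request should fail early instead of reaching
 * the OOM killer.
 */
static bool model_alloc_may_proceed(const struct model_node *nodes,
				    unsigned int mask, enum zone_kind highest)
{
	if (model_insane_config &&
	    !model_has_usable_zone(nodes, mask, highest))
		return false;
	return true;
}
```

In this model, a request capped at ZK_NORMAL against the node 4 binding
is refused once the flag is set, while a request that may use ZK_MOVABLE
still proceeds. Bindings that include node 0 never set the flag, so they
never pay for the lookup.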

[thanks to Michal Hocko and David Rientjes for suggesting not to handle
 it inside the OOM code, adding the cpuset check, and refining comments]

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
Changelog:
  
  v3:
  * refine the movable_only_nodes() and the nodemask check
    in cpuset code (Michal Hocko)
  * fix a compiling problem (0day test robot)

  v2:
  * add a static branch detection in cpuset code to reduce
    the overhead in allocation hotpath (Michal Hocko)

  v1 (since RFC):
  * move the handling from oom code to page allocation 
    path (Michal/David)

 include/linux/cpuset.h | 17 +++++++++++++++++
 include/linux/mmzone.h | 16 ++++++++++++++++
 kernel/cgroup/cpuset.c | 15 +++++++++++++++
 mm/page_alloc.c        | 13 +++++++++++++
 4 files changed, 61 insertions(+)

Comments

Vlastimil Babka Sept. 14, 2021, 8:01 a.m. UTC | #1
On 9/14/21 05:40, Feng Tang wrote:

...

> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 6a1d79d..a455333 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1220,6 +1220,22 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
>  #define for_each_zone_zonelist(zone, z, zlist, highidx) \
>  	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
>  
> +/* Whether the 'nodes' are all movable nodes */
> +static inline bool movable_only_nodes(nodemask_t *nodes)
> +{
> +	struct zonelist *zonelist;
> +	struct zoneref *z;
> +
> +	if (nodes_empty(*nodes))
> +		return false;
> +
> +	zonelist =
> +	    &NODE_DATA(first_node(*nodes))->node_zonelists[ZONELIST_FALLBACK];
> +	z = first_zones_zonelist(zonelist, ZONE_NORMAL,	nodes);
> +	return (!z->zone) ? true : false;
> +}

Hmm, could all that become just this?

!nodes_intersects(node_states[N_NORMAL_MEMORY], *nodes)
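
For comparison, the two formulations can be modelled on plain bitmasks.
This is a user-space sketch with hypothetical sketch_* names, assuming
that bit n of normal_nodes is set when node n has memory usable for a
ZONE_NORMAL request. It is not the kernel implementation of either
helper.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_NODES 8

/* Hypothetical model: bit n set means node n is in the mask. */
typedef unsigned int sketch_nodemask;

/* Formulation 1: mask intersection, in the spirit of nodes_intersects(). */
static bool sketch_movable_only_by_mask(sketch_nodemask normal_nodes,
					sketch_nodemask nodes)
{
	/* The explicit empty check mirrors nodes_empty() in the patch. */
	return nodes != 0 && (normal_nodes & nodes) == 0;
}

/*
 * Formulation 2: walk the candidate nodes and stop at the first one that
 * can serve the request, in the spirit of first_zones_zonelist().
 */
static bool sketch_movable_only_by_walk(sketch_nodemask normal_nodes,
					sketch_nodemask nodes)
{
	if (nodes == 0)
		return false;
	for (int nid = 0; nid < SKETCH_MAX_NODES; nid++) {
		bool allowed = nodes & (1u << nid);
		bool has_normal = normal_nodes & (1u << nid);

		if (allowed && has_normal)
			return false;	/* found a usable zone */
	}
	return true;
}
```

Under these assumptions both variants agree for every mask. One
difference worth keeping in mind: the bare one-liner without the empty
check would report an empty nodemask as movable-only, whereas the patch
returns false for it.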

> +
> +
>  #ifdef CONFIG_SPARSEMEM
>  #include <asm/sparsemem.h>
>  #endif
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index df1ccf4..7fa633e 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -69,6 +69,13 @@
>  DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
>  DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
>  
> +/*
> + * There could be abnormal cpuset configurations for cpu or memory
> + * node binding, add this key to provide a quick low-cost judgement
> + * of the situation.
> + */
> +DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key);
> +
>  /* See "Frequency meter" comments, below. */
>  
>  struct fmeter {
> @@ -1868,6 +1875,14 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
>  	if (retval < 0)
>  		goto done;
>  
> +	if (!cpusets_insane_config() &&
> +		movable_only_nodes(&trialcs->mems_allowed)) {
> +		static_branch_enable(&cpusets_insane_config_key);
> +		pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! "
> +			"Cpuset allocations might fail even with a lot of memory available.\n",
> +			nodemask_pr_args(&trialcs->mems_allowed));
> +	}
> +
>  	spin_lock_irq(&callback_lock);
>  	cs->mems_allowed = trialcs->mems_allowed;
>  	spin_unlock_irq(&callback_lock);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b37435c..a7e0854 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4914,6 +4914,19 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (!ac->preferred_zoneref->zone)
>  		goto nopage;
>  
> +	/*
> +	 * Check for insane configurations where the cpuset doesn't contain
> +	 * any suitable zone to satisfy the request - e.g. non-movable
> +	 * GFP_HIGHUSER allocations from MOVABLE nodes only.
> +	 */
> +	if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
> +		struct zoneref *z = first_zones_zonelist(ac->zonelist,
> +					ac->highest_zoneidx,
> +					&cpuset_current_mems_allowed);
> +		if (!z->zone)
> +			goto nopage;
> +	}
> +
>  	if (alloc_flags & ALLOC_KSWAPD)
>  		wake_all_kswapds(order, gfp_mask, ac);
>  
>

Michal Hocko Sept. 14, 2021, 8:17 a.m. UTC | #2
On Tue 14-09-21 10:01:26, Vlastimil Babka wrote:
> On 9/14/21 05:40, Feng Tang wrote:
[...]
> > +/* Whether the 'nodes' are all movable nodes */
> > +static inline bool movable_only_nodes(nodemask_t *nodes)
> > +{
> > +	struct zonelist *zonelist;
> > +	struct zoneref *z;
> > +
> > +	if (nodes_empty(*nodes))
> > +		return false;
> > +
> > +	zonelist =
> > +	    &NODE_DATA(first_node(*nodes))->node_zonelists[ZONELIST_FALLBACK];
> > +	z = first_zones_zonelist(zonelist, ZONE_NORMAL,	nodes);
> > +	return (!z->zone) ? true : false;
> > +}
> 
> Hmm, could all that become just this?
> 
> !nodes_intersects(node_states[N_NORMAL_MEMORY], *nodes)

Maybe yes, but I find the zonelist approach much easier to follow even
though the code looks more complex at first sight. It talks about an
empty zonelist for a ZONE_NORMAL request, which is quite clear from the
scribble. I always have to re-learn how N_*_MEMORY works TBH. Maybe
this is just me though.

Michal Hocko Sept. 14, 2021, 8:50 a.m. UTC | #3
On Tue 14-09-21 11:40:28, Feng Tang wrote:
> [...]

Acked-by: Michal Hocko <mhocko@suse.com>

Minor nit below
[...]
> +/* Whether the 'nodes' are all movable nodes */
> +static inline bool movable_only_nodes(nodemask_t *nodes)
> +{
> +	struct zonelist *zonelist;
> +	struct zoneref *z;
> +
> +	if (nodes_empty(*nodes))
> +		return false;
> +
> +	zonelist =
> +	    &NODE_DATA(first_node(*nodes))->node_zonelists[ZONELIST_FALLBACK];
> +	z = first_zones_zonelist(zonelist, ZONE_NORMAL,	nodes);
> +	return (!z->zone) ? true : false;

This would read easier to me:
	/*
	 * We can choose an arbitrary node from the nodemask to get a
	 * zonelist as they are interlinked. We just need to find
	 * at least one zone that can satisfy kernel allocations.
	 */
	nid = first_node(*nodes);
	zonelist = node_zonelist(nid, GFP_KERNEL);
	z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes);

David Rientjes Sept. 15, 2021, 12:30 a.m. UTC | #4
On Tue, 14 Sep 2021, Feng Tang wrote:

> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index d2b9c41..d58e047 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -34,6 +34,8 @@
>   */
>  extern struct static_key_false cpusets_pre_enable_key;
>  extern struct static_key_false cpusets_enabled_key;
> +extern struct static_key_false cpusets_insane_config_key;
> +
>  static inline bool cpusets_enabled(void)
>  {
>  	return static_branch_unlikely(&cpusets_enabled_key);
> @@ -51,6 +53,19 @@ static inline void cpuset_dec(void)
>  	static_branch_dec_cpuslocked(&cpusets_pre_enable_key);
>  }
>  
> +/*
> + * This will get enabled whenever a cpuset configuration is considered
> + * unsupportable in general. E.g. movable only node which cannot satisfy
> + * any non movable allocations (see update_nodemask). Page allocator
> + * needs to make additional checks for those configurations and this
> + * check is meant to guard those checks without any overhead for sane
> + * configurations.
> + */
> +static inline bool cpusets_insane_config(void)
> +{
> +	return static_branch_unlikely(&cpusets_insane_config_key);
> +}
> +
>  extern int cpuset_init(void);
>  extern void cpuset_init_smp(void);
>  extern void cpuset_force_rebuild(void);
> @@ -167,6 +182,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>  
>  static inline bool cpusets_enabled(void) { return false; }
>  
> +static inline bool cpusets_insane_config(void) { return false; }
> +
>  static inline int cpuset_init(void) { return 0; }
>  static inline void cpuset_init_smp(void) {}
>  
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 6a1d79d..a455333 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1220,6 +1220,22 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
>  #define for_each_zone_zonelist(zone, z, zlist, highidx) \
>  	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
>  
> +/* Whether the 'nodes' are all movable nodes */
> +static inline bool movable_only_nodes(nodemask_t *nodes)
> +{
> +	struct zonelist *zonelist;
> +	struct zoneref *z;
> +
> +	if (nodes_empty(*nodes))
> +		return false;
> +
> +	zonelist =
> +	    &NODE_DATA(first_node(*nodes))->node_zonelists[ZONELIST_FALLBACK];
> +	z = first_zones_zonelist(zonelist, ZONE_NORMAL,	nodes);
> +	return (!z->zone) ? true : false;
> +}
> +
> +
>  #ifdef CONFIG_SPARSEMEM
>  #include <asm/sparsemem.h>
>  #endif
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index df1ccf4..7fa633e 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -69,6 +69,13 @@
>  DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
>  DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
>  
> +/*
> + * There could be abnormal cpuset configurations for cpu or memory
> + * node binding, add this key to provide a quick low-cost judgement
> + * of the situation.
> + */
> +DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key);
> +
>  /* See "Frequency meter" comments, below. */
>  
>  struct fmeter {
> @@ -1868,6 +1875,14 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
>  	if (retval < 0)
>  		goto done;
>  
> +	if (!cpusets_insane_config() &&
> +		movable_only_nodes(&trialcs->mems_allowed)) {
> +		static_branch_enable(&cpusets_insane_config_key);
> +		pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! "
> +			"Cpuset allocations might fail even with a lot of memory available.\n",
> +			nodemask_pr_args(&trialcs->mems_allowed));
> +	}
> +
>  	spin_lock_irq(&callback_lock);
>  	cs->mems_allowed = trialcs->mems_allowed;
>  	spin_unlock_irq(&callback_lock);

Is this the only time that the state of the nodemask may change?

I'm wondering about a single node nodemask, for example, where all 
ZONE_NORMAL memory is hot-removed.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b37435c..a7e0854 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4914,6 +4914,19 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (!ac->preferred_zoneref->zone)
>  		goto nopage;
>  
> +	/*
> +	 * Check for insane configurations where the cpuset doesn't contain
> +	 * any suitable zone to satisfy the request - e.g. non-movable
> +	 * GFP_HIGHUSER allocations from MOVABLE nodes only.
> +	 */
> +	if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
> +		struct zoneref *z = first_zones_zonelist(ac->zonelist,
> +					ac->highest_zoneidx,
> +					&cpuset_current_mems_allowed);
> +		if (!z->zone)
> +			goto nopage;
> +	}
> +
>  	if (alloc_flags & ALLOC_KSWAPD)
>  		wake_all_kswapds(order, gfp_mask, ac);
>

Feng Tang Sept. 15, 2021, 5:32 a.m. UTC | #5
On Tue, Sep 14, 2021 at 05:30:03PM -0700, David Rientjes wrote:
> On Tue, 14 Sep 2021, Feng Tang wrote:
> 
> > [...]
> >  	spin_lock_irq(&callback_lock);
> >  	cs->mems_allowed = trialcs->mems_allowed;
> >  	spin_unlock_irq(&callback_lock);
> 
> Is this the only time that the state of the nodemask may change?
> 
> I'm wondering about a single node nodemask, for example, where all 
> ZONE_NORMAL memory is hot-removed.

Thanks for the reminder! Yes, memory hot-remove can change the
cpuset's effective nodemask, so we may need to add a similar check inside
cpuset_hotplug_update_tasks(), which is called by cpuset_hotplug_workfn(),
something like below?

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 7fa633e..d5f6776 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3186,6 +3186,14 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 	cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
 	mems_updated = !nodes_equal(new_mems, cs->effective_mems);
 
+	if (mems_updated && !cpusets_insane_config() &&
+		movable_only_nodes(&new_mems)) {
+		static_branch_enable(&cpusets_insane_config_key);
+		pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl) after memory hotplug. "
+			"Cpuset allocations might fail even with a lot of memory available.\n",
+			nodemask_pr_args(&new_mems));
+	}
+
 	if (is_in_v2_mode())
 		hotplug_update_tasks(cs, &new_cpus, &new_mems,
 				     cpus_updated, mems_updated);
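
The scenario raised above can be modelled in user space as follows. This
is only a sketch with hypothetical demo_* names, assuming one bit per
node and a plain bool in place of the static key. Note that it re-runs
the check unconditionally after a hot-remove, whereas the draft hunk
above only does so when mems_updated is set.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int demo_nodemask;	/* bit n = node n */

struct demo_state {
	demo_nodemask normal_nodes;	/* nodes that currently have normal memory */
	bool insane_config;		/* stand-in for cpusets_insane_config_key */
};

/* Shared check: flag (once) a non-empty mask without any normal memory. */
static void demo_check_mems(struct demo_state *st, demo_nodemask nodes)
{
	if (!st->insane_config && nodes && !(st->normal_nodes & nodes))
		st->insane_config = true;
}

/* Path 1: the user writes a new memory binding. */
static void demo_update_nodemask(struct demo_state *st,
				 demo_nodemask requested)
{
	demo_check_mems(st, requested);
}

/*
 * Path 2: all normal memory of node 'nid' is hot-removed. The binding
 * itself ('effective') is unchanged, but it has to be checked again.
 */
static void demo_hot_remove_normal(struct demo_state *st, int nid,
				   demo_nodemask effective)
{
	assert(nid >= 0 && nid < 32);
	st->normal_nodes &= ~(1u << nid);
	demo_check_mems(st, effective);
}
```

With nodes 0 and 1 both holding normal memory, binding to node 1 is
fine at first. After the normal memory of node 1 is removed, the same
binding becomes movable-only and the flag is set by the second path.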

Thanks,
Feng

Michal Hocko Sept. 15, 2021, 11:30 a.m. UTC | #6
On Wed 15-09-21 13:32:47, Feng Tang wrote:
> On Tue, Sep 14, 2021 at 05:30:03PM -0700, David Rientjes wrote:
[...]
> > I'm wondering about a single node nodemask, for example, where all 
> > ZONE_NORMAL memory is hot-removed.

While this is theoretically possible, it is highly unlikely to happen.
With non-movable memory, it takes just one kernel allocation to prevent
a hot-remove operation from finishing. I have to say I was not aware of
the hotplug callback. It all seems rather suspicious. I will have a look.

Anyway, this is something worth having covered "just in case". Thanks for
pointing it out.
 
> Thanks for the reminder! Yes, memory hot-remove can change the
> cpuset's effective nodemask, so we may need to add a similar check inside
> cpuset_hotplug_update_tasks(), which is called by cpuset_hotplug_workfn(),
> something like below?
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 7fa633e..d5f6776 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3186,6 +3186,14 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
>  	cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
>  	mems_updated = !nodes_equal(new_mems, cs->effective_mems);
>  
> +	if (mems_updated && !cpusets_insane_config() &&
> +		movable_only_nodes(&new_mems)) {
> +		static_branch_enable(&cpusets_insane_config_key);
> +		pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl) after memory hotplug. "
> +			"Cpuset allocations might fail even with a lot of memory available.\n",
> +			nodemask_pr_args(&new_mems));
> +	}

Please create a helper rather than two copies of the same. Thanks!
> +
>  	if (is_in_v2_mode())
>  		hotplug_update_tasks(cs, &new_cpus, &new_mems,
>  				     cpus_updated, mems_updated);
> 
> Thanks,
> Feng

Feng Tang Sept. 16, 2021, 8:11 a.m. UTC | #7
On Wed, Sep 15, 2021 at 01:30:27PM +0200, Michal Hocko wrote:
> On Wed 15-09-21 13:32:47, Feng Tang wrote:
> > On Tue, Sep 14, 2021 at 05:30:03PM -0700, David Rientjes wrote:
> [...]
> > > I'm wondering about a single node nodemask, for example, where all 
> > > ZONE_NORMAL memory is hot-removed.
> 
> While this is theoretically possible it is highly unlikely to happen.
> Non movable memory just takes one kernel allocation to prevent any
> hotremove operation to finish. I have to say I was not aware of the
> hotplug callback. It all seems rather suspicious. I will have a look.
> 
> Anyway something worth having covered "just in case". Thanks for
> pointing it out.
>  
> > Thanks for the reminder! Yes, memory hot remove can change the
> > cpuset's effective nodemask, so we may need to add a similar check inside
> > cpuset_hotplug_update_tasks(), which is called by cpuset_hotplug_workfn(),
> > something like below?
> > 
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index 7fa633e..d5f6776 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -3186,6 +3186,14 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
> >  	cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
> >  	mems_updated = !nodes_equal(new_mems, cs->effective_mems);
> >  
> > +	if (mems_updated && !cpusets_insane_config() &&
> > +		movable_only_nodes(&new_mems)) {
> > +		static_branch_enable(&cpusets_insane_config_key);
> > +		pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl) after memory hotplug. "
> > +			"Cpuset allocations might fail even with a lot of memory available.\n",
> > +			nodemask_pr_args(&new_mems));
> > +	}
> 
> Please create a helper rather than two copies of the same. Thanks!

Sure. Some draft add-on patch below.

Thanks,
Feng

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 7fa633e..3bb9f4ea 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -391,6 +391,18 @@ static inline bool is_in_v2_mode(void)
 	      (cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
 }
 
+static inline void check_insane_mems_config(nodemask_t *nodes)
+{
+	if (!cpusets_insane_config() &&
+		movable_only_nodes(nodes)) {
+		static_branch_enable(&cpusets_insane_config_key);
+		pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! "
+			"Cpuset allocations might fail even with a lot of memory available.\n",
+			nodemask_pr_args(nodes));
+	}
+}
+
 /*
  * Return in pmask the portion of a task's cpusets's cpus_allowed that
  * are online and are capable of running the task.  If none are found,
@@ -1875,13 +1887,7 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		goto done;
 
-	if (!cpusets_insane_config() &&
-		movable_only_nodes(&trialcs->mems_allowed)) {
-		static_branch_enable(&cpusets_insane_config_key);
-		pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! "
-			"Cpuset allocations might fail even with a lot of memory available.\n",
-			nodemask_pr_args(&trialcs->mems_allowed));
-	}
+	check_insane_mems_config(&trialcs->mems_allowed);
 
 	spin_lock_irq(&callback_lock);
 	cs->mems_allowed = trialcs->mems_allowed;
@@ -3186,6 +3192,9 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 	cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
 	mems_updated = !nodes_equal(new_mems, cs->effective_mems);
 
+	if (mems_updated)
+		check_insane_mems_config(&new_mems);
+
 	if (is_in_v2_mode())
 		hotplug_update_tasks(cs, &new_cpus, &new_mems,
 				     cpus_updated, mems_updated);
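
For anyone reading along without a kernel tree at hand, the latch-and-warn-once
behaviour of the helper above can be sketched in user space. This is only a
model under loud assumptions: a plain bool stands in for the static key, node
masks are unsigned long bitmaps, and the names insane_config, warnings_printed,
kernel_capable, movable_only_model() and check_insane_mems_config_model() are
made up for illustration and are not kernel APIs.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for cpusets_insane_config_key (a static key in the kernel). */
static bool insane_config;

/* Counts how often the warning fired, so the one-shot behaviour is observable. */
static int warnings_printed;

/*
 * Model of movable_only_nodes(): 'nodes' is a bitmap of the nodes in the
 * cpuset, 'kernel_capable' a bitmap of the nodes that have at least one
 * zone usable for kernel (non-movable) allocations.
 */
static bool movable_only_model(unsigned long nodes, unsigned long kernel_capable)
{
	if (nodes == 0)
		return false;	/* an empty mask is not reported as insane */
	return (nodes & kernel_capable) == 0;
}

/* Latch the flag and warn on the first movable-only mask; later calls do nothing. */
static void check_insane_mems_config_model(unsigned long nodes,
					   unsigned long kernel_capable)
{
	if (!insane_config && movable_only_model(nodes, kernel_capable)) {
		insane_config = true;
		warnings_printed++;
		printf("movable-only cpuset config detected (nmask=0x%lx)\n", nodes);
	}
}
```

Both call sites (the nodemask update and the hotplug path) would go through the
same function, so the flag is set and the message printed at most once no
matter which path sees the bad mask first.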
Feng Tang Sept. 24, 2021, 6:10 a.m. UTC | #8
Hi Michal,

On Tue, Sep 14, 2021 at 10:50:42AM +0200, Michal Hocko wrote:
> On Tue 14-09-21 11:40:28, Feng Tang wrote:
[SNIP]
> > The OOM killer cannot help to resolve the situation as there is no
> > usable memory for the request in the cpuset scope. The only reasonable
> > measure to take is to fail the allocation right away and have the caller
> > deal with it.
> > 
> > So add a check for cases like this in the slowpath of allocation, and
> > bail out early, returning NULL for the allocation.
> > 
> > As page allocation is one of the hottest paths in the kernel, this check
> > would hurt all users with a sane cpuset configuration. So add a static
> > branch check and detect the abnormal config in the cpuset memory binding
> > setup, so that the extra check in page allocation is not paid by everyone.
> > 
> > [thanks to Michal Hocko and David Rientjes for suggesting not to handle
> >  it inside the OOM code, adding the cpuset check, and refining comments]
> > 
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
 
Thank you!

> Minor nit below
> [...]
> > +/* Whether the 'nodes' are all movable nodes */
> > +static inline bool movable_only_nodes(nodemask_t *nodes)
> > +{
> > +	struct zonelist *zonelist;
> > +	struct zoneref *z;
> > +
> > +	if (nodes_empty(*nodes))
> > +		return false;
> > +
> > +	zonelist =
> > +	    &NODE_DATA(first_node(*nodes))->node_zonelists[ZONELIST_FALLBACK];
> > +	z = first_zones_zonelist(zonelist, ZONE_NORMAL,	nodes);
> > +	return (!z->zone) ? true : false;
> 
> This would read easier to me
> 	/*
> 	 * We can choose an arbitrary node from the nodemask to get a
> 	 * zonelist as they are interlinked. We just need to find
> 	 * at least one zone that can satisfy kernel allocations.
> 	 */
> 	node = NODE_DATA(first_node(*nodes));
> 	zonelist = node_zonelist(node, GFP_KERNEL);
> 	z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes);

When working on the v4 patch, I saw compile warnings that
'node_zonelist()' and 'GFP_KERNEL' are implicitly declared or
undeclared, as they come from "gfp.h".

So we may need to move this function to gfp.h, or keep the
current code with a slight modification?

	nid = first_node(*nodes);
	zonelist = &NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK];
	z = first_zones_zonelist(zonelist, ZONE_NORMAL,	nodes);
	return (!z->zone) ? true : false;

Thanks,
Feng
Michal Hocko Sept. 24, 2021, 7:17 a.m. UTC | #9
On Fri 24-09-21 14:10:54, Feng Tang wrote:
> Hi Michal,
> 
> On Tue, Sep 14, 2021 at 10:50:42AM +0200, Michal Hocko wrote:
> > On Tue 14-09-21 11:40:28, Feng Tang wrote:
> [SNIP]
> > > The OOM killer cannot help to resolve the situation as there is no
> > > usable memory for the request in the cpuset scope. The only reasonable
> > > measure to take is to fail the allocation right away and have the caller
> > > deal with it.
> > > 
> > > So add a check for cases like this in the slowpath of allocation, and
> > > bail out early, returning NULL for the allocation.
> > > 
> > > As page allocation is one of the hottest paths in the kernel, this check
> > > would hurt all users with a sane cpuset configuration. So add a static
> > > branch check and detect the abnormal config in the cpuset memory binding
> > > setup, so that the extra check in page allocation is not paid by everyone.
> > > 
> > > [thanks to Michal Hocko and David Rientjes for suggesting not to handle
> > >  it inside the OOM code, adding the cpuset check, and refining comments]
> > > 
> > > Suggested-by: Michal Hocko <mhocko@suse.com>
> > > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > 
> > Acked-by: Michal Hocko <mhocko@suse.com>
>  
> Thank you!
> 
> > Minor nit below
> > [...]
> > > +/* Whether the 'nodes' are all movable nodes */
> > > +static inline bool movable_only_nodes(nodemask_t *nodes)
> > > +{
> > > +	struct zonelist *zonelist;
> > > +	struct zoneref *z;
> > > +
> > > +	if (nodes_empty(*nodes))
> > > +		return false;
> > > +
> > > +	zonelist =
> > > +	    &NODE_DATA(first_node(*nodes))->node_zonelists[ZONELIST_FALLBACK];
> > > +	z = first_zones_zonelist(zonelist, ZONE_NORMAL,	nodes);
> > > +	return (!z->zone) ? true : false;
> > 
> > This would read easier to me
> > 	/*
> > 	 * We can choose an arbitrary node from the nodemask to get a
> > 	 * zonelist as they are interlinked. We just need to find
> > 	 * at least one zone that can satisfy kernel allocations.
> > 	 */
> > 	node = NODE_DATA(first_node(*nodes));
> > 	zonelist = node_zonelist(node, GFP_KERNEL);
> > 	z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes);
> 
> When working on the v4 patch, I saw compile warnings that
> 'node_zonelist()' and 'GFP_KERNEL' are implicitly declared or
> undeclared, as they come from "gfp.h".
> 
> So we may need to move this function to gfp.h, or keep the
> current code with a slight modification?
> 
> 	nid = first_node(*nodes);
> 	zonelist = &NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK];
> 	z = first_zones_zonelist(zonelist, ZONE_NORMAL,	nodes);
> 	return (!z->zone) ? true : false;

I would put it into gfp.h, but I can see how it might not be really
loved there. Both ways work for me.
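
As a side note on what the snippet under discussion computes, here is a rough
user-space sketch of the zonelist walk. It is a simplified model, not kernel
code: the zonelist is a flat array, a miss is reported as NULL rather than as a
terminating zoneref with a NULL ->zone, and every identifier ending in _model
or _M, as well as example_zonelist, is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified zone indices, lowest first, mirroring the kernel's ordering. */
enum zone_type_model { ZONE_DMA_M, ZONE_NORMAL_M, ZONE_MOVABLE_M };

/* One zonelist entry: which node the zone lives on and its zone index. */
struct zoneref_model {
	int nid;
	enum zone_type_model idx;
};

/*
 * Model of first_zones_zonelist(): return the first entry whose zone index
 * is <= highest_idx and whose node is set in nodemask, or NULL if none.
 */
static const struct zoneref_model *
first_zone_model(const struct zoneref_model *zl, size_t n,
		 enum zone_type_model highest_idx, unsigned long nodemask)
{
	for (size_t i = 0; i < n; i++) {
		/* node ids must fit into the bitmap used as the nodemask */
		assert(zl[i].nid >= 0 && zl[i].nid < (int)(8 * sizeof(nodemask)));
		if (zl[i].idx <= highest_idx && (nodemask & (1UL << zl[i].nid)))
			return &zl[i];
	}
	return NULL;
}

/* Model of movable_only_nodes(): no zone at or below ZONE_NORMAL in the mask. */
static bool movable_only_nodes_model(const struct zoneref_model *zl, size_t n,
				     unsigned long nodemask)
{
	if (nodemask == 0)
		return false;
	return first_zone_model(zl, n, ZONE_NORMAL_M, nodemask) == NULL;
}

/* Example fallback zonelist: nodes 0 and 1 have kernel zones, node 4 is movable only. */
static const struct zoneref_model example_zonelist[] = {
	{ 0, ZONE_NORMAL_M }, { 0, ZONE_DMA_M },
	{ 1, ZONE_NORMAL_M }, { 4, ZONE_MOVABLE_M },
};
```

Because a fallback zonelist covers the zones of all nodes, the model uses one
shared list, which corresponds to the point that any node from the mask can be
used to obtain the zonelist.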

Patch

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index d2b9c41..d58e047 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -34,6 +34,8 @@ 
  */
 extern struct static_key_false cpusets_pre_enable_key;
 extern struct static_key_false cpusets_enabled_key;
+extern struct static_key_false cpusets_insane_config_key;
+
 static inline bool cpusets_enabled(void)
 {
 	return static_branch_unlikely(&cpusets_enabled_key);
@@ -51,6 +53,19 @@  static inline void cpuset_dec(void)
 	static_branch_dec_cpuslocked(&cpusets_pre_enable_key);
 }
 
+/*
+ * This will get enabled whenever a cpuset configuration is considered
+ * unsupportable in general. E.g. movable only node which cannot satisfy
+ * any non movable allocations (see update_nodemask). Page allocator
+ * needs to make additional checks for those configurations and this
+ * check is meant to guard those checks without any overhead for sane
+ * configurations.
+ */
+static inline bool cpusets_insane_config(void)
+{
+	return static_branch_unlikely(&cpusets_insane_config_key);
+}
+
 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
@@ -167,6 +182,8 @@  static inline void set_mems_allowed(nodemask_t nodemask)
 
 static inline bool cpusets_enabled(void) { return false; }
 
+static inline bool cpusets_insane_config(void) { return false; }
+
 static inline int cpuset_init(void) { return 0; }
 static inline void cpuset_init_smp(void) {}
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6a1d79d..a455333 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1220,6 +1220,22 @@  static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 #define for_each_zone_zonelist(zone, z, zlist, highidx) \
 	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
 
+/* Whether the 'nodes' are all movable nodes */
+static inline bool movable_only_nodes(nodemask_t *nodes)
+{
+	struct zonelist *zonelist;
+	struct zoneref *z;
+
+	if (nodes_empty(*nodes))
+		return false;
+
+	zonelist =
+	    &NODE_DATA(first_node(*nodes))->node_zonelists[ZONELIST_FALLBACK];
+	z = first_zones_zonelist(zonelist, ZONE_NORMAL,	nodes);
+	return (!z->zone) ? true : false;
+}
+
+
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
 #endif
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index df1ccf4..7fa633e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -69,6 +69,13 @@ 
 DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
 DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
 
+/*
+ * There could be abnormal cpuset configurations for cpu or memory
+ * node binding; this key provides a quick, low-cost way to check
+ * for that situation.
+ */
+DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key);
+
 /* See "Frequency meter" comments, below. */
 
 struct fmeter {
@@ -1868,6 +1875,14 @@  static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		goto done;
 
+	if (!cpusets_insane_config() &&
+		movable_only_nodes(&trialcs->mems_allowed)) {
+		static_branch_enable(&cpusets_insane_config_key);
+		pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! "
+			"Cpuset allocations might fail even with a lot of memory available.\n",
+			nodemask_pr_args(&trialcs->mems_allowed));
+	}
+
 	spin_lock_irq(&callback_lock);
 	cs->mems_allowed = trialcs->mems_allowed;
 	spin_unlock_irq(&callback_lock);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b37435c..a7e0854 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4914,6 +4914,19 @@  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (!ac->preferred_zoneref->zone)
 		goto nopage;
 
+	/*
+	 * Check for insane configurations where the cpuset doesn't contain
+	 * any suitable zone to satisfy the request - e.g. non-movable
+	 * GFP_HIGHUSER allocations from MOVABLE nodes only.
+	 */
+	if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
+		struct zoneref *z = first_zones_zonelist(ac->zonelist,
+					ac->highest_zoneidx,
+					&cpuset_current_mems_allowed);
+		if (!z->zone)
+			goto nopage;
+	}
+
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
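
To illustrate the decision made by the new slowpath check, here is a minimal
user-space sketch. It rests on several assumptions: the static key is a plain
bool, (gfp_mask & __GFP_HARDWALL) is reduced to a bool argument, populated
zones are per-node bitmaps, and all names in it (struct alloc_model,
has_eligible_zone(), slowpath_check_model() and the _M constants) are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES_M 8

/* Simplified zone indices, lowest first. */
enum zone_type_model { ZONE_DMA_M, ZONE_NORMAL_M, ZONE_MOVABLE_M };

enum slowpath_outcome { SLOWPATH_CONTINUE, SLOWPATH_NOPAGE };

struct alloc_model {
	/* zones_on_node[nid] has bit (1 << idx) set for each populated zone */
	unsigned int zones_on_node[MAX_NODES_M];
	unsigned long mems_allowed;	/* nodes the current task's cpuset allows */
	bool insane_config;		/* stand-in for the static key */
};

/* True if some allowed node has a populated zone with index <= highest_idx. */
static bool has_eligible_zone(const struct alloc_model *m,
			      enum zone_type_model highest_idx)
{
	unsigned int usable = (1u << (highest_idx + 1)) - 1;

	for (int nid = 0; nid < MAX_NODES_M; nid++) {
		if ((m->mems_allowed & (1UL << nid)) &&
		    (m->zones_on_node[nid] & usable))
			return true;
	}
	return false;
}

/*
 * Bail out early only when the flag is set, the request is bound to the
 * cpuset (hardwall), and no allowed node can satisfy the requested zone.
 */
static enum slowpath_outcome
slowpath_check_model(const struct alloc_model *m,
		     enum zone_type_model highest_idx, bool hardwall)
{
	assert(m != NULL);
	if (m->insane_config && hardwall && !has_eligible_zone(m, highest_idx))
		return SLOWPATH_NOPAGE;
	return SLOWPATH_CONTINUE;
}
```

With a cpuset bound to a movable-only node, a request capped at the normal zone
takes the early exit in this model, while a request that may use the movable
zone continues as before. When the flag is clear, the zone scan is skipped
entirely.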