
[v2] mm/page_alloc: detect allocation forbidden by cpuset and bail out early

Message ID 1631518709-42881-1-git-send-email-feng.tang@intel.com (mailing list archive)
State New
Series [v2] mm/page_alloc: detect allocation forbidden by cpuset and bail out early

Commit Message

Feng Tang Sept. 13, 2021, 7:38 a.m. UTC
There was a report that starting an Ubuntu container in docker while
using cpuset to bind it to movable nodes (a node that only has a movable
zone, like a node reserved for hotplug or a Persistent Memory node in
normal usage) will fail due to memory allocation failures, after which
the OOM killer gets involved and many other innocent processes are
killed. It can be reproduced with the command:
$docker run -it --rm  --cpuset-mems 4 ubuntu:latest bash -c
"grep Mems_allowed /proc/self/status" (node 4 is a movable node)

  
  runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
  CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
  Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
  Call Trace:
   dump_stack+0x6b/0x88
   dump_header+0x4a/0x1e2
   oom_kill_process.cold+0xb/0x10
   out_of_memory.part.0+0xaf/0x230
   out_of_memory+0x3d/0x80
   __alloc_pages_slowpath.constprop.0+0x954/0xa20
   __alloc_pages_nodemask+0x2d3/0x300
   pipe_write+0x322/0x590
   new_sync_write+0x196/0x1b0
   vfs_write+0x1c3/0x1f0
   ksys_write+0xa7/0xe0
   do_syscall_64+0x52/0xd0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  
  Mem-Info:
  active_anon:392832 inactive_anon:182 isolated_anon:0
   active_file:68130 inactive_file:151527 isolated_file:0
   unevictable:2701 dirty:0 writeback:7
   slab_reclaimable:51418 slab_unreclaimable:116300
   mapped:45825 shmem:735 pagetables:2540 bounce:0
   free:159849484 free_pcp:73 free_cma:0
  Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
  Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB
  lowmem_reserve[]: 0 0 0 0 0
  Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB
  
  oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0
  Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0
  oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB
  oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB


The reason is that in this case the target cpuset nodes only have a
movable zone, while setting up an OS inside docker sometimes needs to
allocate memory from non-movable zones (dma/dma32/normal), e.g. for
GFP_HIGHUSER allocations. The cpuset limit forbids such allocations,
so out-of-memory killing kicks in even when both the normal nodes and
the movable nodes have plenty of free memory.
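
To illustrate (a userspace sketch, not part of the patch): an
allocation may only be placed in ZONE_MOVABLE when its gfp mask
contains __GFP_MOVABLE, which GFP_HIGHUSER does not, so a movable-only
node has no eligible zone for it.

  /* Userspace approximation of how gfp flags select the highest
   * usable zone; the constants mimic the kernel's but are only
   * illustrative here. */
  #include <stdio.h>

  enum zone_type { ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE };

  #define __GFP_HIGHMEM 0x02u
  #define __GFP_MOVABLE 0x08u

  static enum zone_type gfp_zone_sketch(unsigned int gfp_mask)
  {
          /* ZONE_MOVABLE is eligible only when __GFP_MOVABLE is set */
          if ((gfp_mask & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
              (__GFP_HIGHMEM | __GFP_MOVABLE))
                  return ZONE_MOVABLE;
          if (gfp_mask & __GFP_HIGHMEM)
                  return ZONE_HIGHMEM;
          return ZONE_NORMAL;
  }

  int main(void)
  {
          /* GFP_HIGHUSER sets __GFP_HIGHMEM but not __GFP_MOVABLE */
          printf("GFP_HIGHUSER         -> zone %d (ZONE_HIGHMEM at most)\n",
                 gfp_zone_sketch(__GFP_HIGHMEM));
          printf("GFP_HIGHUSER_MOVABLE -> zone %d (ZONE_MOVABLE)\n",
                 gfp_zone_sketch(__GFP_HIGHMEM | __GFP_MOVABLE));
          return 0;
  }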

The OOM killer cannot help resolve the situation, as there is no usable
memory for the request within the cpuset's scope. The only reasonable
measure to take is to fail the allocation right away and have the
caller deal with it.

So add a check for such cases in the allocation slowpath and bail out
early, returning NULL for the allocation.

As page allocation is one of the hottest paths in the kernel, an
unconditional check would hurt all users with a sane cpuset
configuration. So add a static branch and detect the abnormal config
during cpuset memory binding setup, so that the extra check in the
page allocator is not paid by everyone.
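
For reference, the generic static-key pattern being relied on (a
simplified sketch, not the hunks below): the disabled branch costs a
single patched nop on the fast path, and enabling the key rewrites
that nop at runtime.

  #include <linux/jump_label.h>

  DEFINE_STATIC_KEY_FALSE(example_key);

  static inline bool example_condition(void)
  {
          /* compiles to a nop while the key stays disabled */
          return static_branch_unlikely(&example_key);
  }

  /* rare setup path */
  static void example_detect(bool abnormal)
  {
          if (abnormal)
                  static_branch_enable(&example_key); /* patch the branch */
  }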

[thanks to Michal Hocko and David Rientjes for suggesting not handling
 it inside the OOM code, adding the cpuset check, and refining comments]

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
Changelog:
 
  v2:
  * add a static branch detection in cpuset code to reduce
    the overhead in allocation hotpath (Michal Hocko)

  v1 (since RFC):
  * move the handling from oom code to page allocation 
    path (Michal/David)

 include/linux/cpuset.h | 15 +++++++++++++++
 include/linux/mmzone.h | 12 ++++++++++++
 kernel/cgroup/cpuset.c | 14 ++++++++++++++
 mm/page_alloc.c        | 13 +++++++++++++
 4 files changed, 54 insertions(+)

Comments

Michal Hocko Sept. 13, 2021, 9:15 a.m. UTC | #1
On Mon 13-09-21 15:38:29, Feng Tang wrote:
> There was a report that starting an Ubuntu container in docker while
> using cpuset to bind it to movable nodes (a node that only has a movable
> zone, like a node reserved for hotplug or a Persistent Memory node in
> normal usage) will fail due to memory allocation failures, after which
> the OOM killer gets involved and many other innocent processes are
> killed. It can be reproduced with the command:
> $docker run -it --rm  --cpuset-mems 4 ubuntu:latest bash -c
> "grep Mems_allowed /proc/self/status" (node 4 is a movable node)
> 
>   
>   runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
>   CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
>   Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
>   Call Trace:
>    dump_stack+0x6b/0x88
>    dump_header+0x4a/0x1e2
>    oom_kill_process.cold+0xb/0x10
>    out_of_memory.part.0+0xaf/0x230
>    out_of_memory+0x3d/0x80
>    __alloc_pages_slowpath.constprop.0+0x954/0xa20
>    __alloc_pages_nodemask+0x2d3/0x300
>    pipe_write+0x322/0x590
>    new_sync_write+0x196/0x1b0
>    vfs_write+0x1c3/0x1f0
>    ksys_write+0xa7/0xe0
>    do_syscall_64+0x52/0xd0
>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
>   
>   Mem-Info:
>   active_anon:392832 inactive_anon:182 isolated_anon:0
>    active_file:68130 inactive_file:151527 isolated_file:0
>    unevictable:2701 dirty:0 writeback:7
>    slab_reclaimable:51418 slab_unreclaimable:116300
>    mapped:45825 shmem:735 pagetables:2540 bounce:0
>    free:159849484 free_pcp:73 free_cma:0
>   Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
>   Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB
>   lowmem_reserve[]: 0 0 0 0 0
>   Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB
>   
>   oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0
>   Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0
>   oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>   oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>   oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB
>   oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>   oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> 
> 
> The reason is that in this case the target cpuset nodes only have a
> movable zone, while setting up an OS inside docker sometimes needs to
> allocate memory from non-movable zones (dma/dma32/normal), e.g. for
> GFP_HIGHUSER allocations. The cpuset limit forbids such allocations,
> so out-of-memory killing kicks in even when both the normal nodes and
> the movable nodes have plenty of free memory.
> 
> The OOM killer cannot help resolve the situation, as there is no usable
> memory for the request within the cpuset's scope. The only reasonable
> measure to take is to fail the allocation right away and have the
> caller deal with it.
> 
> So add a check for such cases in the allocation slowpath and bail out
> early, returning NULL for the allocation.
> 
> As page allocation is one of the hottest paths in the kernel, an
> unconditional check would hurt all users with a sane cpuset
> configuration. So add a static branch and detect the abnormal config
> during cpuset memory binding setup, so that the extra check in the
> page allocator is not paid by everyone.
> 
> [thanks to Michal Hocko and David Rientjes for suggesting not handling
>  it inside the OOM code, adding the cpuset check, and refining comments]
> 
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
> Changelog:
>  
>   v2:
>   * add a static branch detection in cpuset code to reduce
>     the overhead in allocation hotpath (Michal Hocko)
> 
>   v1 (since RFC):
>   * move the handling from oom code to page allocation 
>     path (Michal/David)
> 
>  include/linux/cpuset.h | 15 +++++++++++++++
>  include/linux/mmzone.h | 12 ++++++++++++
>  kernel/cgroup/cpuset.c | 14 ++++++++++++++
>  mm/page_alloc.c        | 13 +++++++++++++
>  4 files changed, 54 insertions(+)
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index d2b9c41..403ccf9 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -34,6 +34,8 @@
>   */
>  extern struct static_key_false cpusets_pre_enable_key;
>  extern struct static_key_false cpusets_enabled_key;
> +extern struct static_key_false cpusets_insane_config_key;
> +
>  static inline bool cpusets_enabled(void)
>  {
>  	return static_branch_unlikely(&cpusets_enabled_key);
> @@ -51,6 +53,19 @@ static inline void cpuset_dec(void)
>  	static_branch_dec_cpuslocked(&cpusets_pre_enable_key);
>  }
>  
> +/*
> + * This will get enabled whenever a cpuset configuration is considered
> + * unsupportable in general. E.g. movable only node which cannot satisfy
> + * any non movable allocations (see update_nodemask). Page allocator
> + * needs to make additional checks for those configurations and this
> + * check is meant to guard those checks without any overhead for sane
> + * configurations.
> + */
> +static inline bool cpusets_insane_config(void)
> +{
> +	return static_branch_unlikely(&cpusets_insane_config_key);
> +}
> +
>  extern int cpuset_init(void);
>  extern void cpuset_init_smp(void);
>  extern void cpuset_force_rebuild(void);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 6a1d79d..b69b871 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1220,6 +1220,18 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
>  #define for_each_zone_zonelist(zone, z, zlist, highidx) \
>  	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
>  
> +/* Whether the 'nodes' are all movable nodes */
> +static inline bool movable_only_nodes(nodemask_t *nodes)
> +{
> +	struct zonelist *zonelist;
> +	struct zoneref *z;
> +
> +	zonelist = &(first_online_pgdat())->node_zonelists[ZONELIST_FALLBACK];

This will work, but it begs the question of why you haven't chosen a
node from the given nodemask. So I believe it would be easier to read
if you did
	zonelist = NODE_DATA(first_node(nodes))->node_zonelists[ZONELIST_FALLBACK]

> +	z = first_zones_zonelist(zonelist, ZONE_NORMAL,	nodes);
> +	return (!z->zone) ? true : false;
> +}
> +
> +
>  #ifdef CONFIG_SPARSEMEM
>  #include <asm/sparsemem.h>
>  #endif
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index df1ccf4..03eb40c 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -69,6 +69,13 @@
>  DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
>  DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
>  
> +/*
> + * There could be abnormal cpuset configurations for cpu or memory
> + * node binding, add this key to provide a quick low-cost judgement
> + * of the situation.
> + */
> +DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key);
> +
>  /* See "Frequency meter" comments, below. */
>  
>  struct fmeter {
> @@ -1868,6 +1875,13 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
>  	if (retval < 0)
>  		goto done;
>  
> +	if (movable_only_nodes(&trialcs->mems_allowed)) {

You can skip the check if cpusets_insane_config(). The question is
whether you want to report all potential users or only the first one.

> +		static_branch_enable(&cpusets_insane_config_key);
> +		pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! "
> +			"Cpuset allocations might fail even with a lot of memory available.\n",
> +			nodemask_pr_args(&trialcs->mems_allowed));
> +	}
> +
>  	spin_lock_irq(&callback_lock);
>  	cs->mems_allowed = trialcs->mems_allowed;
>  	spin_unlock_irq(&callback_lock);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b37435c..a7e0854 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4914,6 +4914,19 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (!ac->preferred_zoneref->zone)
>  		goto nopage;
>  
> +	/*
> +	 * Check for insane configurations where the cpuset doesn't contain
> +	 * any suitable zone to satisfy the request - e.g. non-movable
> +	 * GFP_HIGHUSER allocations from MOVABLE nodes only.
> +	 */
> +	if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
> +		struct zoneref *z = first_zones_zonelist(ac->zonelist,
> +					ac->highest_zoneidx,
> +					&cpuset_current_mems_allowed);
> +		if (!z->zone)
> +			goto nopage;
> +	}
> +
>  	if (alloc_flags & ALLOC_KSWAPD)
>  		wake_all_kswapds(order, gfp_mask, ac);

The rest looks sensible to me.

kernel test robot Sept. 13, 2021, 9:40 a.m. UTC | #2
Hi Feng,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on hnaz-linux-mm/master]

url:    https://github.com/0day-ci/linux/commits/Feng-Tang/mm-page_alloc-detect-allocation-forbidden-by-cpuset-and-bail-out-early/20210913-154016
base:   https://github.com/hnaz/linux-mm master
config: arc-randconfig-r043-20210913 (attached as .config)
compiler: arc-elf-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/276fb2292fa199777b3e9a394c8737e4c618cd23
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Feng-Tang/mm-page_alloc-detect-allocation-forbidden-by-cpuset-and-bail-out-early/20210913-154016
        git checkout 276fb2292fa199777b3e9a394c8737e4c618cd23
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=arc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   mm/page_alloc.c:3810:15: warning: no previous prototype for 'should_fail_alloc_page' [-Wmissing-prototypes]
    3810 | noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
         |               ^~~~~~~~~~~~~~~~~~~~~~
   mm/page_alloc.c: In function '__alloc_pages_slowpath':
>> mm/page_alloc.c:4922:13: error: implicit declaration of function 'cpusets_insane_config' [-Werror=implicit-function-declaration]
    4922 |         if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
         |             ^~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +/cpusets_insane_config +4922 mm/page_alloc.c

  4868	
  4869	static inline struct page *
  4870	__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  4871							struct alloc_context *ac)
  4872	{
  4873		bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
  4874		const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
  4875		struct page *page = NULL;
  4876		unsigned int alloc_flags;
  4877		unsigned long did_some_progress;
  4878		enum compact_priority compact_priority;
  4879		enum compact_result compact_result;
  4880		int compaction_retries;
  4881		int no_progress_loops;
  4882		unsigned int cpuset_mems_cookie;
  4883		int reserve_flags;
  4884	
  4885		/*
  4886		 * We also sanity check to catch abuse of atomic reserves being used by
  4887		 * callers that are not in atomic context.
  4888		 */
  4889		if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
  4890					(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
  4891			gfp_mask &= ~__GFP_ATOMIC;
  4892	
  4893	retry_cpuset:
  4894		compaction_retries = 0;
  4895		no_progress_loops = 0;
  4896		compact_priority = DEF_COMPACT_PRIORITY;
  4897		cpuset_mems_cookie = read_mems_allowed_begin();
  4898	
  4899		/*
  4900		 * The fast path uses conservative alloc_flags to succeed only until
  4901		 * kswapd needs to be woken up, and to avoid the cost of setting up
  4902		 * alloc_flags precisely. So we do that now.
  4903		 */
  4904		alloc_flags = gfp_to_alloc_flags(gfp_mask);
  4905	
  4906		/*
  4907		 * We need to recalculate the starting point for the zonelist iterator
  4908		 * because we might have used different nodemask in the fast path, or
  4909		 * there was a cpuset modification and we are retrying - otherwise we
  4910		 * could end up iterating over non-eligible zones endlessly.
  4911		 */
  4912		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
  4913						ac->highest_zoneidx, ac->nodemask);
  4914		if (!ac->preferred_zoneref->zone)
  4915			goto nopage;
  4916	
  4917		/*
  4918		 * Check for insane configurations where the cpuset doesn't contain
  4919		 * any suitable zone to satisfy the request - e.g. non-movable
  4920		 * GFP_HIGHUSER allocations from MOVABLE nodes only.
  4921		 */
> 4922		if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
  4923			struct zoneref *z = first_zones_zonelist(ac->zonelist,
  4924						ac->highest_zoneidx,
  4925						&cpuset_current_mems_allowed);
  4926			if (!z->zone)
  4927				goto nopage;
  4928		}
  4929	
  4930		if (alloc_flags & ALLOC_KSWAPD)
  4931			wake_all_kswapds(order, gfp_mask, ac);
  4932	
  4933		/*
  4934		 * The adjusted alloc_flags might result in immediate success, so try
  4935		 * that first
  4936		 */
  4937		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
  4938		if (page)
  4939			goto got_pg;
  4940	
  4941		/*
  4942		 * For costly allocations, try direct compaction first, as it's likely
  4943		 * that we have enough base pages and don't need to reclaim. For non-
  4944		 * movable high-order allocations, do that as well, as compaction will
  4945		 * try prevent permanent fragmentation by migrating from blocks of the
  4946		 * same migratetype.
  4947		 * Don't try this for allocations that are allowed to ignore
  4948		 * watermarks, as the ALLOC_NO_WATERMARKS attempt didn't yet happen.
  4949		 */
  4950		if (can_direct_reclaim &&
  4951				(costly_order ||
  4952				   (order > 0 && ac->migratetype != MIGRATE_MOVABLE))
  4953				&& !gfp_pfmemalloc_allowed(gfp_mask)) {
  4954			page = __alloc_pages_direct_compact(gfp_mask, order,
  4955							alloc_flags, ac,
  4956							INIT_COMPACT_PRIORITY,
  4957							&compact_result);
  4958			if (page)
  4959				goto got_pg;
  4960	
  4961			/*
  4962			 * Checks for costly allocations with __GFP_NORETRY, which
  4963			 * includes some THP page fault allocations
  4964			 */
  4965			if (costly_order && (gfp_mask & __GFP_NORETRY)) {
  4966				/*
  4967				 * If allocating entire pageblock(s) and compaction
  4968				 * failed because all zones are below low watermarks
  4969				 * or is prohibited because it recently failed at this
  4970				 * order, fail immediately unless the allocator has
  4971				 * requested compaction and reclaim retry.
  4972				 *
  4973				 * Reclaim is
  4974				 *  - potentially very expensive because zones are far
  4975				 *    below their low watermarks or this is part of very
  4976				 *    bursty high order allocations,
  4977				 *  - not guaranteed to help because isolate_freepages()
  4978				 *    may not iterate over freed pages as part of its
  4979				 *    linear scan, and
  4980				 *  - unlikely to make entire pageblocks free on its
  4981				 *    own.
  4982				 */
  4983				if (compact_result == COMPACT_SKIPPED ||
  4984				    compact_result == COMPACT_DEFERRED)
  4985					goto nopage;
  4986	
  4987				/*
  4988				 * Looks like reclaim/compaction is worth trying, but
  4989				 * sync compaction could be very expensive, so keep
  4990				 * using async compaction.
  4991				 */
  4992				compact_priority = INIT_COMPACT_PRIORITY;
  4993			}
  4994		}
  4995	
  4996	retry:
  4997		/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
  4998		if (alloc_flags & ALLOC_KSWAPD)
  4999			wake_all_kswapds(order, gfp_mask, ac);
  5000	
  5001		reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
  5002		if (reserve_flags)
  5003			alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, reserve_flags);
  5004	
  5005		/*
  5006		 * Reset the nodemask and zonelist iterators if memory policies can be
  5007		 * ignored. These allocations are high priority and system rather than
  5008		 * user oriented.
  5009		 */
  5010		if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
  5011			ac->nodemask = NULL;
  5012			ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
  5013						ac->highest_zoneidx, ac->nodemask);
  5014		}
  5015	
  5016		/* Attempt with potentially adjusted zonelist and alloc_flags */
  5017		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
  5018		if (page)
  5019			goto got_pg;
  5020	
  5021		/* Caller is not willing to reclaim, we can't balance anything */
  5022		if (!can_direct_reclaim)
  5023			goto nopage;
  5024	
  5025		/* Avoid recursion of direct reclaim */
  5026		if (current->flags & PF_MEMALLOC)
  5027			goto nopage;
  5028	
  5029		/* Try direct reclaim and then allocating */
  5030		page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
  5031								&did_some_progress);
  5032		if (page)
  5033			goto got_pg;
  5034	
  5035		/* Try direct compaction and then allocating */
  5036		page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
  5037						compact_priority, &compact_result);
  5038		if (page)
  5039			goto got_pg;
  5040	
  5041		/* Do not loop if specifically requested */
  5042		if (gfp_mask & __GFP_NORETRY)
  5043			goto nopage;
  5044	
  5045		/*
  5046		 * Do not retry costly high order allocations unless they are
  5047		 * __GFP_RETRY_MAYFAIL
  5048		 */
  5049		if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
  5050			goto nopage;
  5051	
  5052		if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
  5053					 did_some_progress > 0, &no_progress_loops))
  5054			goto retry;
  5055	
  5056		/*
  5057		 * It doesn't make any sense to retry for the compaction if the order-0
  5058		 * reclaim is not able to make any progress because the current
  5059		 * implementation of the compaction depends on the sufficient amount
  5060		 * of free memory (see __compaction_suitable)
  5061		 */
  5062		if (did_some_progress > 0 &&
  5063				should_compact_retry(ac, order, alloc_flags,
  5064					compact_result, &compact_priority,
  5065					&compaction_retries))
  5066			goto retry;
  5067	
  5068	
  5069		/* Deal with possible cpuset update races before we start OOM killing */
  5070		if (check_retry_cpuset(cpuset_mems_cookie, ac))
  5071			goto retry_cpuset;
  5072	
  5073		/* Reclaim has failed us, start killing things */
  5074		page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
  5075		if (page)
  5076			goto got_pg;
  5077	
  5078		/* Avoid allocations with no watermarks from looping endlessly */
  5079		if (tsk_is_oom_victim(current) &&
  5080		    (alloc_flags & ALLOC_OOM ||
  5081		     (gfp_mask & __GFP_NOMEMALLOC)))
  5082			goto nopage;
  5083	
  5084		/* Retry as long as the OOM killer is making progress */
  5085		if (did_some_progress) {
  5086			no_progress_loops = 0;
  5087			goto retry;
  5088		}
  5089	
  5090	nopage:
  5091		/* Deal with possible cpuset update races before we fail */
  5092		if (check_retry_cpuset(cpuset_mems_cookie, ac))
  5093			goto retry_cpuset;
  5094	
  5095		/*
  5096		 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
  5097		 * we always retry
  5098		 */
  5099		if (gfp_mask & __GFP_NOFAIL) {
  5100			/*
  5101			 * All existing users of the __GFP_NOFAIL are blockable, so warn
  5102			 * of any new users that actually require GFP_NOWAIT
  5103			 */
  5104			if (WARN_ON_ONCE(!can_direct_reclaim))
  5105				goto fail;
  5106	
  5107			/*
  5108			 * PF_MEMALLOC request from this context is rather bizarre
  5109			 * because we cannot reclaim anything and only can loop waiting
  5110			 * for somebody to do a work for us
  5111			 */
  5112			WARN_ON_ONCE(current->flags & PF_MEMALLOC);
  5113	
  5114			/*
  5115			 * non failing costly orders are a hard requirement which we
  5116			 * are not prepared for much so let's warn about these users
  5117			 * so that we can identify them and convert them to something
  5118			 * else.
  5119			 */
  5120			WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
  5121	
  5122			/*
  5123			 * Help non-failing allocations by giving them access to memory
  5124			 * reserves but do not use ALLOC_NO_WATERMARKS because this
  5125			 * could deplete whole memory reserves which would just make
  5126			 * the situation worse
  5127			 */
  5128			page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
  5129			if (page)
  5130				goto got_pg;
  5131	
  5132			cond_resched();
  5133			goto retry;
  5134		}
  5135	fail:
  5136		warn_alloc(gfp_mask, ac->nodemask,
  5137				"page allocation failure: order:%u", order);
  5138	got_pg:
  5139		return page;
  5140	}
  5141	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
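
The error comes from a CONFIG_CPUSETS=n randconfig: cpusets_insane_config()
is only declared in the CONFIG_CPUSETS branch of include/linux/cpuset.h, so
the !CONFIG_CPUSETS stub section presumably needs a no-op counterpart along
these lines (sketch of the likely fix, not part of the posted v2):

  /* in the #else / !CONFIG_CPUSETS section of include/linux/cpuset.h */
  static inline bool cpusets_insane_config(void)
  {
          return false;
  }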

kernel test robot Sept. 13, 2021, 10:05 a.m. UTC | #3
Hi Feng,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on hnaz-linux-mm/master]

url:    https://github.com/0day-ci/linux/commits/Feng-Tang/mm-page_alloc-detect-allocation-forbidden-by-cpuset-and-bail-out-early/20210913-154016
base:   https://github.com/hnaz/linux-mm master
config: riscv-randconfig-r042-20210913 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 261cbe98c38f8c1ee1a482fe76511110e790f58a)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install riscv cross compiling tool for clang build
        # apt-get install binutils-riscv64-linux-gnu
        # https://github.com/0day-ci/linux/commit/276fb2292fa199777b3e9a394c8737e4c618cd23
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Feng-Tang/mm-page_alloc-detect-allocation-forbidden-by-cpuset-and-bail-out-early/20210913-154016
        git checkout 276fb2292fa199777b3e9a394c8737e4c618cd23
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=riscv 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from mm/page_alloc.c:20:
   In file included from include/linux/highmem.h:10:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/riscv/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/riscv/include/asm/io.h:136:
   include/asm-generic/io.h:464:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __raw_readb(PCI_IOBASE + addr);
                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:477:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:36:51: note: expanded from macro '__le16_to_cpu'
   #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
                                                     ^
   In file included from mm/page_alloc.c:20:
   In file included from include/linux/highmem.h:10:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/riscv/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/riscv/include/asm/io.h:136:
   include/asm-generic/io.h:490:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:34:51: note: expanded from macro '__le32_to_cpu'
   #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
                                                     ^
   In file included from mm/page_alloc.c:20:
   In file included from include/linux/highmem.h:10:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/riscv/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/riscv/include/asm/io.h:136:
   include/asm-generic/io.h:501:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writeb(value, PCI_IOBASE + addr);
                               ~~~~~~~~~~ ^
   include/asm-generic/io.h:511:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
   include/asm-generic/io.h:521:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
   include/asm-generic/io.h:1024:55: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           return (port > MMIO_UPPER_LIMIT) ? NULL : PCI_IOBASE + port;
                                                     ~~~~~~~~~~ ^
   mm/page_alloc.c:3810:15: warning: no previous prototype for function 'should_fail_alloc_page' [-Wmissing-prototypes]
   noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
                 ^
   mm/page_alloc.c:3810:10: note: declare 'static' if the function is not intended to be used outside of this translation unit
   noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
            ^
            static 
>> mm/page_alloc.c:4922:6: error: implicit declaration of function 'cpusets_insane_config' [-Werror,-Wimplicit-function-declaration]
           if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
               ^
   8 warnings and 1 error generated.


vim +/cpusets_insane_config +4922 mm/page_alloc.c

  4868	
  4869	static inline struct page *
  4870	__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  4871							struct alloc_context *ac)
  4872	{
  4873		bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
  4874		const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
  4875		struct page *page = NULL;
  4876		unsigned int alloc_flags;
  4877		unsigned long did_some_progress;
  4878		enum compact_priority compact_priority;
  4879		enum compact_result compact_result;
  4880		int compaction_retries;
  4881		int no_progress_loops;
  4882		unsigned int cpuset_mems_cookie;
  4883		int reserve_flags;
  4884	
  4885		/*
  4886		 * We also sanity check to catch abuse of atomic reserves being used by
  4887		 * callers that are not in atomic context.
  4888		 */
  4889		if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
  4890					(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
  4891			gfp_mask &= ~__GFP_ATOMIC;
  4892	
  4893	retry_cpuset:
  4894		compaction_retries = 0;
  4895		no_progress_loops = 0;
  4896		compact_priority = DEF_COMPACT_PRIORITY;
  4897		cpuset_mems_cookie = read_mems_allowed_begin();
  4898	
  4899		/*
  4900		 * The fast path uses conservative alloc_flags to succeed only until
  4901		 * kswapd needs to be woken up, and to avoid the cost of setting up
  4902		 * alloc_flags precisely. So we do that now.
  4903		 */
  4904		alloc_flags = gfp_to_alloc_flags(gfp_mask);
  4905	
  4906		/*
  4907		 * We need to recalculate the starting point for the zonelist iterator
  4908		 * because we might have used different nodemask in the fast path, or
  4909		 * there was a cpuset modification and we are retrying - otherwise we
  4910		 * could end up iterating over non-eligible zones endlessly.
  4911		 */
  4912		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
  4913						ac->highest_zoneidx, ac->nodemask);
  4914		if (!ac->preferred_zoneref->zone)
  4915			goto nopage;
  4916	
  4917		/*
  4918		 * Check for insane configurations where the cpuset doesn't contain
  4919		 * any suitable zone to satisfy the request - e.g. non-movable
  4920		 * GFP_HIGHUSER allocations from MOVABLE nodes only.
  4921		 */
> 4922		if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
  4923			struct zoneref *z = first_zones_zonelist(ac->zonelist,
  4924						ac->highest_zoneidx,
  4925						&cpuset_current_mems_allowed);
  4926			if (!z->zone)
  4927				goto nopage;
  4928		}
  4929	
  4930		if (alloc_flags & ALLOC_KSWAPD)
  4931			wake_all_kswapds(order, gfp_mask, ac);
  4932	
  4933		/*
  4934		 * The adjusted alloc_flags might result in immediate success, so try
  4935		 * that first
  4936		 */
  4937		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
  4938		if (page)
  4939			goto got_pg;
  4940	
  4941		/*
  4942		 * For costly allocations, try direct compaction first, as it's likely
  4943		 * that we have enough base pages and don't need to reclaim. For non-
  4944		 * movable high-order allocations, do that as well, as compaction will
  4945		 * try prevent permanent fragmentation by migrating from blocks of the
  4946		 * same migratetype.
  4947		 * Don't try this for allocations that are allowed to ignore
  4948		 * watermarks, as the ALLOC_NO_WATERMARKS attempt didn't yet happen.
  4949		 */
  4950		if (can_direct_reclaim &&
  4951				(costly_order ||
  4952				   (order > 0 && ac->migratetype != MIGRATE_MOVABLE))
  4953				&& !gfp_pfmemalloc_allowed(gfp_mask)) {
  4954			page = __alloc_pages_direct_compact(gfp_mask, order,
  4955							alloc_flags, ac,
  4956							INIT_COMPACT_PRIORITY,
  4957							&compact_result);
  4958			if (page)
  4959				goto got_pg;
  4960	
  4961			/*
  4962			 * Checks for costly allocations with __GFP_NORETRY, which
  4963			 * includes some THP page fault allocations
  4964			 */
  4965			if (costly_order && (gfp_mask & __GFP_NORETRY)) {
  4966				/*
  4967				 * If allocating entire pageblock(s) and compaction
  4968				 * failed because all zones are below low watermarks
  4969				 * or is prohibited because it recently failed at this
  4970				 * order, fail immediately unless the allocator has
  4971				 * requested compaction and reclaim retry.
  4972				 *
  4973				 * Reclaim is
  4974				 *  - potentially very expensive because zones are far
  4975				 *    below their low watermarks or this is part of very
  4976				 *    bursty high order allocations,
  4977				 *  - not guaranteed to help because isolate_freepages()
  4978				 *    may not iterate over freed pages as part of its
  4979				 *    linear scan, and
  4980				 *  - unlikely to make entire pageblocks free on its
  4981				 *    own.
  4982				 */
  4983				if (compact_result == COMPACT_SKIPPED ||
  4984				    compact_result == COMPACT_DEFERRED)
  4985					goto nopage;
  4986	
  4987				/*
  4988				 * Looks like reclaim/compaction is worth trying, but
  4989				 * sync compaction could be very expensive, so keep
  4990				 * using async compaction.
  4991				 */
  4992				compact_priority = INIT_COMPACT_PRIORITY;
  4993			}
  4994		}
  4995	
  4996	retry:
  4997		/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
  4998		if (alloc_flags & ALLOC_KSWAPD)
  4999			wake_all_kswapds(order, gfp_mask, ac);
  5000	
  5001		reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
  5002		if (reserve_flags)
  5003			alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, reserve_flags);
  5004	
  5005		/*
  5006		 * Reset the nodemask and zonelist iterators if memory policies can be
  5007		 * ignored. These allocations are high priority and system rather than
  5008		 * user oriented.
  5009		 */
  5010		if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
  5011			ac->nodemask = NULL;
  5012			ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
  5013						ac->highest_zoneidx, ac->nodemask);
  5014		}
  5015	
  5016		/* Attempt with potentially adjusted zonelist and alloc_flags */
  5017		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
  5018		if (page)
  5019			goto got_pg;
  5020	
  5021		/* Caller is not willing to reclaim, we can't balance anything */
  5022		if (!can_direct_reclaim)
  5023			goto nopage;
  5024	
  5025		/* Avoid recursion of direct reclaim */
  5026		if (current->flags & PF_MEMALLOC)
  5027			goto nopage;
  5028	
  5029		/* Try direct reclaim and then allocating */
  5030		page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
  5031								&did_some_progress);
  5032		if (page)
  5033			goto got_pg;
  5034	
  5035		/* Try direct compaction and then allocating */
  5036		page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
  5037						compact_priority, &compact_result);
  5038		if (page)
  5039			goto got_pg;
  5040	
  5041		/* Do not loop if specifically requested */
  5042		if (gfp_mask & __GFP_NORETRY)
  5043			goto nopage;
  5044	
  5045		/*
  5046		 * Do not retry costly high order allocations unless they are
  5047		 * __GFP_RETRY_MAYFAIL
  5048		 */
  5049		if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
  5050			goto nopage;
  5051	
  5052		if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
  5053					 did_some_progress > 0, &no_progress_loops))
  5054			goto retry;
  5055	
  5056		/*
  5057		 * It doesn't make any sense to retry for the compaction if the order-0
  5058		 * reclaim is not able to make any progress because the current
  5059		 * implementation of the compaction depends on the sufficient amount
  5060		 * of free memory (see __compaction_suitable)
  5061		 */
  5062		if (did_some_progress > 0 &&
  5063				should_compact_retry(ac, order, alloc_flags,
  5064					compact_result, &compact_priority,
  5065					&compaction_retries))
  5066			goto retry;
  5067	
  5068	
  5069		/* Deal with possible cpuset update races before we start OOM killing */
  5070		if (check_retry_cpuset(cpuset_mems_cookie, ac))
  5071			goto retry_cpuset;
  5072	
  5073		/* Reclaim has failed us, start killing things */
  5074		page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
  5075		if (page)
  5076			goto got_pg;
  5077	
  5078		/* Avoid allocations with no watermarks from looping endlessly */
  5079		if (tsk_is_oom_victim(current) &&
  5080		    (alloc_flags & ALLOC_OOM ||
  5081		     (gfp_mask & __GFP_NOMEMALLOC)))
  5082			goto nopage;
  5083	
  5084		/* Retry as long as the OOM killer is making progress */
  5085		if (did_some_progress) {
  5086			no_progress_loops = 0;
  5087			goto retry;
  5088		}
  5089	
  5090	nopage:
  5091		/* Deal with possible cpuset update races before we fail */
  5092		if (check_retry_cpuset(cpuset_mems_cookie, ac))
  5093			goto retry_cpuset;
  5094	
  5095		/*
  5096		 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
  5097		 * we always retry
  5098		 */
  5099		if (gfp_mask & __GFP_NOFAIL) {
  5100			/*
  5101			 * All existing users of the __GFP_NOFAIL are blockable, so warn
  5102			 * of any new users that actually require GFP_NOWAIT
  5103			 */
  5104			if (WARN_ON_ONCE(!can_direct_reclaim))
  5105				goto fail;
  5106	
  5107			/*
  5108			 * PF_MEMALLOC request from this context is rather bizarre
  5109			 * because we cannot reclaim anything and only can loop waiting
  5110			 * for somebody to do a work for us
  5111			 */
  5112			WARN_ON_ONCE(current->flags & PF_MEMALLOC);
  5113	
  5114			/*
  5115			 * non failing costly orders are a hard requirement which we
  5116			 * are not prepared for much so let's warn about these users
  5117			 * so that we can identify them and convert them to something
  5118			 * else.
  5119			 */
  5120			WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
  5121	
  5122			/*
  5123			 * Help non-failing allocations by giving them access to memory
  5124			 * reserves but do not use ALLOC_NO_WATERMARKS because this
  5125			 * could deplete whole memory reserves which would just make
  5126			 * the situation worse
  5127			 */
  5128			page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
  5129			if (page)
  5130				goto got_pg;
  5131	
  5132			cond_resched();
  5133			goto retry;
  5134		}
  5135	fail:
  5136		warn_alloc(gfp_mask, ac->nodemask,
  5137				"page allocation failure: order:%u", order);
  5138	got_pg:
  5139		return page;
  5140	}
  5141	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

Feng Tang Sept. 13, 2021, 11:34 a.m. UTC | #4
On Mon, Sep 13, 2021 at 11:15:54AM +0200, Michal Hocko wrote:
[...] 
> > +/*
> > + * This will get enabled whenever a cpuset configuration is considered
> > + * unsupportable in general. E.g. movable only node which cannot satisfy
> > + * any non movable allocations (see update_nodemask). Page allocator
> > + * needs to make additional checks for those configurations and this
> > + * check is meant to guard those checks without any overhead for sane
> > + * configurations.
> > + */
> > +static inline bool cpusets_insane_config(void)
> > +{
> > +	return static_branch_unlikely(&cpusets_insane_config_key);
> > +}
> > +
> >  extern int cpuset_init(void);
> >  extern void cpuset_init_smp(void);
> >  extern void cpuset_force_rebuild(void);
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 6a1d79d..b69b871 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -1220,6 +1220,18 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
> >  #define for_each_zone_zonelist(zone, z, zlist, highidx) \
> >  	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
> >  
> > +/* Whether the 'nodes' are all movable nodes */
> > +static inline bool movable_only_nodes(nodemask_t *nodes)
> > +{
> > +	struct zonelist *zonelist;
> > +	struct zoneref *z;
> > +
> > +	zonelist = &(first_online_pgdat())->node_zonelists[ZONELIST_FALLBACK];
> 
> This will work, but it begs the question of why you haven't chosen a
> node from the given nodemask. So I believe it would be easier to read
> if you did
> 	zonelist = NODE_DATA(first_node(nodes))->node_zonelists[ZONELIST_FALLBACK]

This was also my first try to get the 'zonelist', but as can be seen
in update_nodemask(), the nodemask could be NULL.

	/*
	 * An empty mems_allowed is ok iff there are no tasks in the cpuset.
	 * Since nodelist_parse() fails on an empty mask, we special case
	 * that parsing.  The validate_change() call ensures that cpusets
	 * with tasks have memory.
	 */
	if (!*buf) {
		nodes_clear(trialcs->mems_allowed);
	
> > +	z = first_zones_zonelist(zonelist, ZONE_NORMAL,	nodes);
> > +	return (!z->zone) ? true : false;
> > +}
> > +
> > +
> >  #ifdef CONFIG_SPARSEMEM
> >  #include <asm/sparsemem.h>
> >  #endif
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index df1ccf4..03eb40c 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -69,6 +69,13 @@
> >  DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
> >  DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
> >  
> > +/*
> > + * There could be abnormal cpuset configurations for cpu or memory
> > + * node binding, add this key to provide a quick low-cost judgement
> > + * of the situation.
> > + */
> > +DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key);
> > +
> >  /* See "Frequency meter" comments, below. */
> >  
> >  struct fmeter {
> > @@ -1868,6 +1875,13 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
> >  	if (retval < 0)
> >  		goto done;
> >  
> > +	if (movable_only_nodes(&trialcs->mems_allowed)) {
> 
> You can skip the check if cpusets_insane_config(). The question is
> whether you want to report all potential users or only the first one.

Yes, I missed that, will add this check
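
Something like the following, so both the zonelist walk and the pr_info()
are skipped once the key is already enabled, and only the first offending
cpuset gets reported (sketch, untested):

	if (!cpusets_insane_config() &&
	    movable_only_nodes(&trialcs->mems_allowed)) {
		static_branch_enable(&cpusets_insane_config_key);
		pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! "
			"Cpuset allocations might fail even with a lot of memory available.\n",
			nodemask_pr_args(&trialcs->mems_allowed));
	}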

> > +		static_branch_enable(&cpusets_insane_config_key);
> > +		pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! "
> > +			"Cpuset allocations might fail even with a lot of memory available.\n",
> > +			nodemask_pr_args(&trialcs->mems_allowed));
> > +	}
> > +
> >  	spin_lock_irq(&callback_lock);
> >  	cs->mems_allowed = trialcs->mems_allowed;
> >  	spin_unlock_irq(&callback_lock);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index b37435c..a7e0854 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4914,6 +4914,19 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	if (!ac->preferred_zoneref->zone)
> >  		goto nopage;
> >  
> > +	/*
> > +	 * Check for insane configurations where the cpuset doesn't contain
> > +	 * any suitable zone to satisfy the request - e.g. non-movable
> > +	 * GFP_HIGHUSER allocations from MOVABLE nodes only.
> > +	 */
> > +	if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
> > +		struct zoneref *z = first_zones_zonelist(ac->zonelist,
> > +					ac->highest_zoneidx,
> > +					&cpuset_current_mems_allowed);
> > +		if (!z->zone)
> > +			goto nopage;
> > +	}
> > +
> >  	if (alloc_flags & ALLOC_KSWAPD)
> >  		wake_all_kswapds(order, gfp_mask, ac);
> 
> The rest looks sensible to me.

Thanks for the review!

- Feng

> -- 
> Michal Hocko
> SUSE Labs

Michal Hocko Sept. 13, 2021, 11:45 a.m. UTC | #5
On Mon 13-09-21 19:34:23, Feng Tang wrote:
> On Mon, Sep 13, 2021 at 11:15:54AM +0200, Michal Hocko wrote:
[...] 
> > > +/* Whether the 'nodes' are all movable nodes */
> > > +static inline bool movable_only_nodes(nodemask_t *nodes)
> > > +{
> > > +	struct zonelist *zonelist;
> > > +	struct zoneref *z;
> > > +
> > > +	zonelist = &(first_online_pgdat())->node_zonelists[ZONELIST_FALLBACK];
> > 
> > This will work, but it begs the question of why you haven't chosen a
> > node from the given nodemask. So I believe it would be easier to read
> > if you did
> > 	zonelist = NODE_DATA(first_node(nodes))->node_zonelists[ZONELIST_FALLBACK]
> 
> This was also my first try to get the 'zonelist', but as can be seen
> in update_nodemask(), the nodemask could be NULL.

I guess you meant to say s@NULL@empty@
While this complicates things a bit, it is nothing really hard to work
around. You simply check for nodes_empty() and return false, because
such a nodemask cannot, by definition, be movable-only.

Feng Tang Sept. 13, 2021, 12:12 p.m. UTC | #6
On Mon, Sep 13, 2021 at 01:45:51PM +0200, Michal Hocko wrote:
> On Mon 13-09-21 19:34:23, Feng Tang wrote:
> > On Mon, Sep 13, 2021 at 11:15:54AM +0200, Michal Hocko wrote:
> [...] 
> > > > +/* Whether the 'nodes' are all movable nodes */
> > > > +static inline bool movable_only_nodes(nodemask_t *nodes)
> > > > +{
> > > > +	struct zonelist *zonelist;
> > > > +	struct zoneref *z;
> > > > +
> > > > +	zonelist = &(first_online_pgdat())->node_zonelists[ZONELIST_FALLBACK];
> > > 
> > > This will work, but it begs the question of why you haven't chosen a
> > > node from the given nodemask. So I believe it would be easier to read
> > > if you did
> > > 	zonelist = NODE_DATA(first_node(nodes))->node_zonelists[ZONELIST_FALLBACK]
> > 
> > This was also my first try to get the 'zonelist', but as can be seen
> > in update_nodemask(), the nodemask could be NULL.
> 
> I guess you meant to say s@NULL@empty@
> While this complicates things a bit, it is nothing really hard to work
> around. You simply check for nodes_empty() and return false, because
> such a nodemask cannot, by definition, be movable-only.

Yes, a nodes_empty() check can solve it, thanks,
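
The helper would then look roughly like this (sketch for the next
version, untested):

	/* Whether the 'nodes' are all movable nodes */
	static inline bool movable_only_nodes(nodemask_t *nodes)
	{
		struct zonelist *zonelist;
		struct zoneref *z;
		int nid;

		/* An empty nodemask cannot be movable-only by definition */
		if (nodes_empty(*nodes))
			return false;

		/*
		 * Any node from the mask works here, since
		 * first_zones_zonelist() restricts the walk to 'nodes'
		 * anyway; we only need to know whether some zone at or
		 * below ZONE_NORMAL is present in the mask.
		 */
		nid = first_node(*nodes);
		zonelist = &NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK];
		z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes);
		return !z->zone;
	}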

- Feng

> -- 
> Michal Hocko
> SUSE Labs

Patch

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index d2b9c41..403ccf9 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -34,6 +34,8 @@ 
  */
 extern struct static_key_false cpusets_pre_enable_key;
 extern struct static_key_false cpusets_enabled_key;
+extern struct static_key_false cpusets_insane_config_key;
+
 static inline bool cpusets_enabled(void)
 {
 	return static_branch_unlikely(&cpusets_enabled_key);
@@ -51,6 +53,19 @@  static inline void cpuset_dec(void)
 	static_branch_dec_cpuslocked(&cpusets_pre_enable_key);
 }
 
+/*
+ * This will get enabled whenever a cpuset configuration is considered
+ * unsupportable in general. E.g. movable only node which cannot satisfy
+ * any non movable allocations (see update_nodemask). Page allocator
+ * needs to make additional checks for those configurations and this
+ * check is meant to guard those checks without any overhead for sane
+ * configurations.
+ */
+static inline bool cpusets_insane_config(void)
+{
+	return static_branch_unlikely(&cpusets_insane_config_key);
+}
+
 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6a1d79d..b69b871 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1220,6 +1220,18 @@  static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 #define for_each_zone_zonelist(zone, z, zlist, highidx) \
 	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
 
+/* Whether the 'nodes' are all movable nodes */
+static inline bool movable_only_nodes(nodemask_t *nodes)
+{
+	struct zonelist *zonelist;
+	struct zoneref *z;
+
+	zonelist = &(first_online_pgdat())->node_zonelists[ZONELIST_FALLBACK];
+	z = first_zones_zonelist(zonelist, ZONE_NORMAL,	nodes);
+	return (!z->zone) ? true : false;
+}
+
+
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
 #endif
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index df1ccf4..03eb40c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -69,6 +69,13 @@ 
 DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
 DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
 
+/*
+ * There could be abnormal cpuset configurations for cpu or memory
+ * node binding, add this key to provide a quick low-cost judgement
+ * of the situation.
+ */
+DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key);
+
 /* See "Frequency meter" comments, below. */
 
 struct fmeter {
@@ -1868,6 +1875,13 @@  static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		goto done;
 
+	if (movable_only_nodes(&trialcs->mems_allowed)) {
+		static_branch_enable(&cpusets_insane_config_key);
+		pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! "
+			"Cpuset allocations might fail even with a lot of memory available.\n",
+			nodemask_pr_args(&trialcs->mems_allowed));
+	}
+
 	spin_lock_irq(&callback_lock);
 	cs->mems_allowed = trialcs->mems_allowed;
 	spin_unlock_irq(&callback_lock);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b37435c..a7e0854 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4914,6 +4914,19 @@  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (!ac->preferred_zoneref->zone)
 		goto nopage;
 
+	/*
+	 * Check for insane configurations where the cpuset doesn't contain
+	 * any suitable zone to satisfy the request - e.g. non-movable
+	 * GFP_HIGHUSER allocations from MOVABLE nodes only.
+	 */
+	if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
+		struct zoneref *z = first_zones_zonelist(ac->zonelist,
+					ac->highest_zoneidx,
+					&cpuset_current_mems_allowed);
+		if (!z->zone)
+			goto nopage;
+	}
+
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);