Message ID | 1631590828-25565-1-git-send-email-feng.tang@intel.com (mailing list archive)
---|---
State | New
Series | [v3] mm/page_alloc: detect allocation forbidden by cpuset and bail out early
On 9/14/21 05:40, Feng Tang wrote: > There was report that starting an Ubuntu in docker while using cpuset > to bind it to movable nodes (a node only has movable zone, like a node > for hotplug or a Persistent Memory node in normal usage) will fail > due to memory allocation failure, and then OOM is involved and many > other innocent processes got killed. It can be reproduced with command: > $docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c > "grep Mems_allowed /proc/self/status" (node 4 is a movable node) > > runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0 > CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G W I E 5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased) > Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020 > Call Trace: > dump_stack+0x6b/0x88 > dump_header+0x4a/0x1e2 > oom_kill_process.cold+0xb/0x10 > out_of_memory.part.0+0xaf/0x230 > out_of_memory+0x3d/0x80 > __alloc_pages_slowpath.constprop.0+0x954/0xa20 > __alloc_pages_nodemask+0x2d3/0x300 > pipe_write+0x322/0x590 > new_sync_write+0x196/0x1b0 > vfs_write+0x1c3/0x1f0 > ksys_write+0xa7/0xe0 > do_syscall_64+0x52/0xd0 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > Mem-Info: > active_anon:392832 inactive_anon:182 isolated_anon:0 > active_file:68130 inactive_file:151527 isolated_file:0 > unevictable:2701 dirty:0 writeback:7 > slab_reclaimable:51418 slab_unreclaimable:116300 > mapped:45825 shmem:735 pagetables:2540 bounce:0 > free:159849484 free_pcp:73 free_cma:0 > Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? 
no > Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB > lowmem_reserve[]: 0 0 0 0 0 > Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB > > oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0 > Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0 > oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB > oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB > oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB > oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB > oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB > > The reason is, in the case, the target cpuset nodes only have movable > zone, while the creation of an OS in docker sometimes needs to allocate > memory in non-movable zones (dma/dma32/normal) like GFP_HIGHUSER, and > the cpuset limit forbids the allocation, then out-of-memory killing is > involved even when normal nodes and movable nodes both have many free > memory. > > The OOM killer cannot help to resolve the situation as there is no > usable memory for the request in the cpuset scope. 
The only reasonable > measure to take is to fail the allocation right away and have the caller > to deal with it. > > So add a check for cases like this in the slowpath of allocation, and > bail out early returning NULL for the allocation. > > As page allocation is one of the hottest path in kernel, this check > will hurt all users with sane cpuset configuration, add a static branch > check and detect the abnormal config in cpuset memory binding setup so > that the extra check in page allocation is not paid by everyone. > > [thanks to Micho Hocko and David Rientjes for suggesting not handle > it inside OOM code, adding cpuset check, refining comments] > > Suggested-by: Michal Hocko <mhocko@suse.com> > Signed-off-by: Feng Tang <feng.tang@intel.com> ... > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 6a1d79d..a455333 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -1220,6 +1220,22 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist, > #define for_each_zone_zonelist(zone, z, zlist, highidx) \ > for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL) > > +/* Whether the 'nodes' are all movable nodes */ > +static inline bool movable_only_nodes(nodemask_t *nodes) > +{ > + struct zonelist *zonelist; > + struct zoneref *z; > + > + if (nodes_empty(*nodes)) > + return false; > + > + zonelist = > + &NODE_DATA(first_node(*nodes))->node_zonelists[ZONELIST_FALLBACK]; > + z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes); > + return (!z->zone) ? true : false; > +} Hmm, could all that become just this? 
!nodes_intersects(&node_states[N_NORMAL_MEMORY], nodes) > + > + > #ifdef CONFIG_SPARSEMEM > #include <asm/sparsemem.h> > #endif > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > index df1ccf4..7fa633e 100644 > --- a/kernel/cgroup/cpuset.c > +++ b/kernel/cgroup/cpuset.c > @@ -69,6 +69,13 @@ > DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key); > DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key); > > +/* > + * There could be abnormal cpuset configurations for cpu or memory > + * node binding, add this key to provide a quick low-cost judgement > + * of the situation. > + */ > +DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key); > + > /* See "Frequency meter" comments, below. */ > > struct fmeter { > @@ -1868,6 +1875,14 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs, > if (retval < 0) > goto done; > > + if (!cpusets_insane_config() && > + movable_only_nodes(&trialcs->mems_allowed)) { > + static_branch_enable(&cpusets_insane_config_key); > + pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! " > + "Cpuset allocations might fail even with a lot of memory available.\n", > + nodemask_pr_args(&trialcs->mems_allowed)); > + } > + > spin_lock_irq(&callback_lock); > cs->mems_allowed = trialcs->mems_allowed; > spin_unlock_irq(&callback_lock); > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index b37435c..a7e0854 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4914,6 +4914,19 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > if (!ac->preferred_zoneref->zone) > goto nopage; > > + /* > + * Check for insane configurations where the cpuset doesn't contain > + * any suitable zone to satisfy the request - e.g. non-movable > + * GFP_HIGHUSER allocations from MOVABLE nodes only. 
> + */ > + if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) { > + struct zoneref *z = first_zones_zonelist(ac->zonelist, > + ac->highest_zoneidx, > + &cpuset_current_mems_allowed); > + if (!z->zone) > + goto nopage; > + } > + > if (alloc_flags & ALLOC_KSWAPD) > wake_all_kswapds(order, gfp_mask, ac); > >
On Tue 14-09-21 10:01:26, Vlastimil Babka wrote:
> On 9/14/21 05:40, Feng Tang wrote:
[...]
> > +/* Whether the 'nodes' are all movable nodes */
> > +static inline bool movable_only_nodes(nodemask_t *nodes)
> > +{
> > +	struct zonelist *zonelist;
> > +	struct zoneref *z;
> > +
> > +	if (nodes_empty(*nodes))
> > +		return false;
> > +
> > +	zonelist =
> > +	    &NODE_DATA(first_node(*nodes))->node_zonelists[ZONELIST_FALLBACK];
> > +	z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes);
> > +	return (!z->zone) ? true : false;
> > +}
>
> Hmm, could all that become just this?
>
> !nodes_intersects(&node_states[N_NORMAL_MEMORY], nodes)

Maybe yes, but I find the zonelist approach much easier to follow even
though the code looks more complex at first sight. It asks for an empty
zonelist for a ZONE_NORMAL request, which is quite clear from the code.
I always have to re-learn how the N_*_MEMORY node states work, TBH.
Maybe this is just me, though.
On Tue 14-09-21 11:40:28, Feng Tang wrote: > There was report that starting an Ubuntu in docker while using cpuset > to bind it to movable nodes (a node only has movable zone, like a node > for hotplug or a Persistent Memory node in normal usage) will fail > due to memory allocation failure, and then OOM is involved and many > other innocent processes got killed. It can be reproduced with command: > $docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c > "grep Mems_allowed /proc/self/status" (node 4 is a movable node) > > runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0 > CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G W I E 5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased) > Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020 > Call Trace: > dump_stack+0x6b/0x88 > dump_header+0x4a/0x1e2 > oom_kill_process.cold+0xb/0x10 > out_of_memory.part.0+0xaf/0x230 > out_of_memory+0x3d/0x80 > __alloc_pages_slowpath.constprop.0+0x954/0xa20 > __alloc_pages_nodemask+0x2d3/0x300 > pipe_write+0x322/0x590 > new_sync_write+0x196/0x1b0 > vfs_write+0x1c3/0x1f0 > ksys_write+0xa7/0xe0 > do_syscall_64+0x52/0xd0 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > Mem-Info: > active_anon:392832 inactive_anon:182 isolated_anon:0 > active_file:68130 inactive_file:151527 isolated_file:0 > unevictable:2701 dirty:0 writeback:7 > slab_reclaimable:51418 slab_unreclaimable:116300 > mapped:45825 shmem:735 pagetables:2540 bounce:0 > free:159849484 free_pcp:73 free_cma:0 > Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? 
no > Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB > lowmem_reserve[]: 0 0 0 0 0 > Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB > > oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0 > Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0 > oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB > oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB > oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB > oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB > oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB > > The reason is, in the case, the target cpuset nodes only have movable > zone, while the creation of an OS in docker sometimes needs to allocate > memory in non-movable zones (dma/dma32/normal) like GFP_HIGHUSER, and > the cpuset limit forbids the allocation, then out-of-memory killing is > involved even when normal nodes and movable nodes both have many free > memory. > > The OOM killer cannot help to resolve the situation as there is no > usable memory for the request in the cpuset scope. 
The only reasonable > measure to take is to fail the allocation right away and have the caller > to deal with it. > > So add a check for cases like this in the slowpath of allocation, and > bail out early returning NULL for the allocation. > > As page allocation is one of the hottest path in kernel, this check > will hurt all users with sane cpuset configuration, add a static branch > check and detect the abnormal config in cpuset memory binding setup so > that the extra check in page allocation is not paid by everyone. > > [thanks to Micho Hocko and David Rientjes for suggesting not handle > it inside OOM code, adding cpuset check, refining comments] > > Suggested-by: Michal Hocko <mhocko@suse.com> > Signed-off-by: Feng Tang <feng.tang@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Minor nit below [...] > +/* Whether the 'nodes' are all movable nodes */ > +static inline bool movable_only_nodes(nodemask_t *nodes) > +{ > + struct zonelist *zonelist; > + struct zoneref *z; > + > + if (nodes_empty(*nodes)) > + return false; > + > + zonelist = > + &NODE_DATA(first_node(*nodes))->node_zonelists[ZONELIST_FALLBACK]; > + z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes); > + return (!z->zone) ? true : false; This would read easier to me /* * We can chose arbitrary node from the nodemask to get a * zonelist as they are interlinked. We just need to find * at least one zone that can satisfy kernel allocations. */ node = NODE_DATA(first_node(*nodes)); zonelist = node_zonelist(node, GFP_KERNEL); z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes);
On Tue, 14 Sep 2021, Feng Tang wrote: > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h > index d2b9c41..d58e047 100644 > --- a/include/linux/cpuset.h > +++ b/include/linux/cpuset.h > @@ -34,6 +34,8 @@ > */ > extern struct static_key_false cpusets_pre_enable_key; > extern struct static_key_false cpusets_enabled_key; > +extern struct static_key_false cpusets_insane_config_key; > + > static inline bool cpusets_enabled(void) > { > return static_branch_unlikely(&cpusets_enabled_key); > @@ -51,6 +53,19 @@ static inline void cpuset_dec(void) > static_branch_dec_cpuslocked(&cpusets_pre_enable_key); > } > > +/* > + * This will get enabled whenever a cpuset configuration is considered > + * unsupportable in general. E.g. movable only node which cannot satisfy > + * any non movable allocations (see update_nodemask). Page allocator > + * needs to make additional checks for those configurations and this > + * check is meant to guard those checks without any overhead for sane > + * configurations. 
> + */ > +static inline bool cpusets_insane_config(void) > +{ > + return static_branch_unlikely(&cpusets_insane_config_key); > +} > + > extern int cpuset_init(void); > extern void cpuset_init_smp(void); > extern void cpuset_force_rebuild(void); > @@ -167,6 +182,8 @@ static inline void set_mems_allowed(nodemask_t nodemask) > > static inline bool cpusets_enabled(void) { return false; } > > +static inline bool cpusets_insane_config(void) { return false; } > + > static inline int cpuset_init(void) { return 0; } > static inline void cpuset_init_smp(void) {} > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 6a1d79d..a455333 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -1220,6 +1220,22 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist, > #define for_each_zone_zonelist(zone, z, zlist, highidx) \ > for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL) > > +/* Whether the 'nodes' are all movable nodes */ > +static inline bool movable_only_nodes(nodemask_t *nodes) > +{ > + struct zonelist *zonelist; > + struct zoneref *z; > + > + if (nodes_empty(*nodes)) > + return false; > + > + zonelist = > + &NODE_DATA(first_node(*nodes))->node_zonelists[ZONELIST_FALLBACK]; > + z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes); > + return (!z->zone) ? true : false; > +} > + > + > #ifdef CONFIG_SPARSEMEM > #include <asm/sparsemem.h> > #endif > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > index df1ccf4..7fa633e 100644 > --- a/kernel/cgroup/cpuset.c > +++ b/kernel/cgroup/cpuset.c > @@ -69,6 +69,13 @@ > DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key); > DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key); > > +/* > + * There could be abnormal cpuset configurations for cpu or memory > + * node binding, add this key to provide a quick low-cost judgement > + * of the situation. > + */ > +DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key); > + > /* See "Frequency meter" comments, below. 
*/ > > struct fmeter { > @@ -1868,6 +1875,14 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs, > if (retval < 0) > goto done; > > + if (!cpusets_insane_config() && > + movable_only_nodes(&trialcs->mems_allowed)) { > + static_branch_enable(&cpusets_insane_config_key); > + pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! " > + "Cpuset allocations might fail even with a lot of memory available.\n", > + nodemask_pr_args(&trialcs->mems_allowed)); > + } > + > spin_lock_irq(&callback_lock); > cs->mems_allowed = trialcs->mems_allowed; > spin_unlock_irq(&callback_lock); Is this the only time that the state of the nodemask may change? I'm wondering about a single node nodemask, for example, where all ZONE_NORMAL memory is hot-removed. > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index b37435c..a7e0854 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4914,6 +4914,19 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > if (!ac->preferred_zoneref->zone) > goto nopage; > > + /* > + * Check for insane configurations where the cpuset doesn't contain > + * any suitable zone to satisfy the request - e.g. non-movable > + * GFP_HIGHUSER allocations from MOVABLE nodes only. > + */ > + if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) { > + struct zoneref *z = first_zones_zonelist(ac->zonelist, > + ac->highest_zoneidx, > + &cpuset_current_mems_allowed); > + if (!z->zone) > + goto nopage; > + } > + > if (alloc_flags & ALLOC_KSWAPD) > wake_all_kswapds(order, gfp_mask, ac); >
On Tue, Sep 14, 2021 at 05:30:03PM -0700, David Rientjes wrote: > On Tue, 14 Sep 2021, Feng Tang wrote: > > > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h > > index d2b9c41..d58e047 100644 > > --- a/include/linux/cpuset.h > > +++ b/include/linux/cpuset.h > > @@ -34,6 +34,8 @@ > > */ > > extern struct static_key_false cpusets_pre_enable_key; > > extern struct static_key_false cpusets_enabled_key; > > +extern struct static_key_false cpusets_insane_config_key; > > + > > static inline bool cpusets_enabled(void) > > { > > return static_branch_unlikely(&cpusets_enabled_key); > > @@ -51,6 +53,19 @@ static inline void cpuset_dec(void) > > static_branch_dec_cpuslocked(&cpusets_pre_enable_key); > > } > > > > +/* > > + * This will get enabled whenever a cpuset configuration is considered > > + * unsupportable in general. E.g. movable only node which cannot satisfy > > + * any non movable allocations (see update_nodemask). Page allocator > > + * needs to make additional checks for those configurations and this > > + * check is meant to guard those checks without any overhead for sane > > + * configurations. 
> > + */ > > +static inline bool cpusets_insane_config(void) > > +{ > > + return static_branch_unlikely(&cpusets_insane_config_key); > > +} > > + > > extern int cpuset_init(void); > > extern void cpuset_init_smp(void); > > extern void cpuset_force_rebuild(void); > > @@ -167,6 +182,8 @@ static inline void set_mems_allowed(nodemask_t nodemask) > > > > static inline bool cpusets_enabled(void) { return false; } > > > > +static inline bool cpusets_insane_config(void) { return false; } > > + > > static inline int cpuset_init(void) { return 0; } > > static inline void cpuset_init_smp(void) {} > > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > index 6a1d79d..a455333 100644 > > --- a/include/linux/mmzone.h > > +++ b/include/linux/mmzone.h > > @@ -1220,6 +1220,22 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist, > > #define for_each_zone_zonelist(zone, z, zlist, highidx) \ > > for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL) > > > > +/* Whether the 'nodes' are all movable nodes */ > > +static inline bool movable_only_nodes(nodemask_t *nodes) > > +{ > > + struct zonelist *zonelist; > > + struct zoneref *z; > > + > > + if (nodes_empty(*nodes)) > > + return false; > > + > > + zonelist = > > + &NODE_DATA(first_node(*nodes))->node_zonelists[ZONELIST_FALLBACK]; > > + z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes); > > + return (!z->zone) ? true : false; > > +} > > + > > + > > #ifdef CONFIG_SPARSEMEM > > #include <asm/sparsemem.h> > > #endif > > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > > index df1ccf4..7fa633e 100644 > > --- a/kernel/cgroup/cpuset.c > > +++ b/kernel/cgroup/cpuset.c > > @@ -69,6 +69,13 @@ > > DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key); > > DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key); > > > > +/* > > + * There could be abnormal cpuset configurations for cpu or memory > > + * node binding, add this key to provide a quick low-cost judgement > > + * of the situation. 
> > + */ > > +DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key); > > + > > /* See "Frequency meter" comments, below. */ > > > > struct fmeter { > > @@ -1868,6 +1875,14 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs, > > if (retval < 0) > > goto done; > > > > + if (!cpusets_insane_config() && > > + movable_only_nodes(&trialcs->mems_allowed)) { > > + static_branch_enable(&cpusets_insane_config_key); > > + pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! " > > + "Cpuset allocations might fail even with a lot of memory available.\n", > > + nodemask_pr_args(&trialcs->mems_allowed)); > > + } > > + > > spin_lock_irq(&callback_lock); > > cs->mems_allowed = trialcs->mems_allowed; > > spin_unlock_irq(&callback_lock); > > Is this the only time that the state of the nodemask may change? > > I'm wondering about a single node nodemask, for example, where all > ZONE_NORMAL memory is hot-removed. Thanks for the reminding! Yes, memory hot remove can change the cpuset's effective nodemask, we may need to add similar check inside cpuset_hotplug_update_tasks() which is called by cpuset_hotplug_workfn(), something like below? diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 7fa633e..d5f6776 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -3186,6 +3186,14 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp) cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus); mems_updated = !nodes_equal(new_mems, cs->effective_mems); + if (mems_updated && !cpusets_insane_config() && + movable_only_nodes(new_mems)) { + static_branch_enable(&cpusets_insane_config_key); + pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl) after memory hotplug." 
+		     "Cpuset allocations might fail even with a lot of memory available.\n",
+		     nodemask_pr_args(new_mems));
+	}
+
 	if (is_in_v2_mode())
 		hotplug_update_tasks(cs, &new_cpus, &new_mems,
 				     cpus_updated, mems_updated);

Thanks,
Feng
On Wed 15-09-21 13:32:47, Feng Tang wrote: > On Tue, Sep 14, 2021 at 05:30:03PM -0700, David Rientjes wrote: [...] > > I'm wondering about a single node nodemask, for example, where all > > ZONE_NORMAL memory is hot-removed. While this is theoretically possible it is highly unlikely to happen. Non movable memory just takes one kernel allocation to prevent any hotremove operation to finish. I have to say I was not aware of the hotplug callback. It all seems rather suspicious. I will have a look. Anyway something worth having covered "just in case". Thanks for pointing it out. > Thanks for the reminding! Yes, memory hot remove can change the > cpuset's effective nodemask, we may need to add similar check inside > cpuset_hotplug_update_tasks() which is called by cpuset_hotplug_workfn(), > something like below? > > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > index 7fa633e..d5f6776 100644 > --- a/kernel/cgroup/cpuset.c > +++ b/kernel/cgroup/cpuset.c > @@ -3186,6 +3186,14 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp) > cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus); > mems_updated = !nodes_equal(new_mems, cs->effective_mems); > > + if (mems_updated && !cpusets_insane_config() && > + movable_only_nodes(new_mems)) { > + static_branch_enable(&cpusets_insane_config_key); > + pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl) after memory hotplug." > + "Cpuset allocations might fail even with a lot of memory available.\n", > + nodemask_pr_args(new_mems); > + } Please create a helper rather than two copies of the same. Thanks! > + > if (is_in_v2_mode()) > hotplug_update_tasks(cs, &new_cpus, &new_mems, > cpus_updated, mems_updated); > > Thanks, > Feng
On Wed, Sep 15, 2021 at 01:30:27PM +0200, Michal Hocko wrote: > On Wed 15-09-21 13:32:47, Feng Tang wrote: > > On Tue, Sep 14, 2021 at 05:30:03PM -0700, David Rientjes wrote: > [...] > > > I'm wondering about a single node nodemask, for example, where all > > > ZONE_NORMAL memory is hot-removed. > > While this is theoretically possible it is highly unlikely to happen. > Non movable memory just takes one kernel allocation to prevent any > hotremove operation to finish. I have to say I was not aware of the > hotplug callback. It all seems rather suspicious. I will have a look. > > Anyway something worth having covered "just in case". Thanks for > pointing it out. > > > Thanks for the reminding! Yes, memory hot remove can change the > > cpuset's effective nodemask, we may need to add similar check inside > > cpuset_hotplug_update_tasks() which is called by cpuset_hotplug_workfn(), > > something like below? > > > > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > > index 7fa633e..d5f6776 100644 > > --- a/kernel/cgroup/cpuset.c > > +++ b/kernel/cgroup/cpuset.c > > @@ -3186,6 +3186,14 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp) > > cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus); > > mems_updated = !nodes_equal(new_mems, cs->effective_mems); > > > > + if (mems_updated && !cpusets_insane_config() && > > + movable_only_nodes(new_mems)) { > > + static_branch_enable(&cpusets_insane_config_key); > > + pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl) after memory hotplug." > > + "Cpuset allocations might fail even with a lot of memory available.\n", > > + nodemask_pr_args(new_mems); > > + } > > Please create a helper rather than two copies of the same. Thanks! Sure. Some draft add-on patch below. 
Thanks,
Feng

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 7fa633e..3bb9f4ea 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -391,6 +391,18 @@ static inline bool is_in_v2_mode(void)
 		(cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
 }

+static inline void check_insane_mems_config(nodemask_t *nodes)
+{
+	if (!cpusets_insane_config() &&
+	    movable_only_nodes(nodes)) {
+		static_branch_enable(&cpusets_insane_config_key);
+		pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! "
+			"Cpuset allocations might fail even with a lot of memory available.\n",
+			nodemask_pr_args(nodes));
+	}
+}
+
 /*
  * Return in pmask the portion of a task's cpusets's cpus_allowed that
  * are online and are capable of running the task. If none are found,
@@ -1875,13 +1887,7 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		goto done;

-	if (!cpusets_insane_config() &&
-	    movable_only_nodes(&trialcs->mems_allowed)) {
-		static_branch_enable(&cpusets_insane_config_key);
-		pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! "
-			"Cpuset allocations might fail even with a lot of memory available.\n",
-			nodemask_pr_args(&trialcs->mems_allowed));
-	}
+	check_insane_mems_config(&trialcs->mems_allowed);

 	spin_lock_irq(&callback_lock);
 	cs->mems_allowed = trialcs->mems_allowed;
@@ -3186,6 +3192,9 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 	cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
 	mems_updated = !nodes_equal(new_mems, cs->effective_mems);

+	if (mems_updated)
+		check_insane_mems_config(&new_mems);
+
 	if (is_in_v2_mode())
 		hotplug_update_tasks(cs, &new_cpus, &new_mems,
 				     cpus_updated, mems_updated);
Hi Michal, On Tue, Sep 14, 2021 at 10:50:42AM +0200, Michal Hocko wrote: > On Tue 14-09-21 11:40:28, Feng Tang wrote: [SPIN] > > The OOM killer cannot help to resolve the situation as there is no > > usable memory for the request in the cpuset scope. The only reasonable > > measure to take is to fail the allocation right away and have the caller > > to deal with it. > > > > So add a check for cases like this in the slowpath of allocation, and > > bail out early returning NULL for the allocation. > > > > As page allocation is one of the hottest path in kernel, this check > > will hurt all users with sane cpuset configuration, add a static branch > > check and detect the abnormal config in cpuset memory binding setup so > > that the extra check in page allocation is not paid by everyone. > > > > [thanks to Micho Hocko and David Rientjes for suggesting not handle > > it inside OOM code, adding cpuset check, refining comments] > > > > Suggested-by: Michal Hocko <mhocko@suse.com> > > Signed-off-by: Feng Tang <feng.tang@intel.com> > > Acked-by: Michal Hocko <mhocko@suse.com> Thank you! > Minor nit below > [...] > > +/* Whether the 'nodes' are all movable nodes */ > > +static inline bool movable_only_nodes(nodemask_t *nodes) > > +{ > > + struct zonelist *zonelist; > > + struct zoneref *z; > > + > > + if (nodes_empty(*nodes)) > > + return false; > > + > > + zonelist = > > + &NODE_DATA(first_node(*nodes))->node_zonelists[ZONELIST_FALLBACK]; > > + z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes); > > + return (!z->zone) ? true : false; > > This would read easier to me > /* > * We can chose arbitrary node from the nodemask to get a > * zonelist as they are interlinked. We just need to find > * at least one zone that can satisfy kernel allocations. 
>  */
> node = NODE_DATA(first_node(*nodes));
> zonelist = node_zonelist(node, GFP_KERNEL);
> z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes);

When working on the v4 patch, I saw compile warnings that
'node_zonelist()' and 'GFP_KERNEL' are either implicit or undeclared,
as they come from "gfp.h".

So we may need to either move this function to gfp.h, or keep the
current code with a slight modification:

	nid = first_node(*nodes);
	zonelist = &NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK];
	z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes);
	return (!z->zone) ? true : false;

Thanks,
Feng
On Fri 24-09-21 14:10:54, Feng Tang wrote: > Hi Michal, > > On Tue, Sep 14, 2021 at 10:50:42AM +0200, Michal Hocko wrote: > > On Tue 14-09-21 11:40:28, Feng Tang wrote: > [SPIN] > > > The OOM killer cannot help to resolve the situation as there is no > > > usable memory for the request in the cpuset scope. The only reasonable > > > measure to take is to fail the allocation right away and have the caller > > > to deal with it. > > > > > > So add a check for cases like this in the slowpath of allocation, and > > > bail out early returning NULL for the allocation. > > > > > > As page allocation is one of the hottest path in kernel, this check > > > will hurt all users with sane cpuset configuration, add a static branch > > > check and detect the abnormal config in cpuset memory binding setup so > > > that the extra check in page allocation is not paid by everyone. > > > > > > [thanks to Micho Hocko and David Rientjes for suggesting not handle > > > it inside OOM code, adding cpuset check, refining comments] > > > > > > Suggested-by: Michal Hocko <mhocko@suse.com> > > > Signed-off-by: Feng Tang <feng.tang@intel.com> > > > > Acked-by: Michal Hocko <mhocko@suse.com> > > Thank you! > > > Minor nit below > > [...] > > > +/* Whether the 'nodes' are all movable nodes */ > > > +static inline bool movable_only_nodes(nodemask_t *nodes) > > > +{ > > > + struct zonelist *zonelist; > > > + struct zoneref *z; > > > + > > > + if (nodes_empty(*nodes)) > > > + return false; > > > + > > > + zonelist = > > > + &NODE_DATA(first_node(*nodes))->node_zonelists[ZONELIST_FALLBACK]; > > > + z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes); > > > + return (!z->zone) ? true : false; > > > > This would read easier to me > > /* > > * We can chose arbitrary node from the nodemask to get a > > * zonelist as they are interlinked. We just need to find > > * at least one zone that can satisfy kernel allocations. 
> > 	 */
> > 	node = NODE_DATA(first_node(*nodes));
> > 	zonelist = node_zonelist(node, GFP_KERNEL);
> > 	z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes);
>
> When working on the v4 patch, I see some compile warnings
> that 'node_zonelist()' and 'GFP_KERNEL' are either implicit
> or undeclared, as they are from "gfp.h".
>
> So we may need to move this function to gfp.h, or keep the
> current code with slight modification?
>
> 	nid = first_node(*nodes);
> 	zonelist = &NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK];
> 	z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes);
> 	return (!z->zone) ? true : false;

I would put it into gfp.h but I can see how this might be not really
loved there. Both ways work with me.
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index d2b9c41..d58e047 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -34,6 +34,8 @@
  */
 extern struct static_key_false cpusets_pre_enable_key;
 extern struct static_key_false cpusets_enabled_key;
+extern struct static_key_false cpusets_insane_config_key;
+
 static inline bool cpusets_enabled(void)
 {
 	return static_branch_unlikely(&cpusets_enabled_key);
@@ -51,6 +53,19 @@ static inline void cpuset_dec(void)
 	static_branch_dec_cpuslocked(&cpusets_pre_enable_key);
 }
 
+/*
+ * This will get enabled whenever a cpuset configuration is considered
+ * unsupportable in general. E.g. movable only node which cannot satisfy
+ * any non movable allocations (see update_nodemask). Page allocator
+ * needs to make additional checks for those configurations and this
+ * check is meant to guard those checks without any overhead for sane
+ * configurations.
+ */
+static inline bool cpusets_insane_config(void)
+{
+	return static_branch_unlikely(&cpusets_insane_config_key);
+}
+
 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
@@ -167,6 +182,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 
 static inline bool cpusets_enabled(void) { return false; }
 
+static inline bool cpusets_insane_config(void) { return false; }
+
 static inline int cpuset_init(void) { return 0; }
 
 static inline void cpuset_init_smp(void) {}
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6a1d79d..a455333 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1220,6 +1220,22 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 #define for_each_zone_zonelist(zone, z, zlist, highidx) \
 	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
 
+/* Whether the 'nodes' are all movable nodes */
+static inline bool movable_only_nodes(nodemask_t *nodes)
+{
+	struct zonelist *zonelist;
+	struct zoneref *z;
+
+	if (nodes_empty(*nodes))
+		return false;
+
+	zonelist =
+	    &NODE_DATA(first_node(*nodes))->node_zonelists[ZONELIST_FALLBACK];
+	z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes);
+	return (!z->zone) ? true : false;
+}
+
+
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
 #endif
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index df1ccf4..7fa633e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -69,6 +69,13 @@
 DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
 DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
 
+/*
+ * There could be abnormal cpuset configurations for cpu or memory
+ * node binding, add this key to provide a quick low-cost judgement
+ * of the situation.
+ */
+DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key);
+
 /* See "Frequency meter" comments, below. */
 
 struct fmeter {
@@ -1868,6 +1875,14 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		goto done;
 
+	if (!cpusets_insane_config() &&
+	    movable_only_nodes(&trialcs->mems_allowed)) {
+		static_branch_enable(&cpusets_insane_config_key);
+		pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! "
+			"Cpuset allocations might fail even with a lot of memory available.\n",
+			nodemask_pr_args(&trialcs->mems_allowed));
+	}
+
 	spin_lock_irq(&callback_lock);
 	cs->mems_allowed = trialcs->mems_allowed;
 	spin_unlock_irq(&callback_lock);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b37435c..a7e0854 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4914,6 +4914,19 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (!ac->preferred_zoneref->zone)
 		goto nopage;
 
+	/*
+	 * Check for insane configurations where the cpuset doesn't contain
+	 * any suitable zone to satisfy the request - e.g. non-movable
+	 * GFP_HIGHUSER allocations from MOVABLE nodes only.
+	 */
+	if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
+		struct zoneref *z = first_zones_zonelist(ac->zonelist,
+					ac->highest_zoneidx,
+					&cpuset_current_mems_allowed);
+		if (!z->zone)
+			goto nopage;
+	}
+
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
There was a report that starting Ubuntu in docker while using cpuset to
bind it to movable nodes (a node that only has a movable zone, like a
node for hotplug or a Persistent Memory node in normal usage) will fail
due to memory allocation failure; the OOM killer then gets involved and
many other innocent processes get killed. It can be reproduced with the
command:

	$ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c \
	  "grep Mems_allowed /proc/self/status"

(node 4 is a movable node)

runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G W I E 5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
Call Trace:
 dump_stack+0x6b/0x88
 dump_header+0x4a/0x1e2
 oom_kill_process.cold+0xb/0x10
 out_of_memory.part.0+0xaf/0x230
 out_of_memory+0x3d/0x80
 __alloc_pages_slowpath.constprop.0+0x954/0xa20
 __alloc_pages_nodemask+0x2d3/0x300
 pipe_write+0x322/0x590
 new_sync_write+0x196/0x1b0
 vfs_write+0x1c3/0x1f0
 ksys_write+0xa7/0xe0
 do_syscall_64+0x52/0xd0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Mem-Info:
 active_anon:392832 inactive_anon:182 isolated_anon:0
 active_file:68130 inactive_file:151527 isolated_file:0
 unevictable:2701 dirty:0 writeback:7
 slab_reclaimable:51418 slab_unreclaimable:116300
 mapped:45825 shmem:735 pagetables:2540 bounce:0
 free:159849484 free_pcp:73 free_cma:0
Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0 0
Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB
oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0
Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0
oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB
oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The reason is that in this case the target cpuset nodes only have a
movable zone, while the creation of an OS in docker sometimes needs to
allocate memory in non-movable zones (dma/dma32/normal), e.g.
GFP_HIGHUSER allocations, and the cpuset limit forbids that. Out-of-memory
killing is then involved even though both the normal nodes and the
movable nodes have plenty of free memory.

The OOM killer cannot help to resolve the situation as there is no
usable memory for the request in the cpuset scope. The only reasonable
measure to take is to fail the allocation right away and have the
caller deal with it.
So add a check for cases like this in the slowpath of allocation, and
bail out early, returning NULL for the allocation.

As page allocation is one of the hottest paths in the kernel, this
check would hurt all users with a sane cpuset configuration, so add a
static branch check and detect the abnormal config in the cpuset memory
binding setup so that the extra check in page allocation is not paid by
everyone.

[thanks to Michal Hocko and David Rientjes for suggesting not to handle
it inside OOM code, adding the cpuset check, and refining comments]

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
Changelog:

  v3:
      * refine the movable_only_nodes() and the nodemask check in
        cpuset code (Michal Hocko)
      * fix a compiling problem (0day test robot)

  v2:
      * add a static branch detection in cpuset code to reduce the
        overhead in the allocation hotpath (Michal Hocko)

  v1 (since RFC):
      * move the handling from the oom code to the page allocation
        path (Michal/David)

 include/linux/cpuset.h | 17 +++++++++++++++++
 include/linux/mmzone.h | 16 ++++++++++++++++
 kernel/cgroup/cpuset.c | 15 +++++++++++++++
 mm/page_alloc.c        | 13 +++++++++++++
 4 files changed, 61 insertions(+)