| Message ID | 1631003150-96935-1-git-send-email-feng.tang@intel.com (mailing list archive) |
|---|---|
| State | New |
| Series | mm/page_alloc: detect allocation forbidden by cpuset and bail out early |
On Tue 07-09-21 16:25:50, Feng Tang wrote: > There was report that starting an Ubuntu in docker while using cpuset > to bind it to movlabe nodes (a node only has movable zone, like a node s@movlabe@movable@ > for hotplug or a Persistent Memory node in normal usage) will fail > due to memory allocation failure, and then OOM is involved and many > other innocent processes got killed. It can be reproduced with command: > $docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c > "grep Mems_allowed /proc/self/status" (node 4 is a movable node) > > The reason is, in the case, the target cpuset nodes only have movable > zone, while the creation of an OS in docker sometimes needs to allocate > memory in non-movable zones (dma/dma32/normal) like GFP_HIGHUSER, and > the cpuset limit forbids the allocation, then out-of-memory killing is > involved even when normal nodes and movable nodes both have many free > memory. It would be great to add a oom report here as an example. > The failure is reasonable, but still there is one problem, that when > the usage fails as it's an mission impossible due to the cpuset limit, > the allocation should just not trigger reclaim/compaction, and more > importantly, not get any innocent process oom-killed. I would reformulate to something like: " The OOM killer cannot help to resolve the situation as there is no usable memory for the request in the cpuset scope. The only reasonable measure to take is to fail the allocation right away and have the caller to deal with it. " > So add detection for cases like this in the slowpath of allocation, > and bail out early returning NULL for the allocation. > > We've run some cases of malloc/mmap/page_fault/lru-shm/swap from > will-it-scale and vm-scalability, and didn't see obvious performance > change (all inside +/- 1%), test boxes are 2 socket Cascade Lake and > Icelake servers. > > [thanks to Micho Hocko and David Rientjes for suggesting not handle > it inside OOM code] While this is a good fix from the functionality POV I believe you can go a step further. Please add a detection to the cpuset code and complain to the kernel log if somebody tries to configure movable only cpuset. Once you have that in place you can easily create a static branch for cpuset_insane_setup() and have zero overhead for all reasonable configuration. There shouldn't be any reason to pay a single cpu cycle to check for something that almost nobody does. What do you think?
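(For reference, a minimal sketch of the static-branch gating Michal is proposing, assuming illustrative names -- the actual key and helper names were only settled later in the thread. The point is that the branch costs nothing until the key is enabled, which is where the "zero overhead for all reasonable configuration" comes from.)

#include <linux/jump_label.h>
#include <linux/printk.h>

/* Illustrative names only; not the final patch. */
DEFINE_STATIC_KEY_FALSE(cpuset_movable_only_key);

/* Called from cpuset code when a movable-only mems_allowed is configured. */
static void note_movable_only_cpuset(void)
{
	static_branch_enable(&cpuset_movable_only_key);
	pr_info("cpuset: movable-only node binding configured\n");
}

/* Hot-path check: the branch is patched out until the key is enabled. */
static inline bool cpuset_insane_setup(void)
{
	return static_branch_unlikely(&cpuset_movable_only_key);
}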
On Tue, Sep 07, 2021 at 10:44:32AM +0200, Michal Hocko wrote: > On Tue 07-09-21 16:25:50, Feng Tang wrote: > > There was report that starting an Ubuntu in docker while using cpuset > > to bind it to movlabe nodes (a node only has movable zone, like a node > > s@movlabe@movable@ will change. > > for hotplug or a Persistent Memory node in normal usage) will fail > > due to memory allocation failure, and then OOM is involved and many > > other innocent processes got killed. It can be reproduced with command: > > $docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c > > "grep Mems_allowed /proc/self/status" (node 4 is a movable node) > > > > The reason is, in the case, the target cpuset nodes only have movable > > zone, while the creation of an OS in docker sometimes needs to allocate > > memory in non-movable zones (dma/dma32/normal) like GFP_HIGHUSER, and > > the cpuset limit forbids the allocation, then out-of-memory killing is > > involved even when normal nodes and movable nodes both have many free > > memory. > > It would be great to add a oom report here as an example. Ok, will add > > The failure is reasonable, but still there is one problem, that when > > the usage fails as it's an mission impossible due to the cpuset limit, > > the allocation should just not trigger reclaim/compaction, and more > > importantly, not get any innocent process oom-killed. > > I would reformulate to something like: > " > The OOM killer cannot help to resolve the situation as there is no > usable memory for the request in the cpuset scope. The only reasonable > measure to take is to fail the allocation right away and have the caller > to deal with it. > " thanks! will use this. > > So add detection for cases like this in the slowpath of allocation, > > and bail out early returning NULL for the allocation. > > > > We've run some cases of malloc/mmap/page_fault/lru-shm/swap from > > will-it-scale and vm-scalability, and didn't see obvious performance > > change (all inside +/- 1%), test boxes are 2 socket Cascade Lake and > > Icelake servers. > > > > [thanks to Micho Hocko and David Rientjes for suggesting not handle > > it inside OOM code] > > While this is a good fix from the functionality POV I believe you can go > a step further. Please add a detection to the cpuset code and complain > to the kernel log if somebody tries to configure movable only cpuset. > Once you have that in place you can easily create a static branch for > cpuset_insane_setup() and have zero overhead for all reasonable > configuration. There shouldn't be any reason to pay a single cpu cycle > to check for something that almost nobody does. > > What do you think? I thought about the implementation, IIUC, the static_branch_enable() is easy, it could be done when cpuset.mems is set with movable only nodes, but disable() is much complexer, as we may need a global reference counter to track the set/unset, and the unset could be the time when freeing the cpuset data structure, also one cpuset.mems could be changed runtime, and system could have multiple cpuset dirs (user space usage could be creative or crazy :)). While checking cpuset code, I thought more about configuring cpuset with movable only nodes, that we may still have normal usage: mallocing a big trunk of memory and do some scientific calculation, or AI training. It works with current code. 
Using docker to start a full OS is a much more complex case: some of its memory allocations, like GFP_HIGHUSER from pipe_write() or copy_process(), are blocked by the cpuset limit. Thanks, Feng > -- > Michal Hocko > SUSE Labs
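(As a side note on why exactly those GFP_HIGHUSER allocations cannot be satisfied here, a small sketch that is not part of any patch in this thread: without __GFP_MOVABLE in the gfp mask, zone selection never reaches ZONE_MOVABLE, so a cpuset that only allows movable-only nodes leaves no eligible zone at all.)

#include <linux/gfp.h>

/* Illustrative helper: can this request be served from ZONE_MOVABLE? */
static bool request_can_use_movable_zone(gfp_t gfp_mask)
{
	/* gfp_zone() returns ZONE_MOVABLE only when __GFP_MOVABLE is set */
	return gfp_zone(gfp_mask) == ZONE_MOVABLE;
}

/*
 * request_can_use_movable_zone(GFP_HIGHUSER)         -> false
 * request_can_use_movable_zone(GFP_HIGHUSER_MOVABLE) -> true
 */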
On Wed 08-09-21 09:50:14, Feng Tang wrote: > On Tue, Sep 07, 2021 at 10:44:32AM +0200, Michal Hocko wrote: [...] > > While this is a good fix from the functionality POV I believe you can go > > a step further. Please add a detection to the cpuset code and complain > > to the kernel log if somebody tries to configure movable only cpuset. > > Once you have that in place you can easily create a static branch for > > cpuset_insane_setup() and have zero overhead for all reasonable > > configuration. There shouldn't be any reason to pay a single cpu cycle > > to check for something that almost nobody does. > > > > What do you think? > > I thought about the implementation, IIUC, the static_branch_enable() is > easy, it could be done when cpuset.mems is set with movable only nodes, > but disable() is much complexer, Do we care about disable at all? The point is to not have 99,999999% users pay overhead of the check which is irrelevant to them. Once somebody wants to use this "creative" setup then paying an extra check sounds perfectly sensible to me. If somebody cares enough then the disable logic could be implemented. But for now I believe we should be OK with only enable case. > as we may need a global reference > counter to track the set/unset, and the unset could be the time when > freeing the cpuset data structure, also one cpuset.mems could be changed > runtime, and system could have multiple cpuset dirs (user space usage > could be creative or crazy :)). > > While checking cpuset code, I thought more about configuring cpuset with > movable only nodes, that we may still have normal usage: mallocing a big > trunk of memory and do some scientific calculation, or AI training. It > works with current code. It might work but it would be inherently subtle because a single non-movable allocation will throw the whole thing off the cliff. I do not think we want to even pretend we support such a setup.
On Wed, Sep 08, 2021 at 09:06:24AM +0200, Michal Hocko wrote: > On Wed 08-09-21 09:50:14, Feng Tang wrote: > > On Tue, Sep 07, 2021 at 10:44:32AM +0200, Michal Hocko wrote: > [...] > > > While this is a good fix from the functionality POV I believe you can go > > > a step further. Please add a detection to the cpuset code and complain > > > to the kernel log if somebody tries to configure movable only cpuset. > > > Once you have that in place you can easily create a static branch for > > > cpuset_insane_setup() and have zero overhead for all reasonable > > > configuration. There shouldn't be any reason to pay a single cpu cycle > > > to check for something that almost nobody does. > > > > > > What do you think? > > > > I thought about the implementation, IIUC, the static_branch_enable() is > > easy, it could be done when cpuset.mems is set with movable only nodes, > > but disable() is much complexer, > > Do we care about disable at all? The point is to not have 99,999999% > users pay overhead of the check which is irrelevant to them. Once > somebody wants to use this "creative" setup then paying an extra check > sounds perfectly sensible to me. If somebody cares enough then the > disable logic could be implemented. But for now I believe we should be > OK with only enable case. Makes sense to me, thanks! > > as we may need a global reference > > counter to track the set/unset, and the unset could be the time when > > freeing the cpuset data structure, also one cpuset.mems could be changed > > runtime, and system could have multiple cpuset dirs (user space usage > > could be creative or crazy :)). > > > > While checking cpuset code, I thought more about configuring cpuset with > > movable only nodes, that we may still have normal usage: mallocing a big > > trunk of memory and do some scientific calculation, or AI training. It > > works with current code. > > It might work but it would be inherently subtle because a single > non-movable allocation will throw the whole thing off the cliff. Yes, this is a valid concern. Though I think when there is really usage reuqirement for cpuset binding to HBM (High Bandwidth Memory) or PMEM, we may need to reconsider to loose the limit for GFP_HIGHUSER, as the GFP_KERNEL type of allocation is permitted. Thanks, Feng > I do not think we want to even pretend we support such a setup. > -- > Michal Hocko > SUSE Labs
On Wed, Sep 08, 2021 at 09:06:24AM +0200, Michal Hocko wrote: > On Wed 08-09-21 09:50:14, Feng Tang wrote: > > On Tue, Sep 07, 2021 at 10:44:32AM +0200, Michal Hocko wrote: > [...] > > > While this is a good fix from the functionality POV I believe you can go > > > a step further. Please add a detection to the cpuset code and complain > > > to the kernel log if somebody tries to configure movable only cpuset. > > > Once you have that in place you can easily create a static branch for > > > cpuset_insane_setup() and have zero overhead for all reasonable > > > configuration. There shouldn't be any reason to pay a single cpu cycle > > > to check for something that almost nobody does. > > > > > > What do you think? > > > > I thought about the implementation, IIUC, the static_branch_enable() is > > easy, it could be done when cpuset.mems is set with movable only nodes, > > but disable() is much complexer, > > Do we care about disable at all? The point is to not have 99,999999% > users pay overhead of the check which is irrelevant to them. Once > somebody wants to use this "creative" setup then paying an extra check > sounds perfectly sensible to me. If somebody cares enough then the > disable logic could be implemented. But for now I believe we should be > OK with only enable case. Here is tested draft patch to add the check in cpuset code (the looping zone code could be improved by adding a for_each_populated_zone_nodemask macro. Thanks, Feng --- include/linux/cpuset.h | 7 +++++++ include/linux/mmzone.h | 14 ++++++++++++++ kernel/cgroup/cpuset.c | 10 ++++++++++ mm/page_alloc.c | 4 +++- 4 files changed, 34 insertions(+), 1 deletion(-) diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index d2b9c41..a434985 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -34,6 +34,8 @@ */ extern struct static_key_false cpusets_pre_enable_key; extern struct static_key_false cpusets_enabled_key; +extern struct static_key_false cpusets_abnormal_setup_key; + static inline bool cpusets_enabled(void) { return static_branch_unlikely(&cpusets_enabled_key); @@ -51,6 +53,11 @@ static inline void cpuset_dec(void) static_branch_dec_cpuslocked(&cpusets_pre_enable_key); } +static inline bool cpusets_abnormal_check_needed(void) +{ + return static_branch_unlikely(&cpusets_abnormal_setup_key); +} + extern int cpuset_init(void); extern void cpuset_init_smp(void); extern void cpuset_force_rebuild(void); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 6a1d79d..c3f5527 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1116,6 +1116,20 @@ extern struct zone *next_zone(struct zone *zone); ; /* do nothing */ \ else +/* Whether the 'nodes' are all movable nodes */ +static inline bool movable_only_nodes(nodemask_t *nodes) +{ + struct zone *zone; + + for_each_populated_zone(zone) { + if (zone_idx(zone) != ZONE_MOVABLE && + node_isset(zone_to_nid(zone), *nodes)) + return false; + } + + return true; +} + static inline struct zone *zonelist_zone(struct zoneref *zoneref) { return zoneref->zone; diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index df1ccf4..e8a9053 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -69,6 +69,13 @@ DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key); DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key); +/* + * There could be abnormal cpuset configurations for cpu or memory + * node binding, add this key to provide a quick low-cost judgement + * of the situation. 
+ */ +DEFINE_STATIC_KEY_FALSE(cpusets_abnormal_setup_key); + /* See "Frequency meter" comments, below. */ struct fmeter { @@ -1868,6 +1875,9 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs, if (retval < 0) goto done; + if (movable_only_nodes(&trialcs->mems_allowed)) + static_branch_enable(&cpusets_abnormal_setup_key); + spin_lock_irq(&callback_lock); cs->mems_allowed = trialcs->mems_allowed; spin_unlock_irq(&callback_lock); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4e455fa..5728675 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4919,7 +4919,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, * any suitable zone to satisfy the request - e.g. non-movable * GFP_HIGHUSER allocations from MOVABLE nodes only. */ - if (cpusets_enabled() && (gfp_mask & __GFP_HARDWALL)) { + if (cpusets_enabled() && + cpusets_abnormal_check_needed() && + (gfp_mask & __GFP_HARDWALL)) { struct zoneref *z = first_zones_zonelist(ac->zonelist, ac->highest_zoneidx, &cpuset_current_mems_allowed);
On Fri 10-09-21 15:44:00, Feng Tang wrote: > On Wed, Sep 08, 2021 at 09:06:24AM +0200, Michal Hocko wrote: > > On Wed 08-09-21 09:50:14, Feng Tang wrote: > > > On Tue, Sep 07, 2021 at 10:44:32AM +0200, Michal Hocko wrote: > > [...] > > > > While this is a good fix from the functionality POV I believe you can go > > > > a step further. Please add a detection to the cpuset code and complain > > > > to the kernel log if somebody tries to configure movable only cpuset. > > > > Once you have that in place you can easily create a static branch for > > > > cpuset_insane_setup() and have zero overhead for all reasonable > > > > configuration. There shouldn't be any reason to pay a single cpu cycle > > > > to check for something that almost nobody does. > > > > > > > > What do you think? > > > > > > I thought about the implementation, IIUC, the static_branch_enable() is > > > easy, it could be done when cpuset.mems is set with movable only nodes, > > > but disable() is much complexer, > > > > Do we care about disable at all? The point is to not have 99,999999% > > users pay overhead of the check which is irrelevant to them. Once > > somebody wants to use this "creative" setup then paying an extra check > > sounds perfectly sensible to me. If somebody cares enough then the > > disable logic could be implemented. But for now I believe we should be > > OK with only enable case. > > Here is tested draft patch to add the check in cpuset code (the looping > zone code could be improved by adding a for_each_populated_zone_nodemask > macro. > > Thanks, > Feng > > --- > include/linux/cpuset.h | 7 +++++++ > include/linux/mmzone.h | 14 ++++++++++++++ > kernel/cgroup/cpuset.c | 10 ++++++++++ > mm/page_alloc.c | 4 +++- > 4 files changed, 34 insertions(+), 1 deletion(-) > > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h > index d2b9c41..a434985 100644 > --- a/include/linux/cpuset.h > +++ b/include/linux/cpuset.h > @@ -34,6 +34,8 @@ > */ > extern struct static_key_false cpusets_pre_enable_key; > extern struct static_key_false cpusets_enabled_key; > +extern struct static_key_false cpusets_abnormal_setup_key; > + > static inline bool cpusets_enabled(void) > { > return static_branch_unlikely(&cpusets_enabled_key); > @@ -51,6 +53,11 @@ static inline void cpuset_dec(void) > static_branch_dec_cpuslocked(&cpusets_pre_enable_key); > } > > +static inline bool cpusets_abnormal_check_needed(void) I would go with cpusets_insane_config with a comment explaining what that means. I would also do a pr_info() when the static branch is enabled. [...] > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 4e455fa..5728675 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4919,7 +4919,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > * any suitable zone to satisfy the request - e.g. non-movable > * GFP_HIGHUSER allocations from MOVABLE nodes only. > */ > - if (cpusets_enabled() && (gfp_mask & __GFP_HARDWALL)) { > + if (cpusets_enabled() && > + cpusets_abnormal_check_needed() && You do not need cpusets_enabled check here. Remember the primary point is to not introduce any branch unless a dubious configuration is in place. > + (gfp_mask & __GFP_HARDWALL)) { > struct zoneref *z = first_zones_zonelist(ac->zonelist, > ac->highest_zoneidx, > &cpuset_current_mems_allowed); > -- > 2.7.4 >
On Fri, Sep 10, 2021 at 10:35:17AM +0200, Michal Hocko wrote: [...] > > > > +static inline bool cpusets_abnormal_check_needed(void) > > I would go with cpusets_insane_config with a comment explaining what > that means. I would also do a pr_info() when the static branch is > enabled. > > [...] > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 4e455fa..5728675 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -4919,7 +4919,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > > * any suitable zone to satisfy the request - e.g. non-movable > > * GFP_HIGHUSER allocations from MOVABLE nodes only. > > */ > > - if (cpusets_enabled() && (gfp_mask & __GFP_HARDWALL)) { > > + if (cpusets_enabled() && > > + cpusets_abnormal_check_needed() && > > You do not need cpusets_enabled check here. Remember the primary point > is to not introduce any branch unless a dubious configuration is in > place. Thanks for the review, patch updated below. Also should we combine this one with the original detection patch? Thanks, Feng --- include/linux/cpuset.h | 13 +++++++++++++ include/linux/mmzone.h | 14 ++++++++++++++ kernel/cgroup/cpuset.c | 13 +++++++++++++ mm/page_alloc.c | 2 +- 4 files changed, 41 insertions(+), 1 deletion(-) diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index d2b9c41..95bacec 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -34,6 +34,8 @@ */ extern struct static_key_false cpusets_pre_enable_key; extern struct static_key_false cpusets_enabled_key; +extern struct static_key_false cpusets_insane_config_key; + static inline bool cpusets_enabled(void) { return static_branch_unlikely(&cpusets_enabled_key); @@ -51,6 +53,17 @@ static inline void cpuset_dec(void) static_branch_dec_cpuslocked(&cpusets_pre_enable_key); } +/* + * Check if there has been insane configurations. E.g. there was usages + * which binds a docker OS to memory nodes with only movable zones, which + * causes system to behave abnormally, as the usage triggers many innocent + * processes get oom-killed. + */ +static inline bool cpusets_insane_config(void) +{ + return static_branch_unlikely(&cpusets_insane_config_key); +} + extern int cpuset_init(void); extern void cpuset_init_smp(void); extern void cpuset_force_rebuild(void); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 6a1d79d..c3f5527 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1116,6 +1116,20 @@ extern struct zone *next_zone(struct zone *zone); ; /* do nothing */ \ else +/* Whether the 'nodes' are all movable nodes */ +static inline bool movable_only_nodes(nodemask_t *nodes) +{ + struct zone *zone; + + for_each_populated_zone(zone) { + if (zone_idx(zone) != ZONE_MOVABLE && + node_isset(zone_to_nid(zone), *nodes)) + return false; + } + + return true; +} + static inline struct zone *zonelist_zone(struct zoneref *zoneref) { return zoneref->zone; diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index df1ccf4..e0cb12e 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -69,6 +69,13 @@ DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key); DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key); +/* + * There could be abnormal cpuset configurations for cpu or memory + * node binding, add this key to provide a quick low-cost judgement + * of the situation. + */ +DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key); + /* See "Frequency meter" comments, below. 
*/ struct fmeter { @@ -1868,6 +1875,12 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs, if (retval < 0) goto done; + if (movable_only_nodes(&trialcs->mems_allowed)) { + static_branch_enable(&cpusets_insane_config_key); + pr_info("cpuset: See abornal binding to movable nodes only(nmask=%*pbl)\n", + nodemask_pr_args(&trialcs->mems_allowed)); + } + spin_lock_irq(&callback_lock); cs->mems_allowed = trialcs->mems_allowed; spin_unlock_irq(&callback_lock); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4e455fa..a7e0854 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4919,7 +4919,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, * any suitable zone to satisfy the request - e.g. non-movable * GFP_HIGHUSER allocations from MOVABLE nodes only. */ - if (cpusets_enabled() && (gfp_mask & __GFP_HARDWALL)) { + if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) { struct zoneref *z = first_zones_zonelist(ac->zonelist, ac->highest_zoneidx, &cpuset_current_mems_allowed);
I would squash the two changes into a single patch. On Fri 10-09-21 17:21:32, Feng Tang wrote: [...] > +/* > + * Check if there has been insane configurations. E.g. there was usages > + * which binds a docker OS to memory nodes with only movable zones, which > + * causes system to behave abnormally, as the usage triggers many innocent > + * processes get oom-killed. I would go with more specifics here. What about /* * This will get enabled whenever a cpuset configuration is considered * unsupportable in general. E.g. movable only node which cannot satisfy * any non movable allocations (see update_nodemask). * Page allocator needs to make additional checks for those * configurations and this check is meant to guard those checks without * any overhead for sane configurations. */ > + */ > +static inline bool cpusets_insane_config(void) > +{ > + return static_branch_unlikely(&cpusets_insane_config_key); > +} > + > extern int cpuset_init(void); > extern void cpuset_init_smp(void); > extern void cpuset_force_rebuild(void); > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 6a1d79d..c3f5527 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -1116,6 +1116,20 @@ extern struct zone *next_zone(struct zone *zone); > ; /* do nothing */ \ > else > > +/* Whether the 'nodes' are all movable nodes */ > +static inline bool movable_only_nodes(nodemask_t *nodes) > +{ > + struct zone *zone; > + > + for_each_populated_zone(zone) { > + if (zone_idx(zone) != ZONE_MOVABLE && > + node_isset(zone_to_nid(zone), *nodes)) > + return false; > + } > + > + return true; Sorry I didn't really get to read this previously. The implementation works but I find it harder to read than really necessary. Why don't you use first_zones_zonelist here as well? > +} > + > static inline struct zone *zonelist_zone(struct zoneref *zoneref) > { > return zoneref->zone; > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > index df1ccf4..e0cb12e 100644 > --- a/kernel/cgroup/cpuset.c > +++ b/kernel/cgroup/cpuset.c > @@ -69,6 +69,13 @@ > DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key); > DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key); > > +/* > + * There could be abnormal cpuset configurations for cpu or memory > + * node binding, add this key to provide a quick low-cost judgement > + * of the situation. > + */ > +DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key); > + > /* See "Frequency meter" comments, below. */ > > struct fmeter { > @@ -1868,6 +1875,12 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs, > if (retval < 0) > goto done; > > + if (movable_only_nodes(&trialcs->mems_allowed)) { > + static_branch_enable(&cpusets_insane_config_key); > + pr_info("cpuset: See abornal binding to movable nodes only(nmask=%*pbl)\n", > + nodemask_pr_args(&trialcs->mems_allowed)); This doesn't sound very useful for admins IMHO. It is not clear what the problem is and how to deal with it. What about pr_into("Unsupported (movable nodes only) cpuset configuration detected! Cpuset allocations might fail even with a lot of memory available."); > + } > + > spin_lock_irq(&callback_lock); > cs->mems_allowed = trialcs->mems_allowed; > spin_unlock_irq(&callback_lock); > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 4e455fa..a7e0854 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4919,7 +4919,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > * any suitable zone to satisfy the request - e.g. non-movable > * GFP_HIGHUSER allocations from MOVABLE nodes only. 
> */ > - if (cpusets_enabled() && (gfp_mask & __GFP_HARDWALL)) { > + if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) { > struct zoneref *z = first_zones_zonelist(ac->zonelist, > ac->highest_zoneidx, > &cpuset_current_mems_allowed); > -- > 2.7.4
On Fri, Sep 10, 2021 at 12:35:02PM +0200, Michal Hocko wrote: > I would squash the two changes into a single patch. > > On Fri 10-09-21 17:21:32, Feng Tang wrote: > [...] > > +/* > > + * Check if there has been insane configurations. E.g. there was usages > > + * which binds a docker OS to memory nodes with only movable zones, which > > + * causes system to behave abnormally, as the usage triggers many innocent > > + * processes get oom-killed. > > I would go with more specifics here. What about > > /* > * This will get enabled whenever a cpuset configuration is considered > * unsupportable in general. E.g. movable only node which cannot satisfy > * any non movable allocations (see update_nodemask). > * Page allocator needs to make additional checks for those > * configurations and this check is meant to guard those checks without > * any overhead for sane configurations. > */ will use this, thanks > > + */ > > +static inline bool cpusets_insane_config(void) > > +{ > > + return static_branch_unlikely(&cpusets_insane_config_key); > > +} > > + > > extern int cpuset_init(void); > > extern void cpuset_init_smp(void); > > extern void cpuset_force_rebuild(void); > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > index 6a1d79d..c3f5527 100644 > > --- a/include/linux/mmzone.h > > +++ b/include/linux/mmzone.h > > @@ -1116,6 +1116,20 @@ extern struct zone *next_zone(struct zone *zone); > > ; /* do nothing */ \ > > else > > > > +/* Whether the 'nodes' are all movable nodes */ > > +static inline bool movable_only_nodes(nodemask_t *nodes) > > +{ > > + struct zone *zone; > > + > > + for_each_populated_zone(zone) { > > + if (zone_idx(zone) != ZONE_MOVABLE && > > + node_isset(zone_to_nid(zone), *nodes)) > > + return false; > > + } > > + > > + return true; > > Sorry I didn't really get to read this previously. The implementation > works but I find it harder to read than really necessary. Why don't you > use first_zones_zonelist here as well? The concern I had was which zonelist to use, local node or the first node of nodemask's node_zonelists[ZONELIST_FALLBACK], and the node_zonelists[] is initialized in node init time, I'm not sure if all nodes's zone will be showed in it (though it is 'yes' for current code). Also, I tried another logic (a new 'for_each_populated_zone_nodes' macro) for_each_populated_zone_nodes(zone, node, *nodes) { if (zone_idx(zone) != ZONE_MOVABLE) return false; } > > +} > > + > > static inline struct zone *zonelist_zone(struct zoneref *zoneref) > > { > > return zoneref->zone; > > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > > index df1ccf4..e0cb12e 100644 > > --- a/kernel/cgroup/cpuset.c > > +++ b/kernel/cgroup/cpuset.c > > @@ -69,6 +69,13 @@ > > DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key); > > DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key); > > > > +/* > > + * There could be abnormal cpuset configurations for cpu or memory > > + * node binding, add this key to provide a quick low-cost judgement > > + * of the situation. > > + */ > > +DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key); > > + > > /* See "Frequency meter" comments, below. 
*/ > > > > struct fmeter { > > @@ -1868,6 +1875,12 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs, > > if (retval < 0) > > goto done; > > > > + if (movable_only_nodes(&trialcs->mems_allowed)) { > > + static_branch_enable(&cpusets_insane_config_key); > > + pr_info("cpuset: See abornal binding to movable nodes only(nmask=%*pbl)\n", > > + nodemask_pr_args(&trialcs->mems_allowed)); > > This doesn't sound very useful for admins IMHO. It is not clear what the > problem is and how to deal with it. What about > pr_into("Unsupported (movable nodes only) cpuset configuration detected! Cpuset allocations might fail even with a lot of memory available."); Yep, this is better! Thanks, Feng
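(The for_each_populated_zone_nodes() helper Feng floats above is hypothetical; it does not exist upstream. A rough sketch of how it and the rewritten movable_only_nodes() might look, under that assumption:)

/*
 * Walk only the populated zones that belong to nodes set in 'nodes'.
 * Note: assumes all nodes in the mask are online (NODE_DATA valid);
 * the exact form would need review.
 */
#define for_each_populated_zone_nodes(zone, node, nodes)		\
	for_each_node_mask(node, nodes)					\
		for (zone = NODE_DATA(node)->node_zones;		\
		     zone < NODE_DATA(node)->node_zones + MAX_NR_ZONES;	\
		     zone++)						\
			if (!populated_zone(zone))			\
				; /* skip unpopulated zones */		\
			else

/* movable_only_nodes() rewritten with it, matching the snippet above */
static inline bool movable_only_nodes(nodemask_t *nodes)
{
	struct zone *zone;
	int node;

	for_each_populated_zone_nodes(zone, node, *nodes) {
		if (zone_idx(zone) != ZONE_MOVABLE)
			return false;
	}

	return true;
}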
On Fri 10-09-21 19:29:53, Feng Tang wrote: [...] > > Sorry I didn't really get to read this previously. The implementation > > works but I find it harder to read than really necessary. Why don't you > > use first_zones_zonelist here as well? > > The concern I had was which zonelist to use, local node or the first node > of nodemask's node_zonelists[ZONELIST_FALLBACK], I am not sure I see your concern. Either of the two should work just fine because all nodes should be reachable from the zonelist. But why don't you simply do the same kind of check as in the page allocator?
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f95e1d2386a1..d6657f68d1fb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4929,6 +4929,19 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (!ac->preferred_zoneref->zone)
 		goto nopage;
 
+	/*
+	 * Check for insane configurations where the cpuset doesn't contain
+	 * any suitable zone to satisfy the request - e.g. non-movable
+	 * GFP_HIGHUSER allocations from MOVABLE nodes only.
+	 */
+	if (cpusets_enabled() && (gfp_mask & __GFP_HARDWALL)) {
+		struct zoneref *z = first_zones_zonelist(ac->zonelist,
+					ac->highest_zoneidx,
+					&cpuset_current_mems_allowed);
+		if (!z->zone)
+			goto nopage;
+	}
+
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
There was a report that starting an Ubuntu in docker while using cpuset to bind it to movable nodes (a node that only has a movable zone, like a node for hotplug or a Persistent Memory node in normal usage) will fail due to memory allocation failure, and then the OOM killer is invoked and many other innocent processes get killed. It can be reproduced with the command:

$docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status" (node 4 is a movable node)

The reason is that, in this case, the target cpuset nodes only have a movable zone, while the creation of an OS in docker sometimes needs to allocate memory in non-movable zones (dma/dma32/normal), e.g. GFP_HIGHUSER, and the cpuset limit forbids the allocation; out-of-memory killing is then involved even when normal nodes and movable nodes both have plenty of free memory.

The failure is reasonable, but there is still one problem: when the request is impossible to satisfy due to the cpuset limit, the allocation should simply not trigger reclaim/compaction, and more importantly, should not get any innocent process oom-killed.

So add detection for cases like this in the slowpath of allocation, and bail out early, returning NULL for the allocation.

We've run some cases of malloc/mmap/page_fault/lru-shm/swap from will-it-scale and vm-scalability, and didn't see obvious performance change (all inside +/- 1%); test boxes are 2 socket Cascade Lake and Icelake servers.

[thanks to Michal Hocko and David Rientjes for suggesting not handling it inside OOM code]

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
Changelog:

  since RFC
  * move the handling from oom code to page allocation path (Michal/David)

 mm/page_alloc.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)