Message ID | 20220801084207.39086-1-songmuchun@bytedance.com (mailing list archive) |
---|---|
State | New |
Series | mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case |
On Mon 01-08-22 16:42:07, Muchun Song wrote: > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > for filtering nodes for page allocation, which is a hard restriction (see the user > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > calling it. BTW, it is found by code inspection. I am not sure this is the right fix. It is quite true that policy_nodemask is a tricky function to use. It pretends to have a higher level logic but all existing users are expected to be policy aware and they special case allocation for each policy. That would mean that hugetlb should do the same. I haven't checked the actual behavior implications for hugetlb here. Is MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it work? From a quick look this just ignores MPOL_PREFERRED_MANY completely. > Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > --- > mm/mempolicy.c | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 6c27acb6cd63..4deec7e598c6 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > return &policy->nodes; > > - if (mode == MPOL_PREFERRED_MANY) > - return &policy->nodes; > - > return NULL; > } > > -- > 2.11.0
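For context on why MPOL_PREFERRED_MANY is "a preference, not a hard restriction": the regular page allocator handles it in a dedicated two-pass path that retries with no nodemask at all when the preferred nodes cannot satisfy the request. A simplified sketch of that fallback, paraphrased from mm/mempolicy.c of this era rather than quoted verbatim:

static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
                                               int nid, struct mempolicy *pol)
{
        gfp_t preferred_gfp;
        struct page *page;

        /* First pass: preferred nodes only, no direct reclaim, allow failure. */
        preferred_gfp = gfp | __GFP_NOWARN;
        preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
        page = __alloc_pages(preferred_gfp, order, nid, &pol->nodes);

        /* Second pass: fall back to all allowed nodes. */
        if (!page)
                page = __alloc_pages(gfp, order, nid, NULL);

        return page;
}

Because of that fallback, the nodemask carried by MPOL_PREFERRED_MANY never acts as a hard limit on where memory can come from, which is the premise of the patch under discussion.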
On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote: > On Mon 01-08-22 16:42:07, Muchun Song wrote: > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > > for filtering nodes for page allocation, which is a hard restriction (see the user > > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > > calling it. BTW, it is found by code inspection. > > I am not sure this is the right fix. It is quite true that > policy_nodemask is a tricky function to use. It pretends to have a > higher level logic but all existing users are expected to be policy > aware and they special case allocation for each policy. That would mean > that hugetlb should do the same. Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also confused about policy_nodemask(), as it is never a 'strict' one as the old code is: if (unlikely(mode == MPOL_BIND) && apply_policy_zone(policy, gfp_zone(gfp)) && cpuset_nodemask_valid_mems_allowed(&policy->nodes)) return &policy->nodes; return NULL Even when the MPOL_BIND's nodes is not allowed by cpuset, it will still return NULL (equals all nodes). From the semantics of allowed_mems_nr(), I think it does get changed a little by b27abaccf8e8. And to enforce the 'strict' semantic for 'allowed', we may need a more strict nodemask API for it. > I haven't checked the actual behavior implications for hugetlb here. Is > MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it > work? From a quick look this just ignores MPOL_PREFERRED_MANY > completely. IIRC, the hugetlb will hornor MPOL_PREFERRED_MANY. And I can double check and report back if otherwise. > > Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > > --- > > mm/mempolicy.c | 3 --- > > 1 file changed, 3 deletions(-) > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > index 6c27acb6cd63..4deec7e598c6 100644 > > --- a/mm/mempolicy.c > > +++ b/mm/mempolicy.c > > @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > return &policy->nodes; > > > > - if (mode == MPOL_PREFERRED_MANY) > > - return &policy->nodes; I think it will make MPOL_PREFERRED_MANY not usable. Thanks, Feng > > - > > return NULL; > > } > > > > -- > > 2.11.0 > > -- > Michal Hocko > SUSE Labs
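To make the 'allowed' vs. 'preferred' distinction Feng raises concrete, here is a small userspace model of the allowed_mems_nr()-style counting; the per-node free counts and the masks are invented purely for illustration and do not come from the thread:

#include <stdio.h>

#define MAX_NODES 4

int main(void)
{
        /* free huge pages per node: nodes 0-1 empty, nodes 2-3 have 64 each */
        unsigned int free_huge_pages_node[MAX_NODES] = { 0, 0, 64, 64 };
        unsigned long preferred = 0x3;  /* MPOL_PREFERRED_MANY: nodes 0-1 */
        unsigned long cpuset = 0xf;     /* cpuset allows all four nodes   */
        unsigned int nr_hard = 0, nr_pref = 0;

        for (int node = 0; node < MAX_NODES; node++) {
                if (!(cpuset & (1UL << node)))
                        continue;
                /* preferred mask treated as a hard restriction */
                if (preferred & (1UL << node))
                        nr_hard += free_huge_pages_node[node];
                /* preference semantics: fallback allowed, count all cpuset nodes */
                nr_pref += free_huge_pages_node[node];
        }

        printf("counted as hard restriction: %u\n", nr_hard);  /* prints 0   */
        printf("counted as a preference:     %u\n", nr_pref);  /* prints 128 */
        return 0;
}

A hugetlb reservation gated on the first number would fail even though the allocator could fall back to nodes 2-3, which is the breakage the patch describes.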
On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote: > On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote: > > On Mon 01-08-22 16:42:07, Muchun Song wrote: > > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > > > for filtering nodes for page allocation, which is a hard restriction (see the user > > > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > > > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > > > calling it. BTW, it is found by code inspection. > > > > I am not sure this is the right fix. It is quite true that > > policy_nodemask is a tricky function to use. It pretends to have a > > higher level logic but all existing users are expected to be policy > > aware and they special case allocation for each policy. That would mean > > that hugetlb should do the same. > > Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also > confused about policy_nodemask(), as it is never a 'strict' one as > the old code is: > > if (unlikely(mode == MPOL_BIND) && > apply_policy_zone(policy, gfp_zone(gfp)) && > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > return &policy->nodes; > > return NULL > > Even when the MPOL_BIND's nodes is not allowed by cpuset, it will > still return NULL (equals all nodes). > Well, I agree policy_nodemask() is really confusing because of the shortage of comments and the weird logic. > From the semantics of allowed_mems_nr(), I think it does get changed > a little by b27abaccf8e8. And to enforce the 'strict' semantic for > 'allowed', we may need a more strict nodemask API for it. > Maybe this is a good idea to fix this, e.g. introducing a new helper to return the strict allowed nodemask. > > I haven't checked the actual behavior implications for hugetlb here. Is > > MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it > > work? From a quick look this just ignores MPOL_PREFERRED_MANY > > completely. > > IIRC, the hugetlb will hornor MPOL_PREFERRED_MANY. And I can double > check and report back if otherwise. > > > > Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > > > --- > > > mm/mempolicy.c | 3 --- > > > 1 file changed, 3 deletions(-) > > > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > > index 6c27acb6cd63..4deec7e598c6 100644 > > > --- a/mm/mempolicy.c > > > +++ b/mm/mempolicy.c > > > @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) > > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > return &policy->nodes; > > > > > > - if (mode == MPOL_PREFERRED_MANY) > > > - return &policy->nodes; > > I think it will make MPOL_PREFERRED_MANY not usable. > Sorry, I didn't got what you mean here. Could you explain more details about why it is not usable? Thanks. > Thanks, > Feng > > > > - > > > return NULL; > > > } > > > > > > -- > > > 2.11.0 > > > > -- > > Michal Hocko > > SUSE Labs >
On Tue, Aug 02, 2022 at 11:42:52AM +0800, Muchun Song wrote: > On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote: > > On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote: > > > On Mon 01-08-22 16:42:07, Muchun Song wrote: > > > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > > > > for filtering nodes for page allocation, which is a hard restriction (see the user > > > > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > > > > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > > > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > > > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > > > > calling it. BTW, it is found by code inspection. > > > > > > I am not sure this is the right fix. It is quite true that > > > policy_nodemask is a tricky function to use. It pretends to have a > > > higher level logic but all existing users are expected to be policy > > > aware and they special case allocation for each policy. That would mean > > > that hugetlb should do the same. > > > > Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also > > confused about policy_nodemask(), as it is never a 'strict' one as > > the old code is: > > > > if (unlikely(mode == MPOL_BIND) && > > apply_policy_zone(policy, gfp_zone(gfp)) && > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > return &policy->nodes; > > > > return NULL > > > > Even when the MPOL_BIND's nodes is not allowed by cpuset, it will > > still return NULL (equals all nodes). > > > > Well, I agree policy_nodemask() is really confusing because of the > shortage of comments and the weird logic. > > > From the semantics of allowed_mems_nr(), I think it does get changed > > a little by b27abaccf8e8. And to enforce the 'strict' semantic for > > 'allowed', we may need a more strict nodemask API for it. > > > > Maybe this is a good idea to fix this, e.g. introducing a new helper > to return the strict allowed nodemask. Yep. I had another thought to add one global all-zero nodemask, for API like policy_nodemask(), it has 2 types of return value: * a nodemask with some bits set * NULL (means all nodes) Here a new type of zero nodemask (a gloabl variable)can be created to indicate no qualified node. > > > I haven't checked the actual behavior implications for hugetlb here. Is > > > MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it > > > work? From a quick look this just ignores MPOL_PREFERRED_MANY > > > completely. > > > > IIRC, the hugetlb will hornor MPOL_PREFERRED_MANY. And I can double > > check and report back if otherwise. > > > > > > Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > > > > --- > > > > mm/mempolicy.c | 3 --- > > > > 1 file changed, 3 deletions(-) > > > > > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > > > index 6c27acb6cd63..4deec7e598c6 100644 > > > > --- a/mm/mempolicy.c > > > > +++ b/mm/mempolicy.c > > > > @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) > > > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > > return &policy->nodes; > > > > > > > > - if (mode == MPOL_PREFERRED_MANY) > > > > - return &policy->nodes; > > > > I think it will make MPOL_PREFERRED_MANY not usable. > > > > Sorry, I didn't got what you mean here. 
> Could you explain more details about why it is not usable? I thought alloc_pages() will rely on policy_nodemask(), which was wrong as I forgot that MPOL_PREFERRED_MANY has a dedicated function alloc_pages_preferred_many() to handle it. Sorry for the confusion. Thanks, Feng > Thanks. > > > Thanks, > > Feng > > > > > > - > > > > return NULL; > > > > } > > > > > > > > -- > > > > 2.11.0 > > > > > > -- > > > Michal Hocko > > > SUSE Labs > > >
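The dispatch Feng refers to sits in alloc_pages(); roughly (simplified, with the in-interrupt and default-policy details elided), policy_nodemask() is only reached on the final branch:

struct page *alloc_pages(gfp_t gfp, unsigned int order)
{
        struct mempolicy *pol = get_task_policy(current);
        struct page *page;

        if (pol->mode == MPOL_INTERLEAVE)
                page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
        else if (pol->mode == MPOL_PREFERRED_MANY)
                page = alloc_pages_preferred_many(gfp, order,
                                numa_node_id(), pol);
        else
                page = __alloc_pages(gfp, order,
                                policy_node(gfp, pol, numa_node_id()),
                                policy_nodemask(gfp, pol));

        return page;
}

So removing the MPOL_PREFERRED_MANY branch from policy_nodemask() leaves this path untouched; hugetlb's allowed_mems_nr() is the caller that notices the difference.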
On Tue, Aug 02, 2022 at 01:52:05PM +0800, Feng Tang wrote: > On Tue, Aug 02, 2022 at 11:42:52AM +0800, Muchun Song wrote: > > On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote: > > > On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote: > > > > On Mon 01-08-22 16:42:07, Muchun Song wrote: > > > > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > > > > > for filtering nodes for page allocation, which is a hard restriction (see the user > > > > > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > > > > > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > > > > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > > > > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > > > > > calling it. BTW, it is found by code inspection. > > > > > > > > I am not sure this is the right fix. It is quite true that > > > > policy_nodemask is a tricky function to use. It pretends to have a > > > > higher level logic but all existing users are expected to be policy > > > > aware and they special case allocation for each policy. That would mean > > > > that hugetlb should do the same. > > > > > > Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also > > > confused about policy_nodemask(), as it is never a 'strict' one as > > > the old code is: > > > > > > if (unlikely(mode == MPOL_BIND) && > > > apply_policy_zone(policy, gfp_zone(gfp)) && > > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > return &policy->nodes; > > > > > > return NULL > > > > > > Even when the MPOL_BIND's nodes is not allowed by cpuset, it will > > > still return NULL (equals all nodes). > > > > > > > Well, I agree policy_nodemask() is really confusing because of the > > shortage of comments and the weird logic. > > > > > From the semantics of allowed_mems_nr(), I think it does get changed > > > a little by b27abaccf8e8. And to enforce the 'strict' semantic for > > > 'allowed', we may need a more strict nodemask API for it. > > > > > > > Maybe this is a good idea to fix this, e.g. introducing a new helper > > to return the strict allowed nodemask. > > Yep. > > I had another thought to add one global all-zero nodemask, for API like > policy_nodemask(), it has 2 types of return value: > * a nodemask with some bits set > * NULL (means all nodes) > > Here a new type of zero nodemask (a gloabl variable)can be created to > indicate no qualified node. > I know why you want to introduce a gloable zero nidemask. Since we already have a glable nodemask array, namely node_states, instead of returning NULL for the case of all nodes, how about returing node_states[N_ONLINE] for it? And make it return NULL for the case where no nodes are allowed. Any thought? > > > > I haven't checked the actual behavior implications for hugetlb here. Is > > > > MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it > > > > work? From a quick look this just ignores MPOL_PREFERRED_MANY > > > > completely. > > > > > > IIRC, the hugetlb will hornor MPOL_PREFERRED_MANY. And I can double > > > check and report back if otherwise. 
> > > > > > > > Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > > > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > > > > > --- > > > > > mm/mempolicy.c | 3 --- > > > > > 1 file changed, 3 deletions(-) > > > > > > > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > > > > index 6c27acb6cd63..4deec7e598c6 100644 > > > > > --- a/mm/mempolicy.c > > > > > +++ b/mm/mempolicy.c > > > > > @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) > > > > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > > > return &policy->nodes; > > > > > > > > > > - if (mode == MPOL_PREFERRED_MANY) > > > > > - return &policy->nodes; > > > > > > I think it will make MPOL_PREFERRED_MANY not usable. > > > > > > > Sorry, I didn't got what you mean here. Could you explain more details > > about why it is not usable? > > I thought alloc_pages() will rely on policy_nodemask(), which was wrong > as I forgot the MPOL_PREFERRED_MANY has a dedicated function > alloc_pages_preferred_many() to handle it. Sorry for the confusion. > > Thanks, > Feng > > > Thanks. > > > > > Thanks, > > > Feng > > > > > > > > - > > > > > return NULL; > > > > > } > > > > > > > > > > -- > > > > > 2.11.0 > > > > > > > > -- > > > > Michal Hocko > > > > SUSE Labs > > > > > >
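To make the return-value convention Muchun floats here concrete, policy_nodemask() would look roughly like the sketch below (illustrative only; nothing like this was posted in the thread):

nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
{
        if (unlikely(policy->mode == MPOL_BIND)) {
                if (apply_policy_zone(policy, gfp_zone(gfp)) &&
                    cpuset_nodemask_valid_mems_allowed(&policy->nodes))
                        return &policy->nodes;
                /* MPOL_BIND nodes do not overlap the cpuset: nothing allowed */
                return NULL;
        }

        /* every other policy may allocate from any online node */
        return &node_states[N_ONLINE];
}

Existing callers that test the return value against NULL would all need auditing under such a change, which is one reason the discussion below heads toward a hugetlb-local helper instead.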
On Tue, Aug 02, 2022 at 02:40:11PM +0800, Muchun Song wrote: > On Tue, Aug 02, 2022 at 01:52:05PM +0800, Feng Tang wrote: > > On Tue, Aug 02, 2022 at 11:42:52AM +0800, Muchun Song wrote: > > > On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote: > > > > On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote: > > > > > On Mon 01-08-22 16:42:07, Muchun Song wrote: > > > > > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > > > > > > for filtering nodes for page allocation, which is a hard restriction (see the user > > > > > > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > > > > > > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > > > > > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > > > > > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > > > > > > calling it. BTW, it is found by code inspection. > > > > > > > > > > I am not sure this is the right fix. It is quite true that > > > > > policy_nodemask is a tricky function to use. It pretends to have a > > > > > higher level logic but all existing users are expected to be policy > > > > > aware and they special case allocation for each policy. That would mean > > > > > that hugetlb should do the same. > > > > > > > > Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also > > > > confused about policy_nodemask(), as it is never a 'strict' one as > > > > the old code is: > > > > > > > > if (unlikely(mode == MPOL_BIND) && > > > > apply_policy_zone(policy, gfp_zone(gfp)) && > > > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > > return &policy->nodes; > > > > > > > > return NULL > > > > > > > > Even when the MPOL_BIND's nodes is not allowed by cpuset, it will > > > > still return NULL (equals all nodes). > > > > > > > > > > Well, I agree policy_nodemask() is really confusing because of the > > > shortage of comments and the weird logic. > > > > > > > From the semantics of allowed_mems_nr(), I think it does get changed > > > > a little by b27abaccf8e8. And to enforce the 'strict' semantic for > > > > 'allowed', we may need a more strict nodemask API for it. > > > > > > > > > > Maybe this is a good idea to fix this, e.g. introducing a new helper > > > to return the strict allowed nodemask. > > > > Yep. > > > > I had another thought to add one global all-zero nodemask, for API like > > policy_nodemask(), it has 2 types of return value: > > * a nodemask with some bits set > > * NULL (means all nodes) > > > > Here a new type of zero nodemask (a gloabl variable)can be created to > > indicate no qualified node. > > > > I know why you want to introduce a gloable zero nidemask. Since we already > have a glable nodemask array, namely node_states, instead of returning NULL > for the case of all nodes, how about returing node_states[N_ONLINE] for it? > And make it return NULL for the case where no nodes are allowed. Any thought? I think return node_states[N_ONLINE] can simplify the code in allowed_mems_nr(), the empty zero nodes can simplify further. 
Here is some draft patch (not tested) to show the idea Thanks, Feng --- include/linux/mempolicy.h | 8 ++++++++ include/linux/nodemask.h | 7 +++++++ mm/hugetlb.c | 7 ++++--- mm/mempolicy.c | 17 +++++++++++++++++ mm/page_alloc.c | 3 +++ 5 files changed, 39 insertions(+), 3 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 668389b4b53d..b5451fef1620 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -150,6 +150,7 @@ extern bool init_nodemask_of_mempolicy(nodemask_t *mask); extern bool mempolicy_in_oom_domain(struct task_struct *tsk, const nodemask_t *mask); extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); +extern nodemask_t *allowed_policy_nodemask(gfp_t gfp, struct mempolicy *policy); static inline nodemask_t *policy_nodemask_current(gfp_t gfp) { @@ -158,6 +159,13 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp) return policy_nodemask(gfp, mpol); } +static inline nodemask_t *allowed_policy_nodemask_current(gfp_t gfp) +{ + struct mempolicy *mpol = get_task_policy(current); + + return allowed_policy_nodemask(gfp, mpol); +} + extern unsigned int mempolicy_slab_node(void); extern enum zone_type policy_zone; diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h index 0f233b76c9ce..dc5fab38e810 100644 --- a/include/linux/nodemask.h +++ b/include/linux/nodemask.h @@ -409,6 +409,13 @@ enum node_states { extern nodemask_t node_states[NR_NODE_STATES]; +extern nodemask_t zero_nodes; + +static inline bool is_empty_nodes(nodemask_t *pnodes) +{ + return (pnodes == &zero_nodes || __nodes_empty(pnodes, MAX_NUMNODES)); +} + #if MAX_NUMNODES > 1 static inline int node_state(int node, enum node_states state) { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a57e1be41401..dc9f4ed32909 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4340,10 +4340,11 @@ static unsigned int allowed_mems_nr(struct hstate *h) mpol_allowed = policy_nodemask_current(gfp_mask); - for_each_node_mask(node, cpuset_current_mems_allowed) { - if (!mpol_allowed || node_isset(node, *mpol_allowed)) + if (is_empty_nodes(mpol_allowed)) + return 0; + + for_each_node_mask(node, mpol_allowed) nr += array[node]; - } return nr; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c index d39b01fd52fe..3e936b8ca9ea 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1845,6 +1845,23 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) return NULL; } +/* + * Return the allowed nodes mask for a mempolicy and page allocation, + * which is a 'stricter' semantic than policy_nodemsk() + */ +nodemask_t *allowed_policy_nodemask(gfp_t gfp, struct mempolicy *policy) +{ + if (unlikely(policy->mode == MPOL_BIND)) { + if (apply_policy_zone(policy, gfp_zone(gfp)) && + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) + return &policy->nodes; + else + return &zero_nodes; + } + + return NULL; +} + /* * Return the preferred node id for 'prefer' mempolicy, and return * the given id for all other policies. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e008a3df0485..3549ea037588 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -181,6 +181,9 @@ nodemask_t node_states[NR_NODE_STATES] __read_mostly = { }; EXPORT_SYMBOL(node_states); +nodemask_t zero_nodes = NODE_MASK_NONE; +EXPORT_SYMBOL(zero_nodes); + atomic_long_t _totalram_pages __read_mostly; EXPORT_SYMBOL(_totalram_pages); unsigned long totalreserve_pages __read_mostly;
Please make sure to CC Mike on hugetlb related changes. I didn't really get to grasp your proposed solution but it feels goind sideways. The real issue is that hugetlb uses a dedicated allocation scheme which is not fully MPOL_PREFERRED_MANY aware AFAICS. I do not think we should be tricking that by providing some fake nodemasks and what not. The good news is that allocation from the pool is MPOL_PREFERRED_MANY aware because it first tries to allocation from the preffered node mask and then fall back to the full nodemask (dequeue_huge_page_vma). If the existing pools cannot really satisfy that allocation then it tries to allocate a new hugetlb page (alloc_fresh_huge_page) which also performs 2 stage allocation with the node mask and no node masks. But both of them might fail. The bad news is that other allocation functions - including those that allocate to the pool are not fully MPOL_PREFERRED_MANY aware. E.g. __nr_hugepages_store_common paths which use the allocating process policy to fill up the pool so the pool could be under provisioned if that context is using MPOL_PREFERRED_MANY. Wrt. allowed_mems_nr (i.e. hugetlb_acct_memory) this is a reservation code and I have to admit I do not really remember details there. This is a subtle code and my best guess would be that policy_nodemask_current should be hugetlb specific and only care about MPOL_BIND. On Tue 02-08-22 15:39:52, Feng Tang wrote: > On Tue, Aug 02, 2022 at 02:40:11PM +0800, Muchun Song wrote: > > On Tue, Aug 02, 2022 at 01:52:05PM +0800, Feng Tang wrote: > > > On Tue, Aug 02, 2022 at 11:42:52AM +0800, Muchun Song wrote: > > > > On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote: > > > > > On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote: > > > > > > On Mon 01-08-22 16:42:07, Muchun Song wrote: > > > > > > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > > > > > > > for filtering nodes for page allocation, which is a hard restriction (see the user > > > > > > > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > > > > > > > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > > > > > > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > > > > > > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > > > > > > > calling it. BTW, it is found by code inspection. > > > > > > > > > > > > I am not sure this is the right fix. It is quite true that > > > > > > policy_nodemask is a tricky function to use. It pretends to have a > > > > > > higher level logic but all existing users are expected to be policy > > > > > > aware and they special case allocation for each policy. That would mean > > > > > > that hugetlb should do the same. > > > > > > > > > > Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also > > > > > confused about policy_nodemask(), as it is never a 'strict' one as > > > > > the old code is: > > > > > > > > > > if (unlikely(mode == MPOL_BIND) && > > > > > apply_policy_zone(policy, gfp_zone(gfp)) && > > > > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > > > return &policy->nodes; > > > > > > > > > > return NULL > > > > > > > > > > Even when the MPOL_BIND's nodes is not allowed by cpuset, it will > > > > > still return NULL (equals all nodes). > > > > > > > > > > > > > Well, I agree policy_nodemask() is really confusing because of the > > > > shortage of comments and the weird logic. 
> > > > > > > > > From the semantics of allowed_mems_nr(), I think it does get changed > > > > > a little by b27abaccf8e8. And to enforce the 'strict' semantic for > > > > > 'allowed', we may need a more strict nodemask API for it. > > > > > > > > > > > > > Maybe this is a good idea to fix this, e.g. introducing a new helper > > > > to return the strict allowed nodemask. > > > > > > Yep. > > > > > > I had another thought to add one global all-zero nodemask, for API like > > > policy_nodemask(), it has 2 types of return value: > > > * a nodemask with some bits set > > > * NULL (means all nodes) > > > > > > Here a new type of zero nodemask (a gloabl variable)can be created to > > > indicate no qualified node. > > > > > > > I know why you want to introduce a gloable zero nidemask. Since we already > > have a glable nodemask array, namely node_states, instead of returning NULL > > for the case of all nodes, how about returing node_states[N_ONLINE] for it? > > And make it return NULL for the case where no nodes are allowed. Any thought? > > I think return node_states[N_ONLINE] can simplify the code in allowed_mems_nr(), > the empty zero nodes can simplify further. > > Here is some draft patch (not tested) to show the idea > > Thanks, > Feng > > --- > include/linux/mempolicy.h | 8 ++++++++ > include/linux/nodemask.h | 7 +++++++ > mm/hugetlb.c | 7 ++++--- > mm/mempolicy.c | 17 +++++++++++++++++ > mm/page_alloc.c | 3 +++ > 5 files changed, 39 insertions(+), 3 deletions(-) > > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > index 668389b4b53d..b5451fef1620 100644 > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -150,6 +150,7 @@ extern bool init_nodemask_of_mempolicy(nodemask_t *mask); > extern bool mempolicy_in_oom_domain(struct task_struct *tsk, > const nodemask_t *mask); > extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); > +extern nodemask_t *allowed_policy_nodemask(gfp_t gfp, struct mempolicy *policy); > > static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > { > @@ -158,6 +159,13 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > return policy_nodemask(gfp, mpol); > } > > +static inline nodemask_t *allowed_policy_nodemask_current(gfp_t gfp) > +{ > + struct mempolicy *mpol = get_task_policy(current); > + > + return allowed_policy_nodemask(gfp, mpol); > +} > + > extern unsigned int mempolicy_slab_node(void); > > extern enum zone_type policy_zone; > diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h > index 0f233b76c9ce..dc5fab38e810 100644 > --- a/include/linux/nodemask.h > +++ b/include/linux/nodemask.h > @@ -409,6 +409,13 @@ enum node_states { > > extern nodemask_t node_states[NR_NODE_STATES]; > > +extern nodemask_t zero_nodes; > + > +static inline bool is_empty_nodes(nodemask_t *pnodes) > +{ > + return (pnodes == &zero_nodes || __nodes_empty(pnodes, MAX_NUMNODES)); > +} > + > #if MAX_NUMNODES > 1 > static inline int node_state(int node, enum node_states state) > { > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index a57e1be41401..dc9f4ed32909 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -4340,10 +4340,11 @@ static unsigned int allowed_mems_nr(struct hstate *h) > > mpol_allowed = policy_nodemask_current(gfp_mask); > > - for_each_node_mask(node, cpuset_current_mems_allowed) { > - if (!mpol_allowed || node_isset(node, *mpol_allowed)) > + if (is_empty_nodes(mpol_allowed)) > + return 0; > + > + for_each_node_mask(node, mpol_allowed) > nr += array[node]; > - } > > return nr; > } > diff 
--git a/mm/mempolicy.c b/mm/mempolicy.c > index d39b01fd52fe..3e936b8ca9ea 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1845,6 +1845,23 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) > return NULL; > } > > +/* > + * Return the allowed nodes mask for a mempolicy and page allocation, > + * which is a 'stricter' semantic than policy_nodemsk() > + */ > +nodemask_t *allowed_policy_nodemask(gfp_t gfp, struct mempolicy *policy) > +{ > + if (unlikely(policy->mode == MPOL_BIND)) { > + if (apply_policy_zone(policy, gfp_zone(gfp)) && > + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > + return &policy->nodes; > + else > + return &zero_nodes; > + } > + > + return NULL; > +} > + > /* > * Return the preferred node id for 'prefer' mempolicy, and return > * the given id for all other policies. > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index e008a3df0485..3549ea037588 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -181,6 +181,9 @@ nodemask_t node_states[NR_NODE_STATES] __read_mostly = { > }; > EXPORT_SYMBOL(node_states); > > +nodemask_t zero_nodes = NODE_MASK_NONE; > +EXPORT_SYMBOL(zero_nodes); > + > atomic_long_t _totalram_pages __read_mostly; > EXPORT_SYMBOL(_totalram_pages); > unsigned long totalreserve_pages __read_mostly; > -- > 2.27.0 >
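The two-stage behaviour Michal describes for dequeue_huge_page_vma() looks roughly like the excerpt below (simplified sketch; the surrounding reservation handling is omitted):

        /* nid and nodemask were obtained from huge_node(vma, address, gfp_mask, &mpol, &nodemask) */
        if (mpol_is_preferred_many(mpol)) {
                page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);

                /* fall back to all nodes if the preferred nodes are exhausted */
                nodemask = NULL;
        }

        if (!page)
                page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);

Per the message above, alloc_fresh_huge_page() follows the same preferred-then-all pattern when a new page has to be created.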
On Tue, Aug 02, 2022 at 05:02:37PM +0800, Michal Hocko wrote: > Please make sure to CC Mike on hugetlb related changes. OK. > I didn't really get to grasp your proposed solution but it feels goind > sideways. The real issue is that hugetlb uses a dedicated allocation > scheme which is not fully MPOL_PREFERRED_MANY aware AFAICS. I do not > think we should be tricking that by providing some fake nodemasks and > what not. > > The good news is that allocation from the pool is MPOL_PREFERRED_MANY > aware because it first tries to allocation from the preffered node mask > and then fall back to the full nodemask (dequeue_huge_page_vma). > If the existing pools cannot really satisfy that allocation then it > tries to allocate a new hugetlb page (alloc_fresh_huge_page) which also > performs 2 stage allocation with the node mask and no node masks. But > both of them might fail. > > The bad news is that other allocation functions - including those that > allocate to the pool are not fully MPOL_PREFERRED_MANY aware. E.g. > __nr_hugepages_store_common paths which use the allocating process > policy to fill up the pool so the pool could be under provisioned if > that context is using MPOL_PREFERRED_MANY. Thanks for the check! So you mean if the prferred nodes don't have enough pages, we should also fallback to all like dequeue_huge_page_vma() does? Or we can user a policy API which return nodemask for MPOL_BIND and NULL for all other policies, like allowed_mems_nr() needs. --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -158,6 +158,18 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp) return policy_nodemask(gfp, mpol); } +#ifdef CONFIG_HUGETLB_FS +static inline nodemask_t *strict_policy_nodemask_current(void) +{ + struct mempolicy *mpol = get_task_policy(current); + + if (mpol->mode == MPOL_BIND) + return &mpol->nodes; + + return NULL; +} +#endif + > Wrt. allowed_mems_nr (i.e. hugetlb_acct_memory) this is a reservation > code and I have to admit I do not really remember details there. This is > a subtle code and my best guess would be that policy_nodemask_current > should be hugetlb specific and only care about MPOL_BIND. The API needed by allowed_mem_nr() is a little different as it has gfp flag and cpuset config to consider. Thanks, Feng [snip]
On Wed 03-08-22 14:41:20, Feng Tang wrote: > On Tue, Aug 02, 2022 at 05:02:37PM +0800, Michal Hocko wrote: > > Please make sure to CC Mike on hugetlb related changes. > > OK. > > > I didn't really get to grasp your proposed solution but it feels goind > > sideways. The real issue is that hugetlb uses a dedicated allocation > > scheme which is not fully MPOL_PREFERRED_MANY aware AFAICS. I do not > > think we should be tricking that by providing some fake nodemasks and > > what not. > > > > The good news is that allocation from the pool is MPOL_PREFERRED_MANY > > aware because it first tries to allocation from the preffered node mask > > and then fall back to the full nodemask (dequeue_huge_page_vma). > > If the existing pools cannot really satisfy that allocation then it > > tries to allocate a new hugetlb page (alloc_fresh_huge_page) which also > > performs 2 stage allocation with the node mask and no node masks. But > > both of them might fail. > > > > The bad news is that other allocation functions - including those that > > allocate to the pool are not fully MPOL_PREFERRED_MANY aware. E.g. > > __nr_hugepages_store_common paths which use the allocating process > > policy to fill up the pool so the pool could be under provisioned if > > that context is using MPOL_PREFERRED_MANY. > > Thanks for the check! > > So you mean if the prferred nodes don't have enough pages, we should > also fallback to all like dequeue_huge_page_vma() does? > > Or we can user a policy API which return nodemask for MPOL_BIND and > NULL for all other policies, like allowed_mems_nr() needs. > > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -158,6 +158,18 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > return policy_nodemask(gfp, mpol); > } > > +#ifdef CONFIG_HUGETLB_FS > +static inline nodemask_t *strict_policy_nodemask_current(void) > +{ > + struct mempolicy *mpol = get_task_policy(current); > + > + if (mpol->mode == MPOL_BIND) > + return &mpol->nodes; > + > + return NULL; > +} > +#endif Yes something like this, except that I would also move this into hugetlb proper because this doesn't seem generally useful. > > Wrt. allowed_mems_nr (i.e. hugetlb_acct_memory) this is a reservation > > code and I have to admit I do not really remember details there. This is > > a subtle code and my best guess would be that policy_nodemask_current > > should be hugetlb specific and only care about MPOL_BIND. > > The API needed by allowed_mem_nr() is a little different as it has gfp > flag and cpuset config to consider. Why would gfp mask matter?
On Thu 04-08-22 01:14:32, Feng Tang wrote: [...] > Ok, I change it as below: Wouldn't it be better to make this allowed_mems_nr specific to be explicit about the intention? Not that I feel strongly about that. > --- > mm/hugetlb.c | 28 +++++++++++++++++++++++----- > 1 file changed, 23 insertions(+), 5 deletions(-) Not even compile tested include/linux/mempolicy.h | 12 ------------ mm/hugetlb.c | 24 ++++++++++++++++++++---- 2 files changed, 20 insertions(+), 16 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 668389b4b53d..e38b0ef20b8b 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -151,13 +151,6 @@ extern bool mempolicy_in_oom_domain(struct task_struct *tsk, const nodemask_t *mask); extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) -{ - struct mempolicy *mpol = get_task_policy(current); - - return policy_nodemask(gfp, mpol); -} - extern unsigned int mempolicy_slab_node(void); extern enum zone_type policy_zone; @@ -294,11 +287,6 @@ static inline void mpol_put_task_policy(struct task_struct *task) { } -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) -{ - return NULL; -} - static inline bool mpol_is_preferred_many(struct mempolicy *pol) { return false; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a18c071c294e..6cacbc9b15a1 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4330,18 +4330,34 @@ static int __init default_hugepagesz_setup(char *s) } __setup("default_hugepagesz=", default_hugepagesz_setup); +struct mempolicy *policy_mbind_nodemask(gfp_t gfp) +{ +#ifdef CONFIG_MEMPOLICY + struct mempolicy *mpol = get_task_policy(current); + + /* + * only enforce MBIND which overlaps with cpuset policy (from policy_nodemask) + * specifically for hugetlb case + */ + if (mpol->mode == MPOL_BIND && + (apply_policy_zone(mpol, gfp_zone(gfp)) && + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) + return &mpol->nodes; +#endif + return NULL; +} + static unsigned int allowed_mems_nr(struct hstate *h) { int node; unsigned int nr = 0; - nodemask_t *mpol_allowed; + nodemask_t *mbind_nodemask; unsigned int *array = h->free_huge_pages_node; gfp_t gfp_mask = htlb_alloc_mask(h); - mpol_allowed = policy_nodemask_current(gfp_mask); - + mbind_nodemask = policy_mbind_nodemask(gfp_mask); for_each_node_mask(node, cpuset_current_mems_allowed) { - if (!mpol_allowed || node_isset(node, *mpol_allowed)) + if (!mbind_nodemask || node_isset(node, *mbind_nodemask)) nr += array[node]; }
On Thu 04-08-22 04:43:06, Feng Tang wrote: > On Wed, Aug 03, 2022 at 07:28:59PM +0800, Michal Hocko wrote: [...] > > +struct mempolicy *policy_mbind_nodemask(gfp_t gfp) > > +{ > > +#ifdef CONFIG_MEMPOLICY > > + struct mempolicy *mpol = get_task_policy(current); > > + > > + /* > > + * only enforce MBIND which overlaps with cpuset policy (from policy_nodemask) > > + * specifically for hugetlb case > > + */ > > + if (mpol->mode == MPOL_BIND && > > + (apply_policy_zone(mpol, gfp_zone(gfp)) && > > + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > + return &mpol->nodes; > > +#endif > > + return NULL; > > I saw the logic is not changed, and it confused me that if there is > no qualified node, it will still return NULL which effectively equals > node_states[N_MEMORY], while I think it should return a all zero > nodemasks. This is a separate thing and I have to admit that the existing code is rather non-intuitive or even broken. I guess we do not care all that much because MBIND with completely non-overlapping cpusets is just a broken configuration. I am not sure this case is interesting or even supported.
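A tiny model of the corner case being set aside here, with masks invented for illustration:

#include <stdio.h>

int main(void)
{
        unsigned long mbind_nodes = 0xc;        /* MPOL_BIND to nodes {2,3}  */
        unsigned long cpuset_mems = 0x3;        /* cpuset allows nodes {0,1} */

        /*
         * cpuset_nodemask_valid_mems_allowed() is essentially this
         * intersection test; when it fails, the helper returns NULL and the
         * caller behaves as if there were no mempolicy restriction at all.
         */
        if (!(mbind_nodes & cpuset_mems))
                printf("no overlap: NULL returned, all cpuset nodes counted\n");

        return 0;
}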
On Thu 04-08-22 05:08:34, Feng Tang wrote: [...] > Do we still need the other nodemask API I proposed earlier which has > no parameter of gfp_flag, and used for __nr_hugepages_store_common? I would touch as little code as possible.
On Wed, Aug 03, 2022 at 03:36:41PM +0800, Michal Hocko wrote: > On Wed 03-08-22 14:41:20, Feng Tang wrote: > > On Tue, Aug 02, 2022 at 05:02:37PM +0800, Michal Hocko wrote: > > > Please make sure to CC Mike on hugetlb related changes. > > > > OK. > > > > > I didn't really get to grasp your proposed solution but it feels goind > > > sideways. The real issue is that hugetlb uses a dedicated allocation > > > scheme which is not fully MPOL_PREFERRED_MANY aware AFAICS. I do not > > > think we should be tricking that by providing some fake nodemasks and > > > what not. > > > > > > The good news is that allocation from the pool is MPOL_PREFERRED_MANY > > > aware because it first tries to allocation from the preffered node mask > > > and then fall back to the full nodemask (dequeue_huge_page_vma). > > > If the existing pools cannot really satisfy that allocation then it > > > tries to allocate a new hugetlb page (alloc_fresh_huge_page) which also > > > performs 2 stage allocation with the node mask and no node masks. But > > > both of them might fail. > > > > > > The bad news is that other allocation functions - including those that > > > allocate to the pool are not fully MPOL_PREFERRED_MANY aware. E.g. > > > __nr_hugepages_store_common paths which use the allocating process > > > policy to fill up the pool so the pool could be under provisioned if > > > that context is using MPOL_PREFERRED_MANY. > > > > Thanks for the check! > > > > So you mean if the prferred nodes don't have enough pages, we should > > also fallback to all like dequeue_huge_page_vma() does? > > > > Or we can user a policy API which return nodemask for MPOL_BIND and > > NULL for all other policies, like allowed_mems_nr() needs. > > > > --- a/include/linux/mempolicy.h > > +++ b/include/linux/mempolicy.h > > @@ -158,6 +158,18 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > > return policy_nodemask(gfp, mpol); > > } > > > > +#ifdef CONFIG_HUGETLB_FS > > +static inline nodemask_t *strict_policy_nodemask_current(void) > > +{ > > + struct mempolicy *mpol = get_task_policy(current); > > + > > + if (mpol->mode == MPOL_BIND) > > + return &mpol->nodes; > > + > > + return NULL; > > +} > > +#endif > > Yes something like this, except that I would also move this into hugetlb > proper because this doesn't seem generally useful. 
Ok, I change it as below: --- mm/hugetlb.c | 28 +++++++++++++++++++++++----- 1 file changed, 23 insertions(+), 5 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 14be38822cf8..ef1d4ffa733f 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -91,6 +91,24 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp; /* Forward declaration */ static int hugetlb_acct_memory(struct hstate *h, long delta); +/* + * Return nodemask of what is allowed by current process' memory + * policy, as MPOL_BIND is the only 'strict' policy, return NULL + * for all other policies + */ +static inline nodemask_t *allowed_policy_nodemask_current(void) +{ +#ifdef CONFIG_NUMA + struct mempolicy *mpol = get_task_policy(current); + + if (mpol->mode == MPOL_BIND) + return &mpol->nodes; + return NULL; +#else + return NULL; +#endif +} + static inline bool subpool_is_free(struct hugepage_subpool *spool) { if (spool->count) @@ -3556,7 +3574,7 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy, unsigned long count, size_t len) { int err; - nodemask_t nodes_allowed, *n_mask; + nodemask_t nodes_allowed, *n_mask = NULL; if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) return -EINVAL; @@ -3565,11 +3583,11 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy, /* * global hstate attribute */ - if (!(obey_mempolicy && - init_nodemask_of_mempolicy(&nodes_allowed))) + if (obey_mempolicy) + n_mask = allowed_policy_nodemask_current(); + + if (!n_mask) n_mask = &node_states[N_MEMORY]; - else - n_mask = &nodes_allowed; } else { /* * Node specific request. count adjustment happens in
On Wed, Aug 03, 2022 at 07:28:59PM +0800, Michal Hocko wrote: > On Thu 04-08-22 01:14:32, Feng Tang wrote: > [...] > > Ok, I change it as below: > > Wouldn't it be better to make this allowed_mems_nr specific to be > explicit about the intention? Yes, it is. > Not that I feel strongly about that. > > > --- > > mm/hugetlb.c | 28 +++++++++++++++++++++++----- > > 1 file changed, 23 insertions(+), 5 deletions(-) > > Not even compile tested > include/linux/mempolicy.h | 12 ------------ > mm/hugetlb.c | 24 ++++++++++++++++++++---- > 2 files changed, 20 insertions(+), 16 deletions(-) > > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > index 668389b4b53d..e38b0ef20b8b 100644 > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -151,13 +151,6 @@ extern bool mempolicy_in_oom_domain(struct task_struct *tsk, > const nodemask_t *mask); > extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); > > -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > -{ > - struct mempolicy *mpol = get_task_policy(current); > - > - return policy_nodemask(gfp, mpol); > -} > - > extern unsigned int mempolicy_slab_node(void); > > extern enum zone_type policy_zone; > @@ -294,11 +287,6 @@ static inline void mpol_put_task_policy(struct task_struct *task) > { > } > > -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > -{ > - return NULL; > -} > - > static inline bool mpol_is_preferred_many(struct mempolicy *pol) > { > return false; > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index a18c071c294e..6cacbc9b15a1 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -4330,18 +4330,34 @@ static int __init default_hugepagesz_setup(char *s) > } > __setup("default_hugepagesz=", default_hugepagesz_setup); > > +struct mempolicy *policy_mbind_nodemask(gfp_t gfp) > +{ > +#ifdef CONFIG_MEMPOLICY > + struct mempolicy *mpol = get_task_policy(current); > + > + /* > + * only enforce MBIND which overlaps with cpuset policy (from policy_nodemask) > + * specifically for hugetlb case > + */ > + if (mpol->mode == MPOL_BIND && > + (apply_policy_zone(mpol, gfp_zone(gfp)) && > + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > + return &mpol->nodes; > +#endif > + return NULL; I saw the logic is not changed, and it confused me that if there is no qualified node, it will still return NULL which effectively equals node_states[N_MEMORY], while I think it should return a all zero nodemasks. Thanks, Feng > +} > + > static unsigned int allowed_mems_nr(struct hstate *h) > { > int node; > unsigned int nr = 0; > - nodemask_t *mpol_allowed; > + nodemask_t *mbind_nodemask; > unsigned int *array = h->free_huge_pages_node; > gfp_t gfp_mask = htlb_alloc_mask(h); > > - mpol_allowed = policy_nodemask_current(gfp_mask); > - > + mbind_nodemask = policy_mbind_nodemask(gfp_mask); > for_each_node_mask(node, cpuset_current_mems_allowed) { > - if (!mpol_allowed || node_isset(node, *mpol_allowed)) > + if (!mbind_nodemask || node_isset(node, *mbind_nodemask)) > nr += array[node]; > } > > -- > Michal Hocko > SUSE Labs
On Wed, Aug 03, 2022 at 08:56:44PM +0800, Michal Hocko wrote: > On Thu 04-08-22 04:43:06, Feng Tang wrote: > > On Wed, Aug 03, 2022 at 07:28:59PM +0800, Michal Hocko wrote: > [...] > > > +struct mempolicy *policy_mbind_nodemask(gfp_t gfp) > > > +{ > > > +#ifdef CONFIG_MEMPOLICY > > > + struct mempolicy *mpol = get_task_policy(current); > > > + > > > + /* > > > + * only enforce MBIND which overlaps with cpuset policy (from policy_nodemask) > > > + * specifically for hugetlb case > > > + */ > > > + if (mpol->mode == MPOL_BIND && > > > + (apply_policy_zone(mpol, gfp_zone(gfp)) && > > > + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > + return &mpol->nodes; > > > +#endif > > > + return NULL; > > > > I saw the logic is not changed, and it confused me that if there is > > no qualified node, it will still return NULL which effectively equals > > node_states[N_MEMORY], while I think it should return a all zero > > nodemasks. > > This is a separate thing and I have to admit that the existing code is > rather non-intuitive or even broken. I guess we do not care all that > much because MBIND with completely non-overlapping cpusets is just a > broken configuration. I am not sure this case is interesting or even > supported. Fair enough, and moving the policy_mbind_nodemask() into hugetlb.c for one single caller make it much less severe. Do we still need the other nodemask API I proposed earlier which has no parameter of gfp_flag, and used for __nr_hugepages_store_common? Thanks, Feng > -- > Michal Hocko > SUSE Labs >
On Wed, Aug 03, 2022 at 09:21:22PM +0800, Michal Hocko wrote: > On Thu 04-08-22 05:08:34, Feng Tang wrote: > [...] > > Do we still need the other nodemask API I proposed earlier which has > > no parameter of gfp_flag, and used for __nr_hugepages_store_common? > > I would touch as little code as possible. OK. Please review the following patch, thanks! - Feng --- From a2db9a57da616bb3ea21e48a4a9ceb5c2cf4f7a2 Mon Sep 17 00:00:00 2001 From: Feng Tang <feng.tang@intel.com> Date: Thu, 4 Aug 2022 09:39:24 +0800 Subject: [PATCH RFC] mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process Muchun Song found that after MPOL_PREFERRED_MANY policy was introduced in commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") [1], the policy_nodemask_current()'s semantics for this new policy has been changed, which returns 'preferred' nodes instead of 'allowed' nodes, and could hurt the usage of its caller in hugetlb: allowed_mems_nr(). Michal found the policy_nodemask_current() is only used by hugetlb, and suggested to move it to hugetlb code with more explicit name to enforce the 'allowed' semantics for which only MPOL_BIND policy matters. One note for the new policy_mbind_nodemask() is, the cross check from MPOL_BIND, gfp flags and cpuset configuration can lead to a no available node case, which is considered to be broken configuration and 'NULL' (equals all nodes) is returned. [1]. https://lore.kernel.org/lkml/20220801084207.39086-1-songmuchun@bytedance.com/t/ Reported-by: Muchun Song <songmuchun@bytedance.com> Suggested-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Feng Tang <feng.tang@intel.com> --- include/linux/mempolicy.h | 32 ++++++++++++++++++++------------ mm/hugetlb.c | 24 ++++++++++++++++++++---- mm/mempolicy.c | 20 -------------------- 3 files changed, 40 insertions(+), 36 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 668389b4b53d..ea0168fffb4c 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -151,13 +151,6 @@ extern bool mempolicy_in_oom_domain(struct task_struct *tsk, const nodemask_t *mask); extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) -{ - struct mempolicy *mpol = get_task_policy(current); - - return policy_nodemask(gfp, mpol); -} - extern unsigned int mempolicy_slab_node(void); extern enum zone_type policy_zone; @@ -189,6 +182,26 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol) return (pol->mode == MPOL_PREFERRED_MANY); } +static inline int apply_policy_zone(struct mempolicy *policy, enum zone_type zone) +{ + enum zone_type dynamic_policy_zone = policy_zone; + + BUG_ON(dynamic_policy_zone == ZONE_MOVABLE); + + /* + * if policy->nodes has movable memory only, + * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only. + * + * policy->nodes is intersect with node_states[N_MEMORY]. + * so if the following test fails, it implies + * policy->nodes has movable memory only. 
+ */ + if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY])) + dynamic_policy_zone = ZONE_MOVABLE; + + return zone >= dynamic_policy_zone; +} + #else @@ -294,11 +307,6 @@ static inline void mpol_put_task_policy(struct task_struct *task) { } -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) -{ - return NULL; -} - static inline bool mpol_is_preferred_many(struct mempolicy *pol) { return false; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a18c071c294e..ad84bb85b6de 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4330,18 +4330,34 @@ static int __init default_hugepagesz_setup(char *s) } __setup("default_hugepagesz=", default_hugepagesz_setup); +static nodemask_t *policy_mbind_nodemask(gfp_t gfp) +{ +#ifdef CONFIG_NUMA + struct mempolicy *mpol = get_task_policy(current); + + /* + * Only enforce MPOL_BIND policy which overlaps with cpuset policy + * (from policy_nodemask) specifically for hugetlb case + */ + if (mpol->mode == MPOL_BIND && + (apply_policy_zone(mpol, gfp_zone(gfp)) && + cpuset_nodemask_valid_mems_allowed(&mpol->nodes))) + return &mpol->nodes; +#endif + return NULL; +} + static unsigned int allowed_mems_nr(struct hstate *h) { int node; unsigned int nr = 0; - nodemask_t *mpol_allowed; + nodemask_t *mbind_nodemask; unsigned int *array = h->free_huge_pages_node; gfp_t gfp_mask = htlb_alloc_mask(h); - mpol_allowed = policy_nodemask_current(gfp_mask); - + mbind_nodemask = policy_mbind_nodemask(gfp_mask); for_each_node_mask(node, cpuset_current_mems_allowed) { - if (!mpol_allowed || node_isset(node, *mpol_allowed)) + if (!mbind_nodemask || node_isset(node, *mbind_nodemask)) nr += array[node]; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c index d39b01fd52fe..5553bd53927f 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1805,26 +1805,6 @@ bool vma_policy_mof(struct vm_area_struct *vma) return pol->flags & MPOL_F_MOF; } -static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone) -{ - enum zone_type dynamic_policy_zone = policy_zone; - - BUG_ON(dynamic_policy_zone == ZONE_MOVABLE); - - /* - * if policy->nodes has movable memory only, - * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only. - * - * policy->nodes is intersect with node_states[N_MEMORY]. - * so if the following test fails, it implies - * policy->nodes has movable memory only. - */ - if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY])) - dynamic_policy_zone = ZONE_MOVABLE; - - return zone >= dynamic_policy_zone; -} - /* * Return a nodemask representing a mempolicy for filtering nodes for * page allocation
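For anyone wanting to exercise the path the patch changes, a rough userspace test could look like the following. The policy constant, node number, huge page size, and the expectation that a pre-fix kernel fails the mmap are all assumptions about the local setup (the failure only shows up when the preferred nodes have no free huge pages while other allowed nodes do); build with -lnuma.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <numaif.h>

#ifndef MPOL_PREFERRED_MANY
#define MPOL_PREFERRED_MANY 5           /* may be missing from older numaif.h */
#endif

int main(void)
{
        unsigned long nodemask = 1UL << 0;      /* prefer node 0 only */
        size_t len = 2UL << 20;                 /* one 2 MB huge page */
        void *p;

        if (set_mempolicy(MPOL_PREFERRED_MANY, &nodemask, 8 * sizeof(nodemask)))
                perror("set_mempolicy");        /* needs a kernel with MPOL_PREFERRED_MANY */

        /* the reservation checked by hugetlb_acct_memory() happens at mmap() time */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap(MAP_HUGETLB)");
                return 1;
        }

        memset(p, 0, len);                      /* fault the page in */
        puts("hugetlb mapping succeeded");
        return 0;
}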
On Thu 04-08-22 16:27:24, Feng Tang wrote: [...] > >From a2db9a57da616bb3ea21e48a4a9ceb5c2cf4f7a2 Mon Sep 17 00:00:00 2001 > From: Feng Tang <feng.tang@intel.com> > Date: Thu, 4 Aug 2022 09:39:24 +0800 > Subject: [PATCH RFC] mm/hugetlb: add dedicated func to get 'allowed' nodemask for > current process > > Muchun Song found that after MPOL_PREFERRED_MANY policy was introduced > in commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > [1], the policy_nodemask_current()'s semantics for this new policy > has been changed, which returns 'preferred' nodes instead of 'allowed' > nodes, and could hurt the usage of its caller in hugetlb: > allowed_mems_nr(). > > Michal found the policy_nodemask_current() is only used by hugetlb, > and suggested to move it to hugetlb code with more explicit name to > enforce the 'allowed' semantics for which only MPOL_BIND policy > matters. > > One note for the new policy_mbind_nodemask() is, the cross check > from MPOL_BIND, gfp flags and cpuset configuration can lead to > a no available node case, which is considered to be broken > configuration and 'NULL' (equals all nodes) is returned. > > [1]. https://lore.kernel.org/lkml/20220801084207.39086-1-songmuchun@bytedance.com/t/ > Reported-by: Muchun Song <songmuchun@bytedance.com> > Suggested-by: Michal Hocko <mhocko@suse.com> > Signed-off-by: Feng Tang <feng.tang@intel.com> LGTM I would just make apply_policy_zone extern rather than making it static inline in a header which can turn out to cause other header dependencies. Thanks!
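Michal's suggestion amounts to exposing the existing helper instead of inlining it in the header; a sketch of that direction (not a posted follow-up):

/* include/linux/mempolicy.h: declaration only, no extra header dependencies */
extern int apply_policy_zone(struct mempolicy *policy, enum zone_type zone);

/* mm/mempolicy.c: keep the existing body, just drop 'static' */
int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
{
        enum zone_type dynamic_policy_zone = policy_zone;

        BUG_ON(dynamic_policy_zone == ZONE_MOVABLE);

        if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY]))
                dynamic_policy_zone = ZONE_MOVABLE;

        return zone >= dynamic_policy_zone;
}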
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 6c27acb6cd63..4deec7e598c6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 			cpuset_nodemask_valid_mems_allowed(&policy->nodes))
 		return &policy->nodes;
 
-	if (mode == MPOL_PREFERRED_MANY)
-		return &policy->nodes;
-
 	return NULL;
 }
 
policy_nodemask() is supposed to return a nodemask representing a mempolicy
used for filtering nodes for page allocation, i.e. a hard restriction (see the
caller allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a
preference, not a hard restriction, so handling it here breaks the HugeTLB
user. Remove it from policy_nodemask() to fix this; current users of
policy_nodemask() are unaffected since they all handle the
MPOL_PREFERRED_MANY case before calling it. Found by code inspection.

Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/mempolicy.c | 3 ---
 1 file changed, 3 deletions(-)