diff mbox series

mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case

Message ID 20220801084207.39086-1-songmuchun@bytedance.com (mailing list archive)
State New
Headers show
Series mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case | expand

Commit Message

Muchun Song Aug. 1, 2022, 8:42 a.m. UTC
policy_nodemask() is supposed to be returned a nodemask representing a mempolicy
for filtering nodes for page allocation, which is a hard restriction (see the user
of allowed_mems_nr() in hugetlb.c).  However, MPOL_PREFERRED_MANY is a preferred
mode not a hard restriction.  Now it breaks the user of HugeTLB.  Remove it from
policy_nodemask() to fix it, which will not affect current users of policy_nodemask()
since all of the users already have handled the case of MPOL_PREFERRED_MANY before
calling it.  BTW, it is found by code inspection.

Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/mempolicy.c | 3 ---
 1 file changed, 3 deletions(-)

Comments

Michal Hocko Aug. 1, 2022, 9:06 a.m. UTC | #1
On Mon 01-08-22 16:42:07, Muchun Song wrote:
> policy_nodemask() is supposed to be returned a nodemask representing a mempolicy
> for filtering nodes for page allocation, which is a hard restriction (see the user
> of allowed_mems_nr() in hugetlb.c).  However, MPOL_PREFERRED_MANY is a preferred
> mode not a hard restriction.  Now it breaks the user of HugeTLB.  Remove it from
> policy_nodemask() to fix it, which will not affect current users of policy_nodemask()
> since all of the users already have handled the case of MPOL_PREFERRED_MANY before
> calling it.  BTW, it is found by code inspection.

I am not sure this is the right fix. It is quite true that
policy_nodemask is a tricky function to use. It pretends to have a
higher level logic but all existing users are expected to be policy
aware and they special case allocation for each policy. That would mean
that hugetlb should do the same.

I haven't checked the actual behavior implications for hugetlb here. Is
MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it
work? From a quick look this just ignores MPOL_PREFERRED_MANY
completely.
 
> Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes")
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/mempolicy.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 6c27acb6cd63..4deec7e598c6 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
>  		cpuset_nodemask_valid_mems_allowed(&policy->nodes))
>  		return &policy->nodes;
>  
> -	if (mode == MPOL_PREFERRED_MANY)
> -		return &policy->nodes;
> -
>  	return NULL;
>  }
>  
> -- 
> 2.11.0
Feng Tang Aug. 1, 2022, 9:26 a.m. UTC | #2
On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote:
> On Mon 01-08-22 16:42:07, Muchun Song wrote:
> > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy
> > for filtering nodes for page allocation, which is a hard restriction (see the user
> > of allowed_mems_nr() in hugetlb.c).  However, MPOL_PREFERRED_MANY is a preferred
> > mode not a hard restriction.  Now it breaks the user of HugeTLB.  Remove it from
> > policy_nodemask() to fix it, which will not affect current users of policy_nodemask()
> > since all of the users already have handled the case of MPOL_PREFERRED_MANY before
> > calling it.  BTW, it is found by code inspection.
> 
> I am not sure this is the right fix. It is quite true that
> policy_nodemask is a tricky function to use. It pretends to have a
> higher level logic but all existing users are expected to be policy
> aware and they special case allocation for each policy. That would mean
> that hugetlb should do the same.

Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also
confused about policy_nodemask(), as it is never a 'strict' one as
the old code is:

	if (unlikely(mode == MPOL_BIND) &&
		apply_policy_zone(policy, gfp_zone(gfp)) &&
		cpuset_nodemask_valid_mems_allowed(&policy->nodes))
		return &policy->nodes;

	return NULL

Even when the MPOL_BIND's nodes is not allowed by cpuset, it will 
still return NULL (equals all nodes).

From the semantics of allowed_mems_nr(), I think it does get changed
a little by b27abaccf8e8. And to enforce the 'strict' semantic for
'allowed', we may need a more strict nodemask API for it.

> I haven't checked the actual behavior implications for hugetlb here. Is
> MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it
> work? From a quick look this just ignores MPOL_PREFERRED_MANY
> completely.

IIRC, the hugetlb will hornor MPOL_PREFERRED_MANY. And I can double
check and report back if otherwise.

> > Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes")
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  mm/mempolicy.c | 3 ---
> >  1 file changed, 3 deletions(-)
> > 
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 6c27acb6cd63..4deec7e598c6 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> >  		cpuset_nodemask_valid_mems_allowed(&policy->nodes))
> >  		return &policy->nodes;
> >  
> > -	if (mode == MPOL_PREFERRED_MANY)
> > -		return &policy->nodes;

I think it will make MPOL_PREFERRED_MANY not usable.

Thanks,
Feng

> > -
> >  	return NULL;
> >  }
> >  
> > -- 
> > 2.11.0
> 
> -- 
> Michal Hocko
> SUSE Labs
Muchun Song Aug. 2, 2022, 3:42 a.m. UTC | #3
On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote:
> On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote:
> > On Mon 01-08-22 16:42:07, Muchun Song wrote:
> > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy
> > > for filtering nodes for page allocation, which is a hard restriction (see the user
> > > of allowed_mems_nr() in hugetlb.c).  However, MPOL_PREFERRED_MANY is a preferred
> > > mode not a hard restriction.  Now it breaks the user of HugeTLB.  Remove it from
> > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask()
> > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before
> > > calling it.  BTW, it is found by code inspection.
> > 
> > I am not sure this is the right fix. It is quite true that
> > policy_nodemask is a tricky function to use. It pretends to have a
> > higher level logic but all existing users are expected to be policy
> > aware and they special case allocation for each policy. That would mean
> > that hugetlb should do the same.
> 
> Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also
> confused about policy_nodemask(), as it is never a 'strict' one as
> the old code is:
> 
> 	if (unlikely(mode == MPOL_BIND) &&
> 		apply_policy_zone(policy, gfp_zone(gfp)) &&
> 		cpuset_nodemask_valid_mems_allowed(&policy->nodes))
> 		return &policy->nodes;
> 
> 	return NULL
> 
> Even when the MPOL_BIND's nodes is not allowed by cpuset, it will 
> still return NULL (equals all nodes).
>

Well, I agree policy_nodemask() is really confusing because of the
shortage of comments and the weird logic.

> From the semantics of allowed_mems_nr(), I think it does get changed
> a little by b27abaccf8e8. And to enforce the 'strict' semantic for
> 'allowed', we may need a more strict nodemask API for it.
>

Maybe this is a good idea to fix this, e.g. introducing a new helper
to return the strict allowed nodemask.

> > I haven't checked the actual behavior implications for hugetlb here. Is
> > MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it
> > work? From a quick look this just ignores MPOL_PREFERRED_MANY
> > completely.
> 
> IIRC, the hugetlb will hornor MPOL_PREFERRED_MANY. And I can double
> check and report back if otherwise.
>
> > > Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes")
> > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > ---
> > >  mm/mempolicy.c | 3 ---
> > >  1 file changed, 3 deletions(-)
> > > 
> > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > index 6c27acb6cd63..4deec7e598c6 100644
> > > --- a/mm/mempolicy.c
> > > +++ b/mm/mempolicy.c
> > > @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > >  		cpuset_nodemask_valid_mems_allowed(&policy->nodes))
> > >  		return &policy->nodes;
> > >  
> > > -	if (mode == MPOL_PREFERRED_MANY)
> > > -		return &policy->nodes;
> 
> I think it will make MPOL_PREFERRED_MANY not usable.
>

Sorry, I didn't got what you mean here. Could you explain more details
about why it is not usable?

Thanks.

> Thanks,
> Feng
> 
> > > -
> > >  	return NULL;
> > >  }
> > >  
> > > -- 
> > > 2.11.0
> > 
> > -- 
> > Michal Hocko
> > SUSE Labs
>
Feng Tang Aug. 2, 2022, 5:52 a.m. UTC | #4
On Tue, Aug 02, 2022 at 11:42:52AM +0800, Muchun Song wrote:
> On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote:
> > On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote:
> > > On Mon 01-08-22 16:42:07, Muchun Song wrote:
> > > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy
> > > > for filtering nodes for page allocation, which is a hard restriction (see the user
> > > > of allowed_mems_nr() in hugetlb.c).  However, MPOL_PREFERRED_MANY is a preferred
> > > > mode not a hard restriction.  Now it breaks the user of HugeTLB.  Remove it from
> > > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask()
> > > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before
> > > > calling it.  BTW, it is found by code inspection.
> > > 
> > > I am not sure this is the right fix. It is quite true that
> > > policy_nodemask is a tricky function to use. It pretends to have a
> > > higher level logic but all existing users are expected to be policy
> > > aware and they special case allocation for each policy. That would mean
> > > that hugetlb should do the same.
> > 
> > Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also
> > confused about policy_nodemask(), as it is never a 'strict' one as
> > the old code is:
> > 
> > 	if (unlikely(mode == MPOL_BIND) &&
> > 		apply_policy_zone(policy, gfp_zone(gfp)) &&
> > 		cpuset_nodemask_valid_mems_allowed(&policy->nodes))
> > 		return &policy->nodes;
> > 
> > 	return NULL
> > 
> > Even when the MPOL_BIND's nodes is not allowed by cpuset, it will 
> > still return NULL (equals all nodes).
> >
> 
> Well, I agree policy_nodemask() is really confusing because of the
> shortage of comments and the weird logic.
> 
> > From the semantics of allowed_mems_nr(), I think it does get changed
> > a little by b27abaccf8e8. And to enforce the 'strict' semantic for
> > 'allowed', we may need a more strict nodemask API for it.
> >
> 
> Maybe this is a good idea to fix this, e.g. introducing a new helper
> to return the strict allowed nodemask.

Yep. 

I had another thought to add one global all-zero nodemask, for API like
policy_nodemask(), it has 2 types of return value:
* a nodemask with some bits set
* NULL (means all nodes)

Here a new type of zero nodemask (a gloabl variable)can be created to
indicate no qualified node.

> > > I haven't checked the actual behavior implications for hugetlb here. Is
> > > MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it
> > > work? From a quick look this just ignores MPOL_PREFERRED_MANY
> > > completely.
> > 
> > IIRC, the hugetlb will hornor MPOL_PREFERRED_MANY. And I can double
> > check and report back if otherwise.
> >
> > > > Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes")
> > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > > ---
> > > >  mm/mempolicy.c | 3 ---
> > > >  1 file changed, 3 deletions(-)
> > > > 
> > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > > index 6c27acb6cd63..4deec7e598c6 100644
> > > > --- a/mm/mempolicy.c
> > > > +++ b/mm/mempolicy.c
> > > > @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > > >  		cpuset_nodemask_valid_mems_allowed(&policy->nodes))
> > > >  		return &policy->nodes;
> > > >  
> > > > -	if (mode == MPOL_PREFERRED_MANY)
> > > > -		return &policy->nodes;
> > 
> > I think it will make MPOL_PREFERRED_MANY not usable.
> >
> 
> Sorry, I didn't got what you mean here. Could you explain more details
> about why it is not usable?
 
I thought alloc_pages() will rely on policy_nodemask(), which was wrong
as I forgot the MPOL_PREFERRED_MANY has a dedicated function
alloc_pages_preferred_many() to handle it. Sorry for the confusion.

Thanks,
Feng

> Thanks.
> 
> > Thanks,
> > Feng
> > 
> > > > -
> > > >  	return NULL;
> > > >  }
> > > >  
> > > > -- 
> > > > 2.11.0
> > > 
> > > -- 
> > > Michal Hocko
> > > SUSE Labs
> > 
>
Muchun Song Aug. 2, 2022, 6:40 a.m. UTC | #5
On Tue, Aug 02, 2022 at 01:52:05PM +0800, Feng Tang wrote:
> On Tue, Aug 02, 2022 at 11:42:52AM +0800, Muchun Song wrote:
> > On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote:
> > > On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote:
> > > > On Mon 01-08-22 16:42:07, Muchun Song wrote:
> > > > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy
> > > > > for filtering nodes for page allocation, which is a hard restriction (see the user
> > > > > of allowed_mems_nr() in hugetlb.c).  However, MPOL_PREFERRED_MANY is a preferred
> > > > > mode not a hard restriction.  Now it breaks the user of HugeTLB.  Remove it from
> > > > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask()
> > > > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before
> > > > > calling it.  BTW, it is found by code inspection.
> > > > 
> > > > I am not sure this is the right fix. It is quite true that
> > > > policy_nodemask is a tricky function to use. It pretends to have a
> > > > higher level logic but all existing users are expected to be policy
> > > > aware and they special case allocation for each policy. That would mean
> > > > that hugetlb should do the same.
> > > 
> > > Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also
> > > confused about policy_nodemask(), as it is never a 'strict' one as
> > > the old code is:
> > > 
> > > 	if (unlikely(mode == MPOL_BIND) &&
> > > 		apply_policy_zone(policy, gfp_zone(gfp)) &&
> > > 		cpuset_nodemask_valid_mems_allowed(&policy->nodes))
> > > 		return &policy->nodes;
> > > 
> > > 	return NULL
> > > 
> > > Even when the MPOL_BIND's nodes is not allowed by cpuset, it will 
> > > still return NULL (equals all nodes).
> > >
> > 
> > Well, I agree policy_nodemask() is really confusing because of the
> > shortage of comments and the weird logic.
> > 
> > > From the semantics of allowed_mems_nr(), I think it does get changed
> > > a little by b27abaccf8e8. And to enforce the 'strict' semantic for
> > > 'allowed', we may need a more strict nodemask API for it.
> > >
> > 
> > Maybe this is a good idea to fix this, e.g. introducing a new helper
> > to return the strict allowed nodemask.
> 
> Yep. 
> 
> I had another thought to add one global all-zero nodemask, for API like
> policy_nodemask(), it has 2 types of return value:
> * a nodemask with some bits set
> * NULL (means all nodes)
> 
> Here a new type of zero nodemask (a gloabl variable)can be created to
> indicate no qualified node.
>

I know why you want to introduce a gloable zero nidemask. Since we already
have a glable nodemask array, namely node_states, instead of returning NULL
for the case of all nodes, how about returing node_states[N_ONLINE] for it?
And make it return NULL for the case where no nodes are allowed. Any thought?
 
> > > > I haven't checked the actual behavior implications for hugetlb here. Is
> > > > MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it
> > > > work? From a quick look this just ignores MPOL_PREFERRED_MANY
> > > > completely.
> > > 
> > > IIRC, the hugetlb will hornor MPOL_PREFERRED_MANY. And I can double
> > > check and report back if otherwise.
> > >
> > > > > Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes")
> > > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > > > ---
> > > > >  mm/mempolicy.c | 3 ---
> > > > >  1 file changed, 3 deletions(-)
> > > > > 
> > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > > > index 6c27acb6cd63..4deec7e598c6 100644
> > > > > --- a/mm/mempolicy.c
> > > > > +++ b/mm/mempolicy.c
> > > > > @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > > > >  		cpuset_nodemask_valid_mems_allowed(&policy->nodes))
> > > > >  		return &policy->nodes;
> > > > >  
> > > > > -	if (mode == MPOL_PREFERRED_MANY)
> > > > > -		return &policy->nodes;
> > > 
> > > I think it will make MPOL_PREFERRED_MANY not usable.
> > >
> > 
> > Sorry, I didn't got what you mean here. Could you explain more details
> > about why it is not usable?
>  
> I thought alloc_pages() will rely on policy_nodemask(), which was wrong
> as I forgot the MPOL_PREFERRED_MANY has a dedicated function
> alloc_pages_preferred_many() to handle it. Sorry for the confusion.
> 
> Thanks,
> Feng
> 
> > Thanks.
> > 
> > > Thanks,
> > > Feng
> > > 
> > > > > -
> > > > >  	return NULL;
> > > > >  }
> > > > >  
> > > > > -- 
> > > > > 2.11.0
> > > > 
> > > > -- 
> > > > Michal Hocko
> > > > SUSE Labs
> > > 
> > 
>
Feng Tang Aug. 2, 2022, 7:39 a.m. UTC | #6
On Tue, Aug 02, 2022 at 02:40:11PM +0800, Muchun Song wrote:
> On Tue, Aug 02, 2022 at 01:52:05PM +0800, Feng Tang wrote:
> > On Tue, Aug 02, 2022 at 11:42:52AM +0800, Muchun Song wrote:
> > > On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote:
> > > > On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote:
> > > > > On Mon 01-08-22 16:42:07, Muchun Song wrote:
> > > > > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy
> > > > > > for filtering nodes for page allocation, which is a hard restriction (see the user
> > > > > > of allowed_mems_nr() in hugetlb.c).  However, MPOL_PREFERRED_MANY is a preferred
> > > > > > mode not a hard restriction.  Now it breaks the user of HugeTLB.  Remove it from
> > > > > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask()
> > > > > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before
> > > > > > calling it.  BTW, it is found by code inspection.
> > > > > 
> > > > > I am not sure this is the right fix. It is quite true that
> > > > > policy_nodemask is a tricky function to use. It pretends to have a
> > > > > higher level logic but all existing users are expected to be policy
> > > > > aware and they special case allocation for each policy. That would mean
> > > > > that hugetlb should do the same.
> > > > 
> > > > Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also
> > > > confused about policy_nodemask(), as it is never a 'strict' one as
> > > > the old code is:
> > > > 
> > > > 	if (unlikely(mode == MPOL_BIND) &&
> > > > 		apply_policy_zone(policy, gfp_zone(gfp)) &&
> > > > 		cpuset_nodemask_valid_mems_allowed(&policy->nodes))
> > > > 		return &policy->nodes;
> > > > 
> > > > 	return NULL
> > > > 
> > > > Even when the MPOL_BIND's nodes is not allowed by cpuset, it will 
> > > > still return NULL (equals all nodes).
> > > >
> > > 
> > > Well, I agree policy_nodemask() is really confusing because of the
> > > shortage of comments and the weird logic.
> > > 
> > > > From the semantics of allowed_mems_nr(), I think it does get changed
> > > > a little by b27abaccf8e8. And to enforce the 'strict' semantic for
> > > > 'allowed', we may need a more strict nodemask API for it.
> > > >
> > > 
> > > Maybe this is a good idea to fix this, e.g. introducing a new helper
> > > to return the strict allowed nodemask.
> > 
> > Yep. 
> > 
> > I had another thought to add one global all-zero nodemask, for API like
> > policy_nodemask(), it has 2 types of return value:
> > * a nodemask with some bits set
> > * NULL (means all nodes)
> > 
> > Here a new type of zero nodemask (a gloabl variable)can be created to
> > indicate no qualified node.
> >
> 
> I know why you want to introduce a gloable zero nidemask. Since we already
> have a glable nodemask array, namely node_states, instead of returning NULL
> for the case of all nodes, how about returing node_states[N_ONLINE] for it?
> And make it return NULL for the case where no nodes are allowed. Any thought?

I think return node_states[N_ONLINE] can simplify the code in allowed_mems_nr(),
the empty zero nodes can simplify further.

Here is some draft patch (not tested) to show the idea

Thanks,
Feng

---
 include/linux/mempolicy.h |  8 ++++++++
 include/linux/nodemask.h  |  7 +++++++
 mm/hugetlb.c              |  7 ++++---
 mm/mempolicy.c            | 17 +++++++++++++++++
 mm/page_alloc.c           |  3 +++
 5 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 668389b4b53d..b5451fef1620 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -150,6 +150,7 @@ extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
 extern bool mempolicy_in_oom_domain(struct task_struct *tsk,
 				const nodemask_t *mask);
 extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy);
+extern nodemask_t *allowed_policy_nodemask(gfp_t gfp, struct mempolicy *policy);
 
 static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
 {
@@ -158,6 +159,13 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
 	return policy_nodemask(gfp, mpol);
 }
 
+static inline nodemask_t *allowed_policy_nodemask_current(gfp_t gfp)
+{
+	struct mempolicy *mpol = get_task_policy(current);
+
+	return allowed_policy_nodemask(gfp, mpol);
+}
+
 extern unsigned int mempolicy_slab_node(void);
 
 extern enum zone_type policy_zone;
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 0f233b76c9ce..dc5fab38e810 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -409,6 +409,13 @@ enum node_states {
 
 extern nodemask_t node_states[NR_NODE_STATES];
 
+extern nodemask_t zero_nodes;
+
+static inline bool is_empty_nodes(nodemask_t *pnodes)
+{
+	 return (pnodes == &zero_nodes || __nodes_empty(pnodes, MAX_NUMNODES));
+}
+
 #if MAX_NUMNODES > 1
 static inline int node_state(int node, enum node_states state)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a57e1be41401..dc9f4ed32909 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4340,10 +4340,11 @@ static unsigned int allowed_mems_nr(struct hstate *h)
 
 	mpol_allowed = policy_nodemask_current(gfp_mask);
 
-	for_each_node_mask(node, cpuset_current_mems_allowed) {
-		if (!mpol_allowed || node_isset(node, *mpol_allowed))
+	if (is_empty_nodes(mpol_allowed))
+		return 0;
+
+	for_each_node_mask(node, mpol_allowed)
 			nr += array[node];
-	}
 
 	return nr;
 }
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d39b01fd52fe..3e936b8ca9ea 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1845,6 +1845,23 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 	return NULL;
 }
 
+/*
+ * Return the allowed nodes mask for a mempolicy and page allocation,
+ * which is a 'stricter' semantic than policy_nodemsk()
+ */
+nodemask_t *allowed_policy_nodemask(gfp_t gfp, struct mempolicy *policy)
+{
+	if (unlikely(policy->mode == MPOL_BIND)) {
+		if (apply_policy_zone(policy, gfp_zone(gfp)) &&
+			cpuset_nodemask_valid_mems_allowed(&policy->nodes))
+			return &policy->nodes;
+		else
+			return &zero_nodes;
+	}
+
+	return NULL;
+}
+
 /*
  * Return the  preferred node id for 'prefer' mempolicy, and return
  * the given id for all other policies.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e008a3df0485..3549ea037588 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -181,6 +181,9 @@ nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
 };
 EXPORT_SYMBOL(node_states);
 
+nodemask_t zero_nodes = NODE_MASK_NONE;
+EXPORT_SYMBOL(zero_nodes);
+
 atomic_long_t _totalram_pages __read_mostly;
 EXPORT_SYMBOL(_totalram_pages);
 unsigned long totalreserve_pages __read_mostly;
Michal Hocko Aug. 2, 2022, 9:02 a.m. UTC | #7
Please make sure to CC Mike on hugetlb related changes.

I didn't really get to grasp your proposed solution but it feels goind
sideways. The real issue is that hugetlb uses a dedicated allocation
scheme which is not fully MPOL_PREFERRED_MANY aware AFAICS. I do not
think we should be tricking that by providing some fake nodemasks and
what not.

The good news is that allocation from the pool is MPOL_PREFERRED_MANY
aware because it first tries to allocation from the preffered node mask
and then fall back to the full nodemask (dequeue_huge_page_vma).
If the existing pools cannot really satisfy that allocation then it
tries to allocate a new hugetlb page (alloc_fresh_huge_page) which also
performs 2 stage allocation with the node mask and no node masks. But
both of them might fail.

The bad news is that other allocation functions - including those that
allocate to the pool are not fully MPOL_PREFERRED_MANY aware. E.g.
__nr_hugepages_store_common paths which use the allocating process
policy to fill up the pool so the pool could be under provisioned if
that context is using MPOL_PREFERRED_MANY.

Wrt. allowed_mems_nr (i.e. hugetlb_acct_memory) this is a reservation
code and I have to admit I do not really remember details there. This is
a subtle code and my best guess would be that policy_nodemask_current
should be hugetlb specific and only care about MPOL_BIND.

On Tue 02-08-22 15:39:52, Feng Tang wrote:
> On Tue, Aug 02, 2022 at 02:40:11PM +0800, Muchun Song wrote:
> > On Tue, Aug 02, 2022 at 01:52:05PM +0800, Feng Tang wrote:
> > > On Tue, Aug 02, 2022 at 11:42:52AM +0800, Muchun Song wrote:
> > > > On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote:
> > > > > On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote:
> > > > > > On Mon 01-08-22 16:42:07, Muchun Song wrote:
> > > > > > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy
> > > > > > > for filtering nodes for page allocation, which is a hard restriction (see the user
> > > > > > > of allowed_mems_nr() in hugetlb.c).  However, MPOL_PREFERRED_MANY is a preferred
> > > > > > > mode not a hard restriction.  Now it breaks the user of HugeTLB.  Remove it from
> > > > > > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask()
> > > > > > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before
> > > > > > > calling it.  BTW, it is found by code inspection.
> > > > > > 
> > > > > > I am not sure this is the right fix. It is quite true that
> > > > > > policy_nodemask is a tricky function to use. It pretends to have a
> > > > > > higher level logic but all existing users are expected to be policy
> > > > > > aware and they special case allocation for each policy. That would mean
> > > > > > that hugetlb should do the same.
> > > > > 
> > > > > Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also
> > > > > confused about policy_nodemask(), as it is never a 'strict' one as
> > > > > the old code is:
> > > > > 
> > > > > 	if (unlikely(mode == MPOL_BIND) &&
> > > > > 		apply_policy_zone(policy, gfp_zone(gfp)) &&
> > > > > 		cpuset_nodemask_valid_mems_allowed(&policy->nodes))
> > > > > 		return &policy->nodes;
> > > > > 
> > > > > 	return NULL
> > > > > 
> > > > > Even when the MPOL_BIND's nodes is not allowed by cpuset, it will 
> > > > > still return NULL (equals all nodes).
> > > > >
> > > > 
> > > > Well, I agree policy_nodemask() is really confusing because of the
> > > > shortage of comments and the weird logic.
> > > > 
> > > > > From the semantics of allowed_mems_nr(), I think it does get changed
> > > > > a little by b27abaccf8e8. And to enforce the 'strict' semantic for
> > > > > 'allowed', we may need a more strict nodemask API for it.
> > > > >
> > > > 
> > > > Maybe this is a good idea to fix this, e.g. introducing a new helper
> > > > to return the strict allowed nodemask.
> > > 
> > > Yep. 
> > > 
> > > I had another thought to add one global all-zero nodemask, for API like
> > > policy_nodemask(), it has 2 types of return value:
> > > * a nodemask with some bits set
> > > * NULL (means all nodes)
> > > 
> > > Here a new type of zero nodemask (a gloabl variable)can be created to
> > > indicate no qualified node.
> > >
> > 
> > I know why you want to introduce a gloable zero nidemask. Since we already
> > have a glable nodemask array, namely node_states, instead of returning NULL
> > for the case of all nodes, how about returing node_states[N_ONLINE] for it?
> > And make it return NULL for the case where no nodes are allowed. Any thought?
> 
> I think return node_states[N_ONLINE] can simplify the code in allowed_mems_nr(),
> the empty zero nodes can simplify further.
> 
> Here is some draft patch (not tested) to show the idea
> 
> Thanks,
> Feng
> 
> ---
>  include/linux/mempolicy.h |  8 ++++++++
>  include/linux/nodemask.h  |  7 +++++++
>  mm/hugetlb.c              |  7 ++++---
>  mm/mempolicy.c            | 17 +++++++++++++++++
>  mm/page_alloc.c           |  3 +++
>  5 files changed, 39 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 668389b4b53d..b5451fef1620 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -150,6 +150,7 @@ extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
>  extern bool mempolicy_in_oom_domain(struct task_struct *tsk,
>  				const nodemask_t *mask);
>  extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy);
> +extern nodemask_t *allowed_policy_nodemask(gfp_t gfp, struct mempolicy *policy);
>  
>  static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
>  {
> @@ -158,6 +159,13 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
>  	return policy_nodemask(gfp, mpol);
>  }
>  
> +static inline nodemask_t *allowed_policy_nodemask_current(gfp_t gfp)
> +{
> +	struct mempolicy *mpol = get_task_policy(current);
> +
> +	return allowed_policy_nodemask(gfp, mpol);
> +}
> +
>  extern unsigned int mempolicy_slab_node(void);
>  
>  extern enum zone_type policy_zone;
> diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> index 0f233b76c9ce..dc5fab38e810 100644
> --- a/include/linux/nodemask.h
> +++ b/include/linux/nodemask.h
> @@ -409,6 +409,13 @@ enum node_states {
>  
>  extern nodemask_t node_states[NR_NODE_STATES];
>  
> +extern nodemask_t zero_nodes;
> +
> +static inline bool is_empty_nodes(nodemask_t *pnodes)
> +{
> +	 return (pnodes == &zero_nodes || __nodes_empty(pnodes, MAX_NUMNODES));
> +}
> +
>  #if MAX_NUMNODES > 1
>  static inline int node_state(int node, enum node_states state)
>  {
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a57e1be41401..dc9f4ed32909 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4340,10 +4340,11 @@ static unsigned int allowed_mems_nr(struct hstate *h)
>  
>  	mpol_allowed = policy_nodemask_current(gfp_mask);
>  
> -	for_each_node_mask(node, cpuset_current_mems_allowed) {
> -		if (!mpol_allowed || node_isset(node, *mpol_allowed))
> +	if (is_empty_nodes(mpol_allowed))
> +		return 0;
> +
> +	for_each_node_mask(node, mpol_allowed)
>  			nr += array[node];
> -	}
>  
>  	return nr;
>  }
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index d39b01fd52fe..3e936b8ca9ea 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1845,6 +1845,23 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
>  	return NULL;
>  }
>  
> +/*
> + * Return the allowed nodes mask for a mempolicy and page allocation,
> + * which is a 'stricter' semantic than policy_nodemsk()
> + */
> +nodemask_t *allowed_policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> +{
> +	if (unlikely(policy->mode == MPOL_BIND)) {
> +		if (apply_policy_zone(policy, gfp_zone(gfp)) &&
> +			cpuset_nodemask_valid_mems_allowed(&policy->nodes))
> +			return &policy->nodes;
> +		else
> +			return &zero_nodes;
> +	}
> +
> +	return NULL;
> +}
> +
>  /*
>   * Return the  preferred node id for 'prefer' mempolicy, and return
>   * the given id for all other policies.
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e008a3df0485..3549ea037588 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -181,6 +181,9 @@ nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
>  };
>  EXPORT_SYMBOL(node_states);
>  
> +nodemask_t zero_nodes = NODE_MASK_NONE;
> +EXPORT_SYMBOL(zero_nodes);
> +
>  atomic_long_t _totalram_pages __read_mostly;
>  EXPORT_SYMBOL(_totalram_pages);
>  unsigned long totalreserve_pages __read_mostly;
> -- 
> 2.27.0
>
Feng Tang Aug. 3, 2022, 6:41 a.m. UTC | #8
On Tue, Aug 02, 2022 at 05:02:37PM +0800, Michal Hocko wrote:
> Please make sure to CC Mike on hugetlb related changes.

OK.

> I didn't really get to grasp your proposed solution but it feels goind
> sideways. The real issue is that hugetlb uses a dedicated allocation
> scheme which is not fully MPOL_PREFERRED_MANY aware AFAICS. I do not
> think we should be tricking that by providing some fake nodemasks and
> what not.
> 
> The good news is that allocation from the pool is MPOL_PREFERRED_MANY
> aware because it first tries to allocation from the preffered node mask
> and then fall back to the full nodemask (dequeue_huge_page_vma).
> If the existing pools cannot really satisfy that allocation then it
> tries to allocate a new hugetlb page (alloc_fresh_huge_page) which also
> performs 2 stage allocation with the node mask and no node masks. But
> both of them might fail.
> 
> The bad news is that other allocation functions - including those that
> allocate to the pool are not fully MPOL_PREFERRED_MANY aware. E.g.
> __nr_hugepages_store_common paths which use the allocating process
> policy to fill up the pool so the pool could be under provisioned if
> that context is using MPOL_PREFERRED_MANY.

Thanks for the check!

So you mean if the prferred nodes don't have enough pages, we should
also fallback to all like dequeue_huge_page_vma() does?

Or we can user a policy API which return nodemask for MPOL_BIND and 
NULL for all other policies, like allowed_mems_nr() needs.

--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -158,6 +158,18 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
 	return policy_nodemask(gfp, mpol);
 }
 
+#ifdef CONFIG_HUGETLB_FS
+static inline nodemask_t *strict_policy_nodemask_current(void)
+{
+	struct mempolicy *mpol = get_task_policy(current);
+
+	if (mpol->mode == MPOL_BIND)
+		return &mpol->nodes;
+
+	return NULL;
+}
+#endif
+

> Wrt. allowed_mems_nr (i.e. hugetlb_acct_memory) this is a reservation
> code and I have to admit I do not really remember details there. This is
> a subtle code and my best guess would be that policy_nodemask_current
> should be hugetlb specific and only care about MPOL_BIND.

The API needed by allowed_mem_nr() is a little different as it has gfp
flag and cpuset config to consider.

Thanks,
Feng

[snip]
Michal Hocko Aug. 3, 2022, 7:36 a.m. UTC | #9
On Wed 03-08-22 14:41:20, Feng Tang wrote:
> On Tue, Aug 02, 2022 at 05:02:37PM +0800, Michal Hocko wrote:
> > Please make sure to CC Mike on hugetlb related changes.
> 
> OK.
> 
> > I didn't really get to grasp your proposed solution but it feels goind
> > sideways. The real issue is that hugetlb uses a dedicated allocation
> > scheme which is not fully MPOL_PREFERRED_MANY aware AFAICS. I do not
> > think we should be tricking that by providing some fake nodemasks and
> > what not.
> > 
> > The good news is that allocation from the pool is MPOL_PREFERRED_MANY
> > aware because it first tries to allocation from the preffered node mask
> > and then fall back to the full nodemask (dequeue_huge_page_vma).
> > If the existing pools cannot really satisfy that allocation then it
> > tries to allocate a new hugetlb page (alloc_fresh_huge_page) which also
> > performs 2 stage allocation with the node mask and no node masks. But
> > both of them might fail.
> > 
> > The bad news is that other allocation functions - including those that
> > allocate to the pool are not fully MPOL_PREFERRED_MANY aware. E.g.
> > __nr_hugepages_store_common paths which use the allocating process
> > policy to fill up the pool so the pool could be under provisioned if
> > that context is using MPOL_PREFERRED_MANY.
> 
> Thanks for the check!
> 
> So you mean if the prferred nodes don't have enough pages, we should
> also fallback to all like dequeue_huge_page_vma() does?
> 
> Or we can user a policy API which return nodemask for MPOL_BIND and 
> NULL for all other policies, like allowed_mems_nr() needs.
> 
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -158,6 +158,18 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
>  	return policy_nodemask(gfp, mpol);
>  }
>  
> +#ifdef CONFIG_HUGETLB_FS
> +static inline nodemask_t *strict_policy_nodemask_current(void)
> +{
> +	struct mempolicy *mpol = get_task_policy(current);
> +
> +	if (mpol->mode == MPOL_BIND)
> +		return &mpol->nodes;
> +
> +	return NULL;
> +}
> +#endif

Yes something like this, except that I would also move this into hugetlb
proper because this doesn't seem generally useful.
 
> > Wrt. allowed_mems_nr (i.e. hugetlb_acct_memory) this is a reservation
> > code and I have to admit I do not really remember details there. This is
> > a subtle code and my best guess would be that policy_nodemask_current
> > should be hugetlb specific and only care about MPOL_BIND.
> 
> The API needed by allowed_mem_nr() is a little different as it has gfp
> flag and cpuset config to consider.

Why would gfp mask matter?
Michal Hocko Aug. 3, 2022, 11:28 a.m. UTC | #10
On Thu 04-08-22 01:14:32, Feng Tang wrote:
[...]
> Ok, I change it as below:

Wouldn't it be better to make this allowed_mems_nr specific to be
explicit about the intention?

Not that I feel strongly about that.

> ---
>  mm/hugetlb.c | 28 +++++++++++++++++++++++-----
>  1 file changed, 23 insertions(+), 5 deletions(-)

Not even compile tested
 include/linux/mempolicy.h | 12 ------------
 mm/hugetlb.c              | 24 ++++++++++++++++++++----
 2 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 668389b4b53d..e38b0ef20b8b 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -151,13 +151,6 @@ extern bool mempolicy_in_oom_domain(struct task_struct *tsk,
 				const nodemask_t *mask);
 extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy);
 
-static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
-{
-	struct mempolicy *mpol = get_task_policy(current);
-
-	return policy_nodemask(gfp, mpol);
-}
-
 extern unsigned int mempolicy_slab_node(void);
 
 extern enum zone_type policy_zone;
@@ -294,11 +287,6 @@ static inline void mpol_put_task_policy(struct task_struct *task)
 {
 }
 
-static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
-{
-	return NULL;
-}
-
 static inline bool mpol_is_preferred_many(struct mempolicy *pol)
 {
 	return  false;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a18c071c294e..6cacbc9b15a1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4330,18 +4330,34 @@ static int __init default_hugepagesz_setup(char *s)
 }
 __setup("default_hugepagesz=", default_hugepagesz_setup);
 
+struct mempolicy *policy_mbind_nodemask(gfp_t gfp)
+{
+#ifdef CONFIG_MEMPOLICY
+	struct mempolicy *mpol = get_task_policy(current);
+
+	/*
+	 * only enforce MBIND which overlaps with cpuset policy (from policy_nodemask)
+	 * specifically for hugetlb case
+	 */
+	if (mpol->mode == MPOL_BIND &&
+		(apply_policy_zone(mpol, gfp_zone(gfp)) &&
+		 cpuset_nodemask_valid_mems_allowed(&policy->nodes))
+		return &mpol->nodes;
+#endif
+	return NULL;
+}
+
 static unsigned int allowed_mems_nr(struct hstate *h)
 {
 	int node;
 	unsigned int nr = 0;
-	nodemask_t *mpol_allowed;
+	nodemask_t *mbind_nodemask;
 	unsigned int *array = h->free_huge_pages_node;
 	gfp_t gfp_mask = htlb_alloc_mask(h);
 
-	mpol_allowed = policy_nodemask_current(gfp_mask);
-
+	mbind_nodemask = policy_mbind_nodemask(gfp_mask);
 	for_each_node_mask(node, cpuset_current_mems_allowed) {
-		if (!mpol_allowed || node_isset(node, *mpol_allowed))
+		if (!mbind_nodemask || node_isset(node, *mbind_nodemask))
 			nr += array[node];
 	}
Michal Hocko Aug. 3, 2022, 12:56 p.m. UTC | #11
On Thu 04-08-22 04:43:06, Feng Tang wrote:
> On Wed, Aug 03, 2022 at 07:28:59PM +0800, Michal Hocko wrote:
[...]
> > +struct mempolicy *policy_mbind_nodemask(gfp_t gfp)
> > +{
> > +#ifdef CONFIG_MEMPOLICY
> > +	struct mempolicy *mpol = get_task_policy(current);
> > +
> > +	/*
> > +	 * only enforce MBIND which overlaps with cpuset policy (from policy_nodemask)
> > +	 * specifically for hugetlb case
> > +	 */
> > +	if (mpol->mode == MPOL_BIND &&
> > +		(apply_policy_zone(mpol, gfp_zone(gfp)) &&
> > +		 cpuset_nodemask_valid_mems_allowed(&policy->nodes))
> > +		return &mpol->nodes;
> > +#endif
> > +	return NULL;
> 
> I saw the logic is not changed, and it confused me that if there is
> no qualified node, it will still return NULL which effectively equals
> node_states[N_MEMORY], while I think it should return a all zero
> nodemasks.

This is a separate thing and I have to admit that the existing code is
rather non-intuitive or even broken. I guess we do not care all that
much because MBIND with completely non-overlapping cpusets is just a
broken configuration. I am not sure this case is interesting or even
supported.
Michal Hocko Aug. 3, 2022, 1:21 p.m. UTC | #12
On Thu 04-08-22 05:08:34, Feng Tang wrote:
[...]
> Do we still need the other nodemask API I proposed earlier which has
> no parameter of gfp_flag, and used for __nr_hugepages_store_common?

I would touch as little code as possible.
Feng Tang Aug. 3, 2022, 5:14 p.m. UTC | #13
On Wed, Aug 03, 2022 at 03:36:41PM +0800, Michal Hocko wrote:
> On Wed 03-08-22 14:41:20, Feng Tang wrote:
> > On Tue, Aug 02, 2022 at 05:02:37PM +0800, Michal Hocko wrote:
> > > Please make sure to CC Mike on hugetlb related changes.
> > 
> > OK.
> > 
> > > I didn't really get to grasp your proposed solution but it feels goind
> > > sideways. The real issue is that hugetlb uses a dedicated allocation
> > > scheme which is not fully MPOL_PREFERRED_MANY aware AFAICS. I do not
> > > think we should be tricking that by providing some fake nodemasks and
> > > what not.
> > > 
> > > The good news is that allocation from the pool is MPOL_PREFERRED_MANY
> > > aware because it first tries to allocation from the preffered node mask
> > > and then fall back to the full nodemask (dequeue_huge_page_vma).
> > > If the existing pools cannot really satisfy that allocation then it
> > > tries to allocate a new hugetlb page (alloc_fresh_huge_page) which also
> > > performs 2 stage allocation with the node mask and no node masks. But
> > > both of them might fail.
> > > 
> > > The bad news is that other allocation functions - including those that
> > > allocate to the pool are not fully MPOL_PREFERRED_MANY aware. E.g.
> > > __nr_hugepages_store_common paths which use the allocating process
> > > policy to fill up the pool so the pool could be under provisioned if
> > > that context is using MPOL_PREFERRED_MANY.
> > 
> > Thanks for the check!
> > 
> > So you mean if the prferred nodes don't have enough pages, we should
> > also fallback to all like dequeue_huge_page_vma() does?
> > 
> > Or we can user a policy API which return nodemask for MPOL_BIND and 
> > NULL for all other policies, like allowed_mems_nr() needs.
> > 
> > --- a/include/linux/mempolicy.h
> > +++ b/include/linux/mempolicy.h
> > @@ -158,6 +158,18 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
> >  	return policy_nodemask(gfp, mpol);
> >  }
> >  
> > +#ifdef CONFIG_HUGETLB_FS
> > +static inline nodemask_t *strict_policy_nodemask_current(void)
> > +{
> > +	struct mempolicy *mpol = get_task_policy(current);
> > +
> > +	if (mpol->mode == MPOL_BIND)
> > +		return &mpol->nodes;
> > +
> > +	return NULL;
> > +}
> > +#endif
> 
> Yes something like this, except that I would also move this into hugetlb
> proper because this doesn't seem generally useful.
  

Ok, I change it as below:

---
 mm/hugetlb.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 14be38822cf8..ef1d4ffa733f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -91,6 +91,24 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
 
+/*
+ * Return nodemask of what is allowed by current process' memory
+ * policy, as MPOL_BIND is the only 'strict' policy, return NULL
+ * for all other policies
+ */
+static inline nodemask_t *allowed_policy_nodemask_current(void)
+{
+#ifdef CONFIG_NUMA
+	struct mempolicy *mpol = get_task_policy(current);
+
+	if (mpol->mode == MPOL_BIND)
+		return &mpol->nodes;
+	return NULL;
+#else
+	return NULL;
+#endif
+}
+
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
 {
 	if (spool->count)
@@ -3556,7 +3574,7 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
 					   unsigned long count, size_t len)
 {
 	int err;
-	nodemask_t nodes_allowed, *n_mask;
+	nodemask_t nodes_allowed, *n_mask = NULL;
 
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return -EINVAL;
@@ -3565,11 +3583,11 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
 		/*
 		 * global hstate attribute
 		 */
-		if (!(obey_mempolicy &&
-				init_nodemask_of_mempolicy(&nodes_allowed)))
+		if (obey_mempolicy)
+			n_mask = allowed_policy_nodemask_current();
+
+		if (!n_mask)
 			n_mask = &node_states[N_MEMORY];
-		else
-			n_mask = &nodes_allowed;
 	} else {
 		/*
 		 * Node specific request.  count adjustment happens in
Feng Tang Aug. 3, 2022, 8:43 p.m. UTC | #14
On Wed, Aug 03, 2022 at 07:28:59PM +0800, Michal Hocko wrote:
> On Thu 04-08-22 01:14:32, Feng Tang wrote:
> [...]
> > Ok, I change it as below:
> 
> Wouldn't it be better to make this allowed_mems_nr specific to be
> explicit about the intention?
 
Yes, it is.

> Not that I feel strongly about that.
> 
> > ---
> >  mm/hugetlb.c | 28 +++++++++++++++++++++++-----
> >  1 file changed, 23 insertions(+), 5 deletions(-)
> 
> Not even compile tested
>  include/linux/mempolicy.h | 12 ------------
>  mm/hugetlb.c              | 24 ++++++++++++++++++++----
>  2 files changed, 20 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 668389b4b53d..e38b0ef20b8b 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -151,13 +151,6 @@ extern bool mempolicy_in_oom_domain(struct task_struct *tsk,
>  				const nodemask_t *mask);
>  extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy);
>  
> -static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
> -{
> -	struct mempolicy *mpol = get_task_policy(current);
> -
> -	return policy_nodemask(gfp, mpol);
> -}
> -
>  extern unsigned int mempolicy_slab_node(void);
>  
>  extern enum zone_type policy_zone;
> @@ -294,11 +287,6 @@ static inline void mpol_put_task_policy(struct task_struct *task)
>  {
>  }
>  
> -static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
> -{
> -	return NULL;
> -}
> -
>  static inline bool mpol_is_preferred_many(struct mempolicy *pol)
>  {
>  	return  false;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a18c071c294e..6cacbc9b15a1 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4330,18 +4330,34 @@ static int __init default_hugepagesz_setup(char *s)
>  }
>  __setup("default_hugepagesz=", default_hugepagesz_setup);
>  
> +struct mempolicy *policy_mbind_nodemask(gfp_t gfp)
> +{
> +#ifdef CONFIG_MEMPOLICY
> +	struct mempolicy *mpol = get_task_policy(current);
> +
> +	/*
> +	 * only enforce MBIND which overlaps with cpuset policy (from policy_nodemask)
> +	 * specifically for hugetlb case
> +	 */
> +	if (mpol->mode == MPOL_BIND &&
> +		(apply_policy_zone(mpol, gfp_zone(gfp)) &&
> +		 cpuset_nodemask_valid_mems_allowed(&policy->nodes))
> +		return &mpol->nodes;
> +#endif
> +	return NULL;

I saw the logic is not changed, and it confused me that if there is
no qualified node, it will still return NULL which effectively equals
node_states[N_MEMORY], while I think it should return a all zero
nodemasks.

Thanks,
Feng

> +}
> +
>  static unsigned int allowed_mems_nr(struct hstate *h)
>  {
>  	int node;
>  	unsigned int nr = 0;
> -	nodemask_t *mpol_allowed;
> +	nodemask_t *mbind_nodemask;
>  	unsigned int *array = h->free_huge_pages_node;
>  	gfp_t gfp_mask = htlb_alloc_mask(h);
>  
> -	mpol_allowed = policy_nodemask_current(gfp_mask);
> -
> +	mbind_nodemask = policy_mbind_nodemask(gfp_mask);
>  	for_each_node_mask(node, cpuset_current_mems_allowed) {
> -		if (!mpol_allowed || node_isset(node, *mpol_allowed))
> +		if (!mbind_nodemask || node_isset(node, *mbind_nodemask))
>  			nr += array[node];
>  	}
>  
> -- 
> Michal Hocko
> SUSE Labs
Feng Tang Aug. 3, 2022, 9:08 p.m. UTC | #15
On Wed, Aug 03, 2022 at 08:56:44PM +0800, Michal Hocko wrote:
> On Thu 04-08-22 04:43:06, Feng Tang wrote:
> > On Wed, Aug 03, 2022 at 07:28:59PM +0800, Michal Hocko wrote:
> [...]
> > > +struct mempolicy *policy_mbind_nodemask(gfp_t gfp)
> > > +{
> > > +#ifdef CONFIG_MEMPOLICY
> > > +	struct mempolicy *mpol = get_task_policy(current);
> > > +
> > > +	/*
> > > +	 * only enforce MBIND which overlaps with cpuset policy (from policy_nodemask)
> > > +	 * specifically for hugetlb case
> > > +	 */
> > > +	if (mpol->mode == MPOL_BIND &&
> > > +		(apply_policy_zone(mpol, gfp_zone(gfp)) &&
> > > +		 cpuset_nodemask_valid_mems_allowed(&policy->nodes))
> > > +		return &mpol->nodes;
> > > +#endif
> > > +	return NULL;
> > 
> > I saw the logic is not changed, and it confused me that if there is
> > no qualified node, it will still return NULL which effectively equals
> > node_states[N_MEMORY], while I think it should return a all zero
> > nodemasks.
> 
> This is a separate thing and I have to admit that the existing code is
> rather non-intuitive or even broken. I guess we do not care all that
> much because MBIND with completely non-overlapping cpusets is just a
> broken configuration. I am not sure this case is interesting or even
> supported.

Fair enough, and moving the policy_mbind_nodemask() into hugetlb.c for
one single caller make it much less severe.

Do we still need the other nodemask API I proposed earlier which has
no parameter of gfp_flag, and used for __nr_hugepages_store_common?

Thanks,
Feng


> -- 
> Michal Hocko
> SUSE Labs
>
Feng Tang Aug. 4, 2022, 8:27 a.m. UTC | #16
On Wed, Aug 03, 2022 at 09:21:22PM +0800, Michal Hocko wrote:
> On Thu 04-08-22 05:08:34, Feng Tang wrote:
> [...]
> > Do we still need the other nodemask API I proposed earlier which has
> > no parameter of gfp_flag, and used for __nr_hugepages_store_common?
> 
> I would touch as little code as possible.

OK.

Please review the following patch, thanks! - Feng

---
From a2db9a57da616bb3ea21e48a4a9ceb5c2cf4f7a2 Mon Sep 17 00:00:00 2001
From: Feng Tang <feng.tang@intel.com>
Date: Thu, 4 Aug 2022 09:39:24 +0800
Subject: [PATCH RFC] mm/hugetlb: add dedicated func to get 'allowed' nodemask for
 current process

Muchun Song found that after MPOL_PREFERRED_MANY policy was introduced
in commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes")
[1], the policy_nodemask_current()'s semantics for this new policy
has been changed, which returns 'preferred' nodes instead of 'allowed'
nodes, and could hurt the usage of its caller in hugetlb:
allowed_mems_nr().

Michal found the policy_nodemask_current() is only used by hugetlb,
and suggested to move it to hugetlb code with more explicit name to
enforce the 'allowed' semantics for which only MPOL_BIND policy
matters.

One note for the new policy_mbind_nodemask() is, the cross check
from MPOL_BIND, gfp flags and cpuset configuration can lead to
a no available node case, which is considered to be broken
configuration and 'NULL' (equals all nodes) is returned.

[1]. https://lore.kernel.org/lkml/20220801084207.39086-1-songmuchun@bytedance.com/t/
Reported-by: Muchun Song <songmuchun@bytedance.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 include/linux/mempolicy.h | 32 ++++++++++++++++++++------------
 mm/hugetlb.c              | 24 ++++++++++++++++++++----
 mm/mempolicy.c            | 20 --------------------
 3 files changed, 40 insertions(+), 36 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 668389b4b53d..ea0168fffb4c 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -151,13 +151,6 @@ extern bool mempolicy_in_oom_domain(struct task_struct *tsk,
 				const nodemask_t *mask);
 extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy);
 
-static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
-{
-	struct mempolicy *mpol = get_task_policy(current);
-
-	return policy_nodemask(gfp, mpol);
-}
-
 extern unsigned int mempolicy_slab_node(void);
 
 extern enum zone_type policy_zone;
@@ -189,6 +182,26 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
 	return  (pol->mode == MPOL_PREFERRED_MANY);
 }
 
+static inline int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
+{
+	enum zone_type dynamic_policy_zone = policy_zone;
+
+	BUG_ON(dynamic_policy_zone == ZONE_MOVABLE);
+
+	/*
+	 * if policy->nodes has movable memory only,
+	 * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only.
+	 *
+	 * policy->nodes is intersect with node_states[N_MEMORY].
+	 * so if the following test fails, it implies
+	 * policy->nodes has movable memory only.
+	 */
+	if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY]))
+		dynamic_policy_zone = ZONE_MOVABLE;
+
+	return zone >= dynamic_policy_zone;
+}
+
 
 #else
 
@@ -294,11 +307,6 @@ static inline void mpol_put_task_policy(struct task_struct *task)
 {
 }
 
-static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
-{
-	return NULL;
-}
-
 static inline bool mpol_is_preferred_many(struct mempolicy *pol)
 {
 	return  false;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a18c071c294e..ad84bb85b6de 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4330,18 +4330,34 @@ static int __init default_hugepagesz_setup(char *s)
 }
 __setup("default_hugepagesz=", default_hugepagesz_setup);
 
+static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
+{
+#ifdef CONFIG_NUMA
+	struct mempolicy *mpol = get_task_policy(current);
+
+	/*
+	 * Only enforce MPOL_BIND policy which overlaps with cpuset policy
+	 * (from policy_nodemask) specifically for hugetlb case
+	 */
+	if (mpol->mode == MPOL_BIND &&
+		(apply_policy_zone(mpol, gfp_zone(gfp)) &&
+		 cpuset_nodemask_valid_mems_allowed(&mpol->nodes)))
+		return &mpol->nodes;
+#endif
+	return NULL;
+}
+
 static unsigned int allowed_mems_nr(struct hstate *h)
 {
 	int node;
 	unsigned int nr = 0;
-	nodemask_t *mpol_allowed;
+	nodemask_t *mbind_nodemask;
 	unsigned int *array = h->free_huge_pages_node;
 	gfp_t gfp_mask = htlb_alloc_mask(h);
 
-	mpol_allowed = policy_nodemask_current(gfp_mask);
-
+	mbind_nodemask = policy_mbind_nodemask(gfp_mask);
 	for_each_node_mask(node, cpuset_current_mems_allowed) {
-		if (!mpol_allowed || node_isset(node, *mpol_allowed))
+		if (!mbind_nodemask || node_isset(node, *mbind_nodemask))
 			nr += array[node];
 	}
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d39b01fd52fe..5553bd53927f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1805,26 +1805,6 @@ bool vma_policy_mof(struct vm_area_struct *vma)
 	return pol->flags & MPOL_F_MOF;
 }
 
-static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
-{
-	enum zone_type dynamic_policy_zone = policy_zone;
-
-	BUG_ON(dynamic_policy_zone == ZONE_MOVABLE);
-
-	/*
-	 * if policy->nodes has movable memory only,
-	 * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only.
-	 *
-	 * policy->nodes is intersect with node_states[N_MEMORY].
-	 * so if the following test fails, it implies
-	 * policy->nodes has movable memory only.
-	 */
-	if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY]))
-		dynamic_policy_zone = ZONE_MOVABLE;
-
-	return zone >= dynamic_policy_zone;
-}
-
 /*
  * Return a nodemask representing a mempolicy for filtering nodes for
  * page allocation
Michal Hocko Aug. 4, 2022, 10:43 a.m. UTC | #17
On Thu 04-08-22 16:27:24, Feng Tang wrote:
[...]
> >From a2db9a57da616bb3ea21e48a4a9ceb5c2cf4f7a2 Mon Sep 17 00:00:00 2001
> From: Feng Tang <feng.tang@intel.com>
> Date: Thu, 4 Aug 2022 09:39:24 +0800
> Subject: [PATCH RFC] mm/hugetlb: add dedicated func to get 'allowed' nodemask for
>  current process
> 
> Muchun Song found that after MPOL_PREFERRED_MANY policy was introduced
> in commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes")
> [1], the policy_nodemask_current()'s semantics for this new policy
> has been changed, which returns 'preferred' nodes instead of 'allowed'
> nodes, and could hurt the usage of its caller in hugetlb:
> allowed_mems_nr().
> 
> Michal found the policy_nodemask_current() is only used by hugetlb,
> and suggested to move it to hugetlb code with more explicit name to
> enforce the 'allowed' semantics for which only MPOL_BIND policy
> matters.
> 
> One note for the new policy_mbind_nodemask() is, the cross check
> from MPOL_BIND, gfp flags and cpuset configuration can lead to
> a no available node case, which is considered to be broken
> configuration and 'NULL' (equals all nodes) is returned.
> 
> [1]. https://lore.kernel.org/lkml/20220801084207.39086-1-songmuchun@bytedance.com/t/
> Reported-by: Muchun Song <songmuchun@bytedance.com>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Feng Tang <feng.tang@intel.com>

LGTM I would just make apply_policy_zone extern rather than making it
static inline in a header which can turn out to cause other header
dependencies.

Thanks!
diff mbox series

Patch

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 6c27acb6cd63..4deec7e598c6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1845,9 +1845,6 @@  nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 		cpuset_nodemask_valid_mems_allowed(&policy->nodes))
 		return &policy->nodes;
 
-	if (mode == MPOL_PREFERRED_MANY)
-		return &policy->nodes;
-
 	return NULL;
 }