
[v3,RFC,14/14] mm: speedup page alloc for MPOL_PREFERRED_MANY by adding a NO_SLOWPATH gfp bit

Message ID 1614766858-90344-15-git-send-email-feng.tang@intel.com (mailing list archive)
State New, archived
Series Introduced multi-preference mempolicy

Commit Message

Feng Tang March 3, 2021, 10:20 a.m. UTC
When doing broader testing, we noticed allocation slowness in one test
case that mallocs memory with a size slightly bigger than the free
memory of the targeted nodes, but much less than the total free memory
of the system.

The reason is that the code enters the slowpath of __alloc_pages_nodemask(),
which takes quite some time. Since alloc_pages_policy() will give it a 2nd
try with a NULL nodemask, there is no need to enter the slowpath on the
first try. Add a new gfp bit to skip the slowpath, so that use cases like
this can leverage it.

With it, the malloc in such a case is much accelerated, as it never
enters the slowpath.

Adding a new gfp_mask bit is generally disliked; another idea is to add
a second nodemask to struct 'alloc_context', so it has two:
'preferred-nmask' and 'fallback-nmask', which would be tried in turn
when not NULL. With that, we could call __alloc_pages_nodemask() only once.

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 include/linux/gfp.h | 9 +++++++--
 mm/mempolicy.c      | 2 +-
 mm/page_alloc.c     | 2 +-
 3 files changed, 9 insertions(+), 4 deletions(-)
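
For reference, the two-nodemask alternative mentioned in the last paragraph
of the commit message could look roughly like the sketch below. This is an
illustration only, not code from the series: the field names are taken from
the commit message, and the retry shape is an assumption.

	/*
	 * Sketch only. 'preferred_nmask'/'fallback_nmask' are the names
	 * suggested in the commit message; everything else is simplified.
	 */
	struct alloc_context {
		struct zonelist *zonelist;
		nodemask_t *preferred_nmask;	/* tried first when non-NULL */
		nodemask_t *fallback_nmask;	/* tried next when non-NULL */
		/* ... remaining existing fields ... */
	};

	/* conceptually, inside a single __alloc_pages_nodemask() call: */
	ac.nodemask = ac.preferred_nmask ?: ac.fallback_nmask;
	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
	if (!page && ac.nodemask == ac.preferred_nmask && ac.fallback_nmask) {
		/* the preferred mask failed in the fastpath; widen and retry */
		ac.nodemask = ac.fallback_nmask;
		page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
	}
	if (!page)
		page = __alloc_pages_slowpath(alloc_mask, order, &ac);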

Comments

Michal Hocko March 3, 2021, 11:39 a.m. UTC | #1
On Wed 03-03-21 18:20:58, Feng Tang wrote:
> When doing broader testing, we noticed allocation slowness in one test
> case that mallocs memory with a size slightly bigger than the free
> memory of the targeted nodes, but much less than the total free memory
> of the system.
> 
> The reason is that the code enters the slowpath of __alloc_pages_nodemask(),
> which takes quite some time. Since alloc_pages_policy() will give it a 2nd
> try with a NULL nodemask, there is no need to enter the slowpath on the
> first try. Add a new gfp bit to skip the slowpath, so that use cases like
> this can leverage it.
> 
> With it, the malloc in such a case is much accelerated, as it never
> enters the slowpath.
> 
> Adding a new gfp_mask bit is generally disliked; another idea is to add
> a second nodemask to struct 'alloc_context', so it has two:
> 'preferred-nmask' and 'fallback-nmask', which would be tried in turn
> when not NULL. With that, we could call __alloc_pages_nodemask() only once.

Yes, it is very much disliked. Is there any reason why you cannot use
GFP_NOWAIT for that purpose?
Feng Tang March 3, 2021, 12:07 p.m. UTC | #2
Hi Michal,

On Wed, Mar 03, 2021 at 12:39:57PM +0100, Michal Hocko wrote:
> On Wed 03-03-21 18:20:58, Feng Tang wrote:
> > When doing broader testing, we noticed allocation slowness in one test
> > case that mallocs memory with a size slightly bigger than the free
> > memory of the targeted nodes, but much less than the total free memory
> > of the system.
> > 
> > The reason is that the code enters the slowpath of __alloc_pages_nodemask(),
> > which takes quite some time. Since alloc_pages_policy() will give it a 2nd
> > try with a NULL nodemask, there is no need to enter the slowpath on the
> > first try. Add a new gfp bit to skip the slowpath, so that use cases like
> > this can leverage it.
> > 
> > With it, the malloc in such a case is much accelerated, as it never
> > enters the slowpath.
> > 
> > Adding a new gfp_mask bit is generally disliked; another idea is to add
> > a second nodemask to struct 'alloc_context', so it has two:
> > 'preferred-nmask' and 'fallback-nmask', which would be tried in turn
> > when not NULL. With that, we could call __alloc_pages_nodemask() only once.
> 
> Yes, it is very much disliked. Is there any reason why you cannot use
> GFP_NOWAIT for that purpose?

I did try that at first, but it didn't noticeably change the slowness.
I assumed direct reclaim was still involved, as GFP_NOWAIT only affects
kswapd reclaim.

Thanks,
Feng


> -- 
> Michal Hocko
> SUSE Labs
Feng Tang March 3, 2021, 12:18 p.m. UTC | #3
On Wed, Mar 03, 2021 at 08:07:17PM +0800, Tang, Feng wrote:
> Hi Michal,
> 
> On Wed, Mar 03, 2021 at 12:39:57PM +0100, Michal Hocko wrote:
> > On Wed 03-03-21 18:20:58, Feng Tang wrote:
> > > When doing broader testing, we noticed allocation slowness in one test
> > > case that mallocs memory with a size slightly bigger than the free
> > > memory of the targeted nodes, but much less than the total free memory
> > > of the system.
> > > 
> > > The reason is that the code enters the slowpath of __alloc_pages_nodemask(),
> > > which takes quite some time. Since alloc_pages_policy() will give it a 2nd
> > > try with a NULL nodemask, there is no need to enter the slowpath on the
> > > first try. Add a new gfp bit to skip the slowpath, so that use cases like
> > > this can leverage it.
> > > 
> > > With it, the malloc in such a case is much accelerated, as it never
> > > enters the slowpath.
> > > 
> > > Adding a new gfp_mask bit is generally disliked; another idea is to add
> > > a second nodemask to struct 'alloc_context', so it has two:
> > > 'preferred-nmask' and 'fallback-nmask', which would be tried in turn
> > > when not NULL. With that, we could call __alloc_pages_nodemask() only once.
> > 
> > Yes, it is very much disliked. Is there any reason why you cannot use
> > GFP_NOWAIT for that purpose?
> 
> I did try that at first, but it didn't noticeably change the slowness.
> I assumed direct reclaim was still involved, as GFP_NOWAIT only affects
> kswapd reclaim.

One thing I tried which can fix the slowness is:

+	gfp_mask &= ~(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM);

which explicitly clears the 2 kinds of reclaim. And I thought it's too
hacky and didn't mention it in the commit log.

Thanks,
Feng
Michal Hocko March 3, 2021, 12:32 p.m. UTC | #4
On Wed 03-03-21 20:18:33, Feng Tang wrote:
> On Wed, Mar 03, 2021 at 08:07:17PM +0800, Tang, Feng wrote:
> > Hi Michal,
> > 
> > On Wed, Mar 03, 2021 at 12:39:57PM +0100, Michal Hocko wrote:
> > > On Wed 03-03-21 18:20:58, Feng Tang wrote:
> > > > When doing broader testing, we noticed allocation slowness in one test
> > > > case that mallocs memory with a size slightly bigger than the free
> > > > memory of the targeted nodes, but much less than the total free memory
> > > > of the system.
> > > > 
> > > > The reason is that the code enters the slowpath of __alloc_pages_nodemask(),
> > > > which takes quite some time. Since alloc_pages_policy() will give it a 2nd
> > > > try with a NULL nodemask, there is no need to enter the slowpath on the
> > > > first try. Add a new gfp bit to skip the slowpath, so that use cases like
> > > > this can leverage it.
> > > > 
> > > > With it, the malloc in such a case is much accelerated, as it never
> > > > enters the slowpath.
> > > > 
> > > > Adding a new gfp_mask bit is generally disliked; another idea is to add
> > > > a second nodemask to struct 'alloc_context', so it has two:
> > > > 'preferred-nmask' and 'fallback-nmask', which would be tried in turn
> > > > when not NULL. With that, we could call __alloc_pages_nodemask() only once.
> > > 
> > > Yes, it is very much disliked. Is there any reason why you cannot use
> > > GFP_NOWAIT for that purpose?
> > 
> > I did try that at first, but it didn't noticeably change the slowness.
> > I assumed direct reclaim was still involved, as GFP_NOWAIT only affects
> > kswapd reclaim.

I assume you haven't really created the gfp mask correctly. What was the
exact gfp mask you used?

> 
> One thing I tried which can fix the slowness is:
> 
> +	gfp_mask &= ~(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM);
> 
> which explicitly clears the 2 kinds of reclaim. And I thought it's too
> hacky and didn't mention it in the commit log.

Clearing __GFP_DIRECT_RECLAIM would be the right way to achieve
GFP_NOWAIT semantic. Why would you want to exclude kswapd as well?
Feng Tang March 3, 2021, 1:18 p.m. UTC | #5
On Wed, Mar 03, 2021 at 01:32:11PM +0100, Michal Hocko wrote:
> On Wed 03-03-21 20:18:33, Feng Tang wrote:
> > On Wed, Mar 03, 2021 at 08:07:17PM +0800, Tang, Feng wrote:
> > > Hi Michal,
> > > 
> > > On Wed, Mar 03, 2021 at 12:39:57PM +0100, Michal Hocko wrote:
> > > > On Wed 03-03-21 18:20:58, Feng Tang wrote:
> > > > > When doing broader testing, we noticed allocation slowness in one test
> > > > > case that mallocs memory with a size slightly bigger than the free
> > > > > memory of the targeted nodes, but much less than the total free memory
> > > > > of the system.
> > > > > 
> > > > > The reason is that the code enters the slowpath of __alloc_pages_nodemask(),
> > > > > which takes quite some time. Since alloc_pages_policy() will give it a 2nd
> > > > > try with a NULL nodemask, there is no need to enter the slowpath on the
> > > > > first try. Add a new gfp bit to skip the slowpath, so that use cases like
> > > > > this can leverage it.
> > > > > 
> > > > > With it, the malloc in such a case is much accelerated, as it never
> > > > > enters the slowpath.
> > > > > 
> > > > > Adding a new gfp_mask bit is generally disliked; another idea is to add
> > > > > a second nodemask to struct 'alloc_context', so it has two:
> > > > > 'preferred-nmask' and 'fallback-nmask', which would be tried in turn
> > > > > when not NULL. With that, we could call __alloc_pages_nodemask() only once.
> > > > 
> > > > Yes, it is very much disliked. Is there any reason why you cannot use
> > > > GFP_NOWAIT for that purpose?
> > > 
> > > I did try that at first, but it didn't noticeably change the slowness.
> > > I assumed direct reclaim was still involved, as GFP_NOWAIT only affects
> > > kswapd reclaim.
> 
> I assume you haven't really created the gfp mask correctly. What was the
> exact gfp mask you used?

The testcase is a malloc with a multi-preferred-node policy. IIRC, the gfp
mask is GFP_HIGHUSER_MOVABLE originally, and the code here ORs in (__GFP_RETRY_MAYFAIL | __GFP_NOWARN).

As GFP_NOWAIT == __GFP_KSWAPD_RECLAIM, in this test case the bit is already set.

> > 
> > One thing I tried which can fix the slowness is:
> > 
> > +	gfp_mask &= ~(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM);
> > 
> > which explicitly clears the 2 kinds of reclaim. And I thought it's too
> > hacky and didn't mention it in the commit log.
> 
> Clearing __GFP_DIRECT_RECLAIM would be the right way to achieve
> GFP_NOWAIT semantic. Why would you want to exclude kswapd as well? 

When I tried gfp_mask &= ~__GFP_DIRECT_RECLAIM, the slowness couldn't
be fixed.

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs
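
To see why OR-ing GFP_NOWAIT into an existing mask cannot help here, a
minimal standalone sketch of the bit arithmetic follows. The bit values
mirror include/linux/gfp.h of this era and are an assumption to be checked
against the actual tree:

	#include <stdio.h>

	/* assumed values, as in include/linux/gfp.h around this series */
	#define __GFP_DIRECT_RECLAIM	0x400u
	#define __GFP_KSWAPD_RECLAIM	0x800u
	#define __GFP_RECLAIM		(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)
	#define GFP_NOWAIT		__GFP_KSWAPD_RECLAIM

	int main(void)
	{
		/* GFP_HIGHUSER_MOVABLE includes __GFP_RECLAIM, i.e. both bits */
		unsigned int mask = __GFP_RECLAIM;

		mask |= GFP_NOWAIT;	/* no-op: the kswapd bit is already set */
		printf("direct reclaim: %u\n", !!(mask & __GFP_DIRECT_RECLAIM)); /* 1 */

		mask &= ~__GFP_DIRECT_RECLAIM;	/* what actually disables it */
		printf("direct reclaim: %u\n", !!(mask & __GFP_DIRECT_RECLAIM)); /* 0 */
		return 0;
	}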
Feng Tang March 3, 2021, 1:46 p.m. UTC | #6
On Wed, Mar 03, 2021 at 09:18:32PM +0800, Tang, Feng wrote:
> On Wed, Mar 03, 2021 at 01:32:11PM +0100, Michal Hocko wrote:
> > On Wed 03-03-21 20:18:33, Feng Tang wrote:
> > > On Wed, Mar 03, 2021 at 08:07:17PM +0800, Tang, Feng wrote:
> > > > Hi Michal,
> > > > 
> > > > On Wed, Mar 03, 2021 at 12:39:57PM +0100, Michal Hocko wrote:
> > > > > On Wed 03-03-21 18:20:58, Feng Tang wrote:
> > > > > > When doing broader testing, we noticed allocation slowness in one test
> > > > > > case that mallocs memory with a size slightly bigger than the free
> > > > > > memory of the targeted nodes, but much less than the total free memory
> > > > > > of the system.
> > > > > > 
> > > > > > The reason is that the code enters the slowpath of __alloc_pages_nodemask(),
> > > > > > which takes quite some time. Since alloc_pages_policy() will give it a 2nd
> > > > > > try with a NULL nodemask, there is no need to enter the slowpath on the
> > > > > > first try. Add a new gfp bit to skip the slowpath, so that use cases like
> > > > > > this can leverage it.
> > > > > > 
> > > > > > With it, the malloc in such a case is much accelerated, as it never
> > > > > > enters the slowpath.
> > > > > > 
> > > > > > Adding a new gfp_mask bit is generally disliked; another idea is to add
> > > > > > a second nodemask to struct 'alloc_context', so it has two:
> > > > > > 'preferred-nmask' and 'fallback-nmask', which would be tried in turn
> > > > > > when not NULL. With that, we could call __alloc_pages_nodemask() only once.
> > > > > 
> > > > > Yes, it is very much disliked. Is there any reason why you cannot use
> > > > > GFP_NOWAIT for that purpose?
> > > > 
> > > > I did try that at first, but it didn't noticeably change the slowness.
> > > > I assumed direct reclaim was still involved, as GFP_NOWAIT only affects
> > > > kswapd reclaim.
> > 
> > I assume you haven't really created the gfp mask correctly. What was the
> > exact gfp mask you used?
> 
> The testcase is a malloc with a multi-preferred-node policy. IIRC, the gfp
> mask is GFP_HIGHUSER_MOVABLE originally, and the code here ORs in (__GFP_RETRY_MAYFAIL | __GFP_NOWARN).
> 
> As GFP_NOWAIT == __GFP_KSWAPD_RECLAIM, in this test case the bit is already set.
> 
> > > 
> > > One thing I tried which can fix the slowness is:
> > > 
> > > +	gfp_mask &= ~(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM);
> > > 
> > > which explicitly clears the 2 kinds of reclaim. And I thought it's too
> > > hacky and didn't mention it in the commit log.
> > 
> > Clearing __GFP_DIRECT_RECLAIM would be the right way to achieve
> > GFP_NOWAIT semantic. Why would you want to exclude kswapd as well? 
> 
> When I tried gfp_mask &= ~__GFP_DIRECT_RECLAIM, the slowness couldn't
> be fixed.

I just double-checked by rerunning the test: 'gfp_mask &= ~__GFP_DIRECT_RECLAIM'
can also accelerate the allocation a lot, though it is still a little slower
than this patch. It seems I messed up some of the earlier tries; sorry for
the confusion!

Could this be used as the solution? Or the approach of adding another
fallback_nodemask? The latter would change the current API quite a bit.

Thanks,
Feng
Michal Hocko March 3, 2021, 1:53 p.m. UTC | #7
On Wed 03-03-21 21:18:32, Feng Tang wrote:
> On Wed, Mar 03, 2021 at 01:32:11PM +0100, Michal Hocko wrote:
> > On Wed 03-03-21 20:18:33, Feng Tang wrote:
> > > On Wed, Mar 03, 2021 at 08:07:17PM +0800, Tang, Feng wrote:
> > > > Hi Michal,
> > > > 
> > > > On Wed, Mar 03, 2021 at 12:39:57PM +0100, Michal Hocko wrote:
> > > > > On Wed 03-03-21 18:20:58, Feng Tang wrote:
> > > > > > When doing broader testing, we noticed allocation slowness in one test
> > > > > > case that mallocs memory with a size slightly bigger than the free
> > > > > > memory of the targeted nodes, but much less than the total free memory
> > > > > > of the system.
> > > > > > 
> > > > > > The reason is that the code enters the slowpath of __alloc_pages_nodemask(),
> > > > > > which takes quite some time. Since alloc_pages_policy() will give it a 2nd
> > > > > > try with a NULL nodemask, there is no need to enter the slowpath on the
> > > > > > first try. Add a new gfp bit to skip the slowpath, so that use cases like
> > > > > > this can leverage it.
> > > > > > 
> > > > > > With it, the malloc in such a case is much accelerated, as it never
> > > > > > enters the slowpath.
> > > > > > 
> > > > > > Adding a new gfp_mask bit is generally disliked; another idea is to add
> > > > > > a second nodemask to struct 'alloc_context', so it has two:
> > > > > > 'preferred-nmask' and 'fallback-nmask', which would be tried in turn
> > > > > > when not NULL. With that, we could call __alloc_pages_nodemask() only once.
> > > > > 
> > > > > Yes, it is very much disliked. Is there any reason why you cannot use
> > > > > GFP_NOWAIT for that purpose?
> > > > 
> > > > I did try that at first, but it didn't noticeably change the slowness.
> > > > I assumed direct reclaim was still involved, as GFP_NOWAIT only affects
> > > > kswapd reclaim.
> > 
> > I assume you haven't really created the gfp mask correctly. What was the
> > exact gfp mask you used?
> 
> The testcase is a malloc with a multi-preferred-node policy. IIRC, the gfp
> mask is GFP_HIGHUSER_MOVABLE originally, and the code here ORs in (__GFP_RETRY_MAYFAIL | __GFP_NOWARN).
> 
> As GFP_NOWAIT == __GFP_KSWAPD_RECLAIM, in this test case the bit is already set.

Yes, you have to clear the gfp flag for direct reclaim. I can see
how that can be confusing, though.
 
> > > One thing I tried which can fix the slowness is:
> > > 
> > > +	gfp_mask &= ~(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM);
> > > 
> > > which explicitly clears the 2 kinds of reclaim. And I thought it's too
> > > hacky and didn't mention it in the commit log.
> > 
> > Clearing __GFP_DIRECT_RECLAIM would be the right way to achieve
> > GFP_NOWAIT semantic. Why would you want to exclude kswapd as well? 
> 
> When I tried gfp_mask &= ~__GFP_DIRECT_RECLAIM, the slowness couldn't
> be fixed.

OK, I thought that you wanted to prevent direct reclaim because that
is the usual suspect for a slowdown. If this is not related to
direct reclaim, then please try to find out what the actual bottleneck
is. Also, how big of a slowdown are we talking about here?
Michal Hocko March 3, 2021, 1:59 p.m. UTC | #8
On Wed 03-03-21 21:46:44, Feng Tang wrote:
> On Wed, Mar 03, 2021 at 09:18:32PM +0800, Tang, Feng wrote:
> > On Wed, Mar 03, 2021 at 01:32:11PM +0100, Michal Hocko wrote:
> > > On Wed 03-03-21 20:18:33, Feng Tang wrote:
[...]
> > > > One thing I tried which can fix the slowness is:
> > > > 
> > > > +	gfp_mask &= ~(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM);
> > > > 
> > > > which explicitly clears the 2 kinds of reclaim. And I thought it's too
> > > > hacky and didn't mention it in the commit log.
> > > 
> > > Clearing __GFP_DIRECT_RECLAIM would be the right way to achieve
> > > GFP_NOWAIT semantic. Why would you want to exclude kswapd as well? 
> > 
> > When I tried gfp_mask &= ~__GFP_DIRECT_RECLAIM, the slowness couldn't
> > be fixed.
> 
> I just double-checked by rerunning the test: 'gfp_mask &= ~__GFP_DIRECT_RECLAIM'
> can also accelerate the allocation a lot, though it is still a little slower
> than this patch. It seems I messed up some of the earlier tries; sorry for
> the confusion!
> 
> Could this be used as the solution? Or the approach of adding another
> fallback_nodemask? The latter would change the current API quite a bit.

I haven't got to the whole series yet. The real question is whether the
first attempt to enforce the preferred mask is a general win. I would
argue that it resembles the existing single node preferred memory policy
because that one doesn't push heavily on the preferred node either. So
dropping just the direct reclaim mode makes some sense to me.

IIRC this is something I was recommending in an early proposal of the
feature.
Ben Widawsky March 3, 2021, 4:31 p.m. UTC | #9
On 21-03-03 14:59:35, Michal Hocko wrote:
> On Wed 03-03-21 21:46:44, Feng Tang wrote:
> > On Wed, Mar 03, 2021 at 09:18:32PM +0800, Tang, Feng wrote:
> > > On Wed, Mar 03, 2021 at 01:32:11PM +0100, Michal Hocko wrote:
> > > > On Wed 03-03-21 20:18:33, Feng Tang wrote:
> [...]
> > > > > One thing I tried which can fix the slowness is:
> > > > > 
> > > > > +	gfp_mask &= ~(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM);
> > > > > 
> > > > > which explicitly clears the 2 kinds of reclaim. And I thought it's too
> > > > > hacky and didn't mention it in the commit log.
> > > > 
> > > > Clearing __GFP_DIRECT_RECLAIM would be the right way to achieve
> > > > GFP_NOWAIT semantic. Why would you want to exclude kswapd as well? 
> > > 
> > > When I tried gfp_mask &= ~__GFP_DIRECT_RECLAIM, the slowness couldn't
> > > be fixed.
> > 
> > I just double-checked by rerunning the test: 'gfp_mask &= ~__GFP_DIRECT_RECLAIM'
> > can also accelerate the allocation a lot, though it is still a little slower
> > than this patch. It seems I messed up some of the earlier tries; sorry for
> > the confusion!
> > 
> > Could this be used as the solution? Or the approach of adding another
> > fallback_nodemask? The latter would change the current API quite a bit.
> 
> I haven't got to the whole series yet. The real question is whether the
> first attempt to enforce the preferred mask is a general win. I would
> argue that it resembles the existing single node preferred memory policy
> because that one doesn't push heavily on the preferred node either. So
> dropping just the direct reclaim mode makes some sense to me.
> 
> IIRC this is something I was recommending in an early proposal of the
> feature.

My assumption [FWIW] is that the usecases we've outlined for multi-preferred
would want more heavy pushing on the preference mask. However, maybe the uapi
could dictate how hard to try/not try.
Dave Hansen March 3, 2021, 4:48 p.m. UTC | #10
On 3/3/21 8:31 AM, Ben Widawsky wrote:
>> I haven't got to the whole series yet. The real question is whether the
>> first attempt to enforce the preferred mask is a general win. I would
>> argue that it resembles the existing single node preferred memory policy
>> because that one doesn't push heavily on the preferred node either. So
>> dropping just the direct reclaim mode makes some sense to me.
>>
>> IIRC this is something I was recommending in an early proposal of the
>> feature.
> My assumption [FWIW] is that the usecases we've outlined for multi-preferred
> would want more heavy pushing on the preference mask. However, maybe the uapi
> could dictate how hard to try/not try.

There are two things that I think are important:

1. MPOL_PREFERRED_MANY fallback away from the preferred nodes should be
   *temporary*, even in the face of the preferred set being full.  That
   means that _some_ reclaim needs to be done.  Kicking off kswapd is
   fine for this.
2. MPOL_PREFERRED_MANY behavior should resemble MPOL_PREFERRED as
   closely as possible.  We're just going to confuse users if they set a
   single node in a MPOL_PREFERRED_MANY mask and get different behavior
   from MPOL_PREFERRED.

While it would be nice, short-term, to steer MPOL_PREFERRED_MANY
behavior toward how we expect it to get used first, I think it's a
mistake if we do it at the cost of long-term divergence from MPOL_PREFERRED.
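
As a concrete reading of point 2, a user setting a single node with the new
policy might do something like the sketch below. MPOL_PREFERRED_MANY is not
upstream at this point, so the constant and its value here are assumptions
taken from this series; set_mempolicy() comes from libnuma's <numaif.h>:

	#include <stdio.h>
	#include <numaif.h>		/* set_mempolicy(), MPOL_* */

	#ifndef MPOL_PREFERRED_MANY
	#define MPOL_PREFERRED_MANY 5	/* assumed value; not ABI yet */
	#endif

	int main(void)
	{
		unsigned long nodes = 1UL << 0;	/* only node 0 in the mask */

		/* Per point 2 above, this should behave like MPOL_PREFERRED
		 * with preferred node 0. */
		if (set_mempolicy(MPOL_PREFERRED_MANY, &nodes,
				  sizeof(nodes) * 8))
			perror("set_mempolicy");
		return 0;
	}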
Michal Hocko March 3, 2021, 5:14 p.m. UTC | #11
On Wed 03-03-21 08:31:41, Ben Widawsky wrote:
> On 21-03-03 14:59:35, Michal Hocko wrote:
> > On Wed 03-03-21 21:46:44, Feng Tang wrote:
> > > On Wed, Mar 03, 2021 at 09:18:32PM +0800, Tang, Feng wrote:
> > > > On Wed, Mar 03, 2021 at 01:32:11PM +0100, Michal Hocko wrote:
> > > > > On Wed 03-03-21 20:18:33, Feng Tang wrote:
> > [...]
> > > > > > One thing I tried which can fix the slowness is:
> > > > > > 
> > > > > > +	gfp_mask &= ~(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM);
> > > > > > 
> > > > > > which explicitly clears the 2 kinds of reclaim. And I thought it's too
> > > > > > hacky and didn't mention it in the commit log.
> > > > > 
> > > > > Clearing __GFP_DIRECT_RECLAIM would be the right way to achieve
> > > > > GFP_NOWAIT semantic. Why would you want to exclude kswapd as well? 
> > > > 
> > > > When I tried gfp_mask &= ~__GFP_DIRECT_RECLAIM, the slowness couldn't
> > > > be fixed.
> > > 
> > > I just double-checked by rerunning the test: 'gfp_mask &= ~__GFP_DIRECT_RECLAIM'
> > > can also accelerate the allocation a lot, though it is still a little slower
> > > than this patch. It seems I messed up some of the earlier tries; sorry for
> > > the confusion!
> > > 
> > > Could this be used as the solution? Or the approach of adding another
> > > fallback_nodemask? The latter would change the current API quite a bit.
> > 
> > I haven't got to the whole series yet. The real question is whether the
> > first attempt to enforce the preferred mask is a general win. I would
> > argue that it resembles the existing single node preferred memory policy
> > because that one doesn't push heavily on the preferred node either. So
> > dropping just the direct reclaim mode makes some sense to me.
> > 
> > IIRC this is something I was recommending in an early proposal of the
> > feature.
> 
> My assumption [FWIW] is that the usecases we've outlined for multi-preferred
> would want more heavy pushing on the preference mask. However, maybe the uapi
> could dictate how hard to try/not try.

What does that mean and what is the expectation from the kernel to be
more or less cast in stone?
Ben Widawsky March 3, 2021, 5:22 p.m. UTC | #12
On 21-03-03 18:14:30, Michal Hocko wrote:
> On Wed 03-03-21 08:31:41, Ben Widawsky wrote:
> > On 21-03-03 14:59:35, Michal Hocko wrote:
> > > On Wed 03-03-21 21:46:44, Feng Tang wrote:
> > > > On Wed, Mar 03, 2021 at 09:18:32PM +0800, Tang, Feng wrote:
> > > > > On Wed, Mar 03, 2021 at 01:32:11PM +0100, Michal Hocko wrote:
> > > > > > On Wed 03-03-21 20:18:33, Feng Tang wrote:
> > > [...]
> > > > > > > One thing I tried which can fix the slowness is:
> > > > > > > 
> > > > > > > +	gfp_mask &= ~(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM);
> > > > > > > 
> > > > > > > which explicitly clears the 2 kinds of reclaim. And I thought it's too
> > > > > > > hacky and didn't mention it in the commit log.
> > > > > > 
> > > > > > Clearing __GFP_DIRECT_RECLAIM would be the right way to achieve
> > > > > > GFP_NOWAIT semantic. Why would you want to exclude kswapd as well? 
> > > > > 
> > > > > When I tried gfp_mask &= ~__GFP_DIRECT_RECLAIM, the slowness couldn't
> > > > > be fixed.
> > > > 
> > > > I just double-checked by rerunning the test: 'gfp_mask &= ~__GFP_DIRECT_RECLAIM'
> > > > can also accelerate the allocation a lot, though it is still a little slower
> > > > than this patch. It seems I messed up some of the earlier tries; sorry for
> > > > the confusion!
> > > > 
> > > > Could this be used as the solution? Or the approach of adding another
> > > > fallback_nodemask? The latter would change the current API quite a bit.
> > > 
> > > I haven't got to the whole series yet. The real question is whether the
> > > first attempt to enforce the preferred mask is a general win. I would
> > > argue that it resembles the existing single node preferred memory policy
> > > because that one doesn't push heavily on the preferred node either. So
> > > dropping just the direct reclaim mode makes some sense to me.
> > > 
> > > IIRC this is something I was recommending in an early proposal of the
> > > feature.
> > 
> > My assumption [FWIW] is that the usecases we've outlined for multi-preferred
> > would want more heavy pushing on the preference mask. However, maybe the uapi
> > could dictate how hard to try/not try.
> 
> What does that mean and what is the expectation from the kernel to be
> more or less cast in stone?
> 

(I'm not positive I've understood your question, so correct me if I
misunderstood)

I'm not sure there is a stone-cast way to define it nor should we. At the very
least though, something in uapi that has a general mapping to GFP flags
(specifically around reclaim) for the first round of allocation could make
sense.

In my head there are 3 levels of request possible for multiple nodes:
1. BIND: Those nodes or die.
2. Preferred hard: Those nodes and I'm willing to wait. Fallback if impossible.
3. Preferred soft: Those nodes but I don't want to wait.

Current UAPI in the series doesn't define a distinction between 2, and 3. As I
understand the change, Feng is defining the behavior to be #3, which makes #2
not an option. I sort of punted on defining it entirely, in the beginning.
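
To make the three levels concrete (none of this is proposed uapi; the enum
and helper names are made up for illustration), they could map onto the gfp
mask of the first allocation attempt roughly as:

	enum pref_strength {
		PREF_BIND,	/* 1. those nodes or die */
		PREF_HARD,	/* 2. willing to wait; fall back only if impossible */
		PREF_SOFT,	/* 3. don't want to wait */
	};

	static gfp_t first_try_gfp(gfp_t gfp, enum pref_strength s)
	{
		switch (s) {
		case PREF_BIND:	/* full reclaim, and no second round at all */
			return gfp;
		case PREF_HARD:	/* allow direct reclaim before falling back */
			return gfp | __GFP_RETRY_MAYFAIL | __GFP_NOWARN;
		case PREF_SOFT:	/* kick kswapd only, fail over quickly */
		default:
			return (gfp | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM;
		}
	}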
Feng Tang March 4, 2021, 8:14 a.m. UTC | #13
On Wed, Mar 03, 2021 at 09:22:50AM -0800, Ben Widawsky wrote:
> On 21-03-03 18:14:30, Michal Hocko wrote:
> > On Wed 03-03-21 08:31:41, Ben Widawsky wrote:
> > > On 21-03-03 14:59:35, Michal Hocko wrote:
> > > > On Wed 03-03-21 21:46:44, Feng Tang wrote:
> > > > > On Wed, Mar 03, 2021 at 09:18:32PM +0800, Tang, Feng wrote:
> > > > > > On Wed, Mar 03, 2021 at 01:32:11PM +0100, Michal Hocko wrote:
> > > > > > > On Wed 03-03-21 20:18:33, Feng Tang wrote:
> > > > [...]
> > > > > > > > One thing I tried which can fix the slowness is:
> > > > > > > > 
> > > > > > > > +	gfp_mask &= ~(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM);
> > > > > > > > 
> > > > > > > > which explicitly clears the 2 kinds of reclaim. And I thought it's too
> > > > > > > > hacky and didn't mention it in the commit log.
> > > > > > > 
> > > > > > > Clearing __GFP_DIRECT_RECLAIM would be the right way to achieve
> > > > > > > GFP_NOWAIT semantic. Why would you want to exclude kswapd as well? 
> > > > > > 
> > > > > > When I tried gfp_mask &= ~__GFP_DIRECT_RECLAIM, the slowness couldn't
> > > > > > be fixed.
> > > > > 
> > > > > I just double-checked by rerunning the test: 'gfp_mask &= ~__GFP_DIRECT_RECLAIM'
> > > > > can also accelerate the allocation a lot, though it is still a little slower
> > > > > than this patch. It seems I messed up some of the earlier tries; sorry for
> > > > > the confusion!
> > > > > 
> > > > > Could this be used as the solution? Or the approach of adding another
> > > > > fallback_nodemask? The latter would change the current API quite a bit.
> > > > 
> > > > I haven't got to the whole series yet. The real question is whether the
> > > > first attempt to enforce the preferred mask is a general win. I would
> > > > argue that it resembles the existing single node preferred memory policy
> > > > because that one doesn't push heavily on the preferred node either. So
> > > > dropping just the direct reclaim mode makes some sense to me.
> > > > 
> > > > IIRC this is something I was recommending in an early proposal of the
> > > > feature.
> > > 
> > > My assumption [FWIW] is that the usecases we've outlined for multi-preferred
> > > would want more heavy pushing on the preference mask. However, maybe the uapi
> > > could dictate how hard to try/not try.
> > 
> > What does that mean and what is the expectation from the kernel to be
> > more or less cast in stone?
> > 
> 
> (I'm not positive I've understood your question, so correct me if I
> misunderstood)
> 
> I'm not sure there is a stone-cast way to define it nor should we. At the very
> least though, something in uapi that has a general mapping to GFP flags
> (specifically around reclaim) for the first round of allocation could make
> sense.
> 
> In my head there are 3 levels of request possible for multiple nodes:
> 1. BIND: Those nodes or die.
> 2. Preferred hard: Those nodes and I'm willing to wait. Fallback if impossible.
> 3. Preferred soft: Those nodes but I don't want to wait.
> 
> Current UAPI in the series doesn't define a distinction between 2, and 3. As I
> understand the change, Feng is defining the behavior to be #3, which makes #2
> not an option. I sort of punted on defining it entirely, in the beginning.

As discussed earlier in the thread, one less hacky solution is to clear the
__GFP_DIRECT_RECLAIM bit so that the first try won't go into direct reclaim
but still wakes up kswapd on the target nodes, and then retry; this sits
between 'Preferred hard' and 'Preferred soft' :)

For the current MPOL_PREFERRED, the semantic is also 'Preferred hard', in that it
will check the free memory of other nodes before entering the slowpath to wait.

Thanks,
Feng
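
Schematically, the two-round flow described above would sit in
alloc_pages_policy() roughly as follows (a simplified sketch of the series'
logic, not a quote from it):

	if (pol->mode == MPOL_PREFERRED_MANY) {
		/* Round 1: preferred nodes only; kswapd may be woken, but
		 * direct reclaim is skipped since round 2 follows anyway. */
		page = __alloc_pages_nodemask(gfp & ~__GFP_DIRECT_RECLAIM,
					      order, preferred_nid,
					      policy_nodemask(gfp, pol));
		if (page)
			return page;
		/* Round 2: local node preferred, NULL (i.e. full) nodemask */
		return __alloc_pages_nodemask(gfp, order, numa_node_id(), NULL);
	}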
Michal Hocko March 4, 2021, 12:57 p.m. UTC | #14
On Wed 03-03-21 09:22:50, Ben Widawsky wrote:
> On 21-03-03 18:14:30, Michal Hocko wrote:
> > On Wed 03-03-21 08:31:41, Ben Widawsky wrote:
> > > On 21-03-03 14:59:35, Michal Hocko wrote:
> > > > On Wed 03-03-21 21:46:44, Feng Tang wrote:
> > > > > On Wed, Mar 03, 2021 at 09:18:32PM +0800, Tang, Feng wrote:
> > > > > > On Wed, Mar 03, 2021 at 01:32:11PM +0100, Michal Hocko wrote:
> > > > > > > On Wed 03-03-21 20:18:33, Feng Tang wrote:
> > > > [...]
> > > > > > > > One thing I tried which can fix the slowness is:
> > > > > > > > 
> > > > > > > > +	gfp_mask &= ~(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM);
> > > > > > > > 
> > > > > > > > which explicitly clears the 2 kinds of reclaim. And I thought it's too
> > > > > > > > hacky and didn't mention it in the commit log.
> > > > > > > 
> > > > > > > Clearing __GFP_DIRECT_RECLAIM would be the right way to achieve
> > > > > > > GFP_NOWAIT semantic. Why would you want to exclude kswapd as well? 
> > > > > > 
> > > > > > When I tried gfp_mask &= ~__GFP_DIRECT_RECLAIM, the slowness couldn't
> > > > > > be fixed.
> > > > > 
> > > > > > I just double-checked by rerunning the test: 'gfp_mask &= ~__GFP_DIRECT_RECLAIM'
> > > > > > can also accelerate the allocation a lot, though it is still a little slower
> > > > > > than this patch. It seems I messed up some of the earlier tries; sorry for
> > > > > > the confusion!
> > > > > > 
> > > > > > Could this be used as the solution? Or the approach of adding another
> > > > > > fallback_nodemask? The latter would change the current API quite a bit.
> > > > 
> > > > I haven't got to the whole series yet. The real question is whether the
> > > > first attempt to enforce the preferred mask is a general win. I would
> > > > argue that it resembles the existing single node preferred memory policy
> > > > because that one doesn't push heavily on the preferred node either. So
> > > > dropping just the direct reclaim mode makes some sense to me.
> > > > 
> > > > IIRC this is something I was recommending in an early proposal of the
> > > > feature.
> > > 
> > > My assumption [FWIW] is that the usecases we've outlined for multi-preferred
> > > would want more heavy pushing on the preference mask. However, maybe the uapi
> > > could dictate how hard to try/not try.
> > 
> > What does that mean and what is the expectation from the kernel to be
> > more or less cast in stone?
> > 
> 
> (I'm not positive I've understood your question, so correct me if I
> misunderstood)
> 
> I'm not sure there is a stone-cast way to define it nor should we.

OK, I thought you wanted the behavior to diverge from the existing
MPOL_PREFERRED, which only prefers the configured node as a default while
the allocator is free to fall back to any other node under memory
pressure. For multiple preferred nodes the same should apply: only a
lightweight attempt should be made before falling back to the full
nodeset. The paragraph I was replying to is not in line with this,
though.

> At the very
> least though, something in uapi that has a general mapping to GFP flags
> (specifically around reclaim) for the first round of allocation could make
> sense.

I do not think this is a good idea.

> In my head there are 3 levels of request possible for multiple nodes:
> 1. BIND: Those nodes or die.
> 2. Preferred hard: Those nodes and I'm willing to wait. Fallback if impossible.
> 3. Preferred soft: Those nodes but I don't want to wait.

I do agree that an intermediate "preference" can be helpful because
binding is just too strict and the OOM semantic is far from ideal. But
this would need a new policy.
 
> Current UAPI in the series doesn't define a distinction between 2, and 3. As I
> understand the change, Feng is defining the behavior to be #3, which makes #2
> not an option. I sort of punted on defining it entirely, in the beginning.

I really think it should be in line with the existing preferred policy
behavior.
Michal Hocko March 4, 2021, 12:59 p.m. UTC | #15
On Thu 04-03-21 16:14:14, Feng Tang wrote:
> On Wed, Mar 03, 2021 at 09:22:50AM -0800, Ben Widawsky wrote:
> > On 21-03-03 18:14:30, Michal Hocko wrote:
> > > On Wed 03-03-21 08:31:41, Ben Widawsky wrote:
> > > > On 21-03-03 14:59:35, Michal Hocko wrote:
> > > > > On Wed 03-03-21 21:46:44, Feng Tang wrote:
> > > > > > On Wed, Mar 03, 2021 at 09:18:32PM +0800, Tang, Feng wrote:
> > > > > > > On Wed, Mar 03, 2021 at 01:32:11PM +0100, Michal Hocko wrote:
> > > > > > > > On Wed 03-03-21 20:18:33, Feng Tang wrote:
> > > > > [...]
> > > > > > > > > One thing I tried which can fix the slowness is:
> > > > > > > > > 
> > > > > > > > > +	gfp_mask &= ~(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM);
> > > > > > > > > 
> > > > > > > > > which explicitly clears the 2 kinds of reclaim. And I thought it's too
> > > > > > > > > hacky and didn't mention it in the commit log.
> > > > > > > > 
> > > > > > > > Clearing __GFP_DIRECT_RECLAIM would be the right way to achieve
> > > > > > > > GFP_NOWAIT semantic. Why would you want to exclude kswapd as well? 
> > > > > > > 
> > > > > > > When I tried gfp_mask &= ~__GFP_DIRECT_RECLAIM, the slowness couldn't
> > > > > > > be fixed.
> > > > > > 
> > > > > > I just double-checked by rerunning the test: 'gfp_mask &= ~__GFP_DIRECT_RECLAIM'
> > > > > > can also accelerate the allocation a lot, though it is still a little slower
> > > > > > than this patch. It seems I messed up some of the earlier tries; sorry for
> > > > > > the confusion!
> > > > > > 
> > > > > > Could this be used as the solution? Or the approach of adding another
> > > > > > fallback_nodemask? The latter would change the current API quite a bit.
> > > > > 
> > > > > I haven't got to the whole series yet. The real question is whether the
> > > > > first attempt to enforce the preferred mask is a general win. I would
> > > > > argue that it resembles the existing single node preferred memory policy
> > > > > because that one doesn't push heavily on the preferred node either. So
> > > > > dropping just the direct reclaim mode makes some sense to me.
> > > > > 
> > > > > IIRC this is something I was recommending in an early proposal of the
> > > > > feature.
> > > > 
> > > > My assumption [FWIW] is that the usecases we've outlined for multi-preferred
> > > > would want more heavy pushing on the preference mask. However, maybe the uapi
> > > > could dictate how hard to try/not try.
> > > 
> > > What does that mean and what is the expectation from the kernel to be
> > > more or less cast in stone?
> > > 
> > 
> > (I'm not positive I've understood your question, so correct me if I
> > misunderstood)
> > 
> > I'm not sure there is a stone-cast way to define it nor should we. At the very
> > least though, something in uapi that has a general mapping to GFP flags
> > (specifically around reclaim) for the first round of allocation could make
> > sense.
> > 
> > In my head there are 3 levels of request possible for multiple nodes:
> > 1. BIND: Those nodes or die.
> > 2. Preferred hard: Those nodes and I'm willing to wait. Fallback if impossible.
> > 3. Preferred soft: Those nodes but I don't want to wait.
> > 
> > Current UAPI in the series doesn't define a distinction between 2, and 3. As I
> > understand the change, Feng is defining the behavior to be #3, which makes #2
> > not an option. I sort of punted on defining it entirely, in the beginning.
> 
> As discussed earlier in the thread, one less hacky solution is to clear the
> __GFP_DIRECT_RECLAIM bit so that the first try won't go into direct reclaim
> but still wakes up kswapd on the target nodes, and then retry; this sits
> between 'Preferred hard' and 'Preferred soft' :)

Yes, that is what I had in mind when talking about a lightweight
attempt.

> For the current MPOL_PREFERRED, the semantic is also 'Preferred hard', in that it

Did you mean to say prefer soft? Because the direct reclaim is attempted
only when node reclaim is enabled.

> will check the free memory of other nodes before entering the slowpath to wait.

Yes, hence "soft" semantic.
Feng Tang March 5, 2021, 2:21 a.m. UTC | #16
On Thu, Mar 04, 2021 at 01:59:40PM +0100, Michal Hocko wrote:
> On Thu 04-03-21 16:14:14, Feng Tang wrote:
> > On Wed, Mar 03, 2021 at 09:22:50AM -0800, Ben Widawsky wrote:
> > > On 21-03-03 18:14:30, Michal Hocko wrote:
> > > > On Wed 03-03-21 08:31:41, Ben Widawsky wrote:
> > > > > On 21-03-03 14:59:35, Michal Hocko wrote:
> > > > > > On Wed 03-03-21 21:46:44, Feng Tang wrote:
> > > > > > > On Wed, Mar 03, 2021 at 09:18:32PM +0800, Tang, Feng wrote:
> > > > > > > > On Wed, Mar 03, 2021 at 01:32:11PM +0100, Michal Hocko wrote:
> > > > > > > > > On Wed 03-03-21 20:18:33, Feng Tang wrote:
> > > > > > [...]
> > > > > > > > > > One thing I tried which can fix the slowness is:
> > > > > > > > > > 
> > > > > > > > > > +	gfp_mask &= ~(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM);
> > > > > > > > > > 
> > > > > > > > > > which explicitly clears the 2 kinds of reclaim. And I thought it's too
> > > > > > > > > > hacky and didn't mention it in the commit log.
> > > > > > > > > 
> > > > > > > > > Clearing __GFP_DIRECT_RECLAIM would be the right way to achieve
> > > > > > > > > GFP_NOWAIT semantic. Why would you want to exclude kswapd as well? 
> > > > > > > > 
> > > > > > > > When I tried gfp_mask &= ~__GFP_DIRECT_RECLAIM, the slowness couldn't
> > > > > > > > be fixed.
> > > > > > > 
> > > > > > > I just double-checked by rerunning the test: 'gfp_mask &= ~__GFP_DIRECT_RECLAIM'
> > > > > > > can also accelerate the allocation a lot, though it is still a little slower
> > > > > > > than this patch. It seems I messed up some of the earlier tries; sorry for
> > > > > > > the confusion!
> > > > > > > 
> > > > > > > Could this be used as the solution? Or the approach of adding another
> > > > > > > fallback_nodemask? The latter would change the current API quite a bit.
> > > > > > 
> > > > > > I haven't got to the whole series yet. The real question is whether the
> > > > > > first attempt to enforce the preferred mask is a general win. I would
> > > > > > argue that it resembles the existing single node preferred memory policy
> > > > > > because that one doesn't push heavily on the preferred node either. So
> > > > > > dropping just the direct reclaim mode makes some sense to me.
> > > > > > 
> > > > > > IIRC this is something I was recommending in an early proposal of the
> > > > > > feature.
> > > > > 
> > > > > My assumption [FWIW] is that the usecases we've outlined for multi-preferred
> > > > > would want more heavy pushing on the preference mask. However, maybe the uapi
> > > > > could dictate how hard to try/not try.
> > > > 
> > > > What does that mean and what is the expectation from the kernel to be
> > > > more or less cast in stone?
> > > > 
> > > 
> > > (I'm not positive I've understood your question, so correct me if I
> > > misunderstood)
> > > 
> > > I'm not sure there is a stone-cast way to define it nor should we. At the very
> > > least though, something in uapi that has a general mapping to GFP flags
> > > (specifically around reclaim) for the first round of allocation could make
> > > sense.
> > > 
> > > In my head there are 3 levels of request possible for multiple nodes:
> > > 1. BIND: Those nodes or die.
> > > 2. Preferred hard: Those nodes and I'm willing to wait. Fallback if impossible.
> > > 3. Preferred soft: Those nodes but I don't want to wait.
> > > 
> > > Current UAPI in the series doesn't define a distinction between 2, and 3. As I
> > > understand the change, Feng is defining the behavior to be #3, which makes #2
> > > not an option. I sort of punted on defining it entirely, in the beginning.
> > 
> > As discussed earlier in the thread, one less hacky solution is to clear the
> > __GFP_DIRECT_RECLAIM bit so that the first try won't go into direct reclaim
> > but still wakes up kswapd on the target nodes, and then retry; this sits
> > between 'Preferred hard' and 'Preferred soft' :)
> 
> Yes, that is what I had in mind when talking about a lightweight
> attempt.
> 
> > For the current MPOL_PREFERRED, the semantic is also 'Preferred hard', in that it
> 
> Did you mean to say prefer soft? Because the direct reclaim is attempted
> only when node reclaim is enabled.
> 
> > will check the free memory of other nodes before entering the slowpath to wait.
> 
> Yes, hence "soft" semantic.

Yes, it's the #3 item: 'Preferred soft' 

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs
Feng Tang March 10, 2021, 5:19 a.m. UTC | #17
On Wed, Mar 03, 2021 at 08:48:58AM -0800, Dave Hansen wrote:
> On 3/3/21 8:31 AM, Ben Widawsky wrote:
> >> I haven't got to the whole series yet. The real question is whether the
> >> first attempt to enforce the preferred mask is a general win. I would
> >> argue that it resembles the existing single node preferred memory policy
> >> because that one doesn't push heavily on the preferred node either. So
> >> dropping just the direct reclaim mode makes some sense to me.
> >>
> >> IIRC this is something I was recommending in an early proposal of the
> >> feature.
> > My assumption [FWIW] is that the usecases we've outlined for multi-preferred
> > would want more heavy pushing on the preference mask. However, maybe the uapi
> > could dictate how hard to try/not try.
> 
> There are two things that I think are important:
> 
> 1. MPOL_PREFERRED_MANY fallback away from the preferred nodes should be
>    *temporary*, even in the face of the preferred set being full.  That
>    means that _some_ reclaim needs to be done.  Kicking off kswapd is
>    fine for this.
> 2. MPOL_PREFERRED_MANY behavior should resemble MPOL_PREFERRED as
>    closely as possible.  We're just going to confuse users if they set a
>    single node in a MPOL_PREFERRED_MANY mask and get different behavior
>    from MPOL_PREFERRED.
> 
> While it would be nice, short-term, to steer MPOL_PREFERRED_MANY
> behavior toward how we expect it to get used first, I think it's a
> mistake if we do it at the cost of long-term divergence from MPOL_PREFERRED.

Hi All,

Based on the discussion, I have updated the patch as below; please review, thanks.


From ea9e32fa8b6eff4a64d790b856e044adb30f04b5 Mon Sep 17 00:00:00 2001
From: Feng Tang <feng.tang@intel.com>
Date: Wed, 10 Mar 2021 12:31:24 +0800
Subject: [PATCH] mm/mempolicy: speedup page alloc for MPOL_PREFERRED_MANY

When doing broader testing, we noticed allocation slowness in one test
case that mallocs memory with a size slightly bigger than the free
memory of the targeted nodes, but much less than the total free memory
of the system.

The reason is that the code enters the slowpath of __alloc_pages_nodemask(),
which takes quite some time.

Since alloc_pages_policy() will give it a 2nd try with a NULL nodemask,
we tried a solution which creates a new gfp_mask bit, __GFP_NO_SLOWPATH,
to explicitly skip entering the slowpath on the first try, but that is
brutal and costs one precious gfp mask bit.

Based on the discussion with Michal/Ben/Dave [1], only skip entering direct
reclaim while still allowing the allocation to wake up kswapd; this fixes
the slowness and brings MPOL_PREFERRED_MANY closer to the semantics of
MPOL_PREFERRED, while avoiding a new gfp bit.

[1]. https://lore.kernel.org/lkml/1614766858-90344-15-git-send-email-feng.tang@intel.com/
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/mempolicy.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d66c1c0..00b19f7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2205,9 +2205,13 @@ static struct page *alloc_pages_policy(struct mempolicy *pol, gfp_t gfp,
 	 * | MPOL_PREFERRED_MANY (round 2) | local         | NULL       |
 	 * +-------------------------------+---------------+------------+
 	 */
-	if (pol->mode == MPOL_PREFERRED_MANY)
+	if (pol->mode == MPOL_PREFERRED_MANY) {
 		gfp_mask |= __GFP_RETRY_MAYFAIL | __GFP_NOWARN;
 
+		/* Skip direct reclaim, as there will be a second try */
+		gfp_mask &= ~__GFP_DIRECT_RECLAIM;
+	}
+
 	page = __alloc_pages_nodemask(gfp_mask, order,
 				      policy_node(gfp, pol, preferred_nid),
 				      policy_nodemask(gfp, pol));
Michal Hocko March 10, 2021, 9:44 a.m. UTC | #18
On Wed 10-03-21 13:19:47, Feng Tang wrote:
[...]
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index d66c1c0..00b19f7 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2205,9 +2205,13 @@ static struct page *alloc_pages_policy(struct mempolicy *pol, gfp_t gfp,
>  	 * | MPOL_PREFERRED_MANY (round 2) | local         | NULL       |
>  	 * +-------------------------------+---------------+------------+
>  	 */
> -	if (pol->mode == MPOL_PREFERRED_MANY)
> +	if (pol->mode == MPOL_PREFERRED_MANY) {
>  		gfp_mask |= __GFP_RETRY_MAYFAIL | __GFP_NOWARN;
>  
> +		/* Skip direct reclaim, as there will be a second try */
> +		gfp_mask &= ~__GFP_DIRECT_RECLAIM;

__GFP_RETRY_MAYFAIL is a reclaim modifier which doesn't make any sense
without __GFP_DIRECT_RECLAIM. Also, I think it would be better to have
the proper allocation flags in the initial patch which implements the
fallback.

> +	}
> +
>  	page = __alloc_pages_nodemask(gfp_mask, order,
>  				      policy_node(gfp, pol, preferred_nid),
>  				      policy_nodemask(gfp, pol));
> -- 
> 2.7.4
> 
>
Feng Tang March 10, 2021, 11:49 a.m. UTC | #19
On Wed, Mar 10, 2021 at 10:44:11AM +0100, Michal Hocko wrote:
> On Wed 10-03-21 13:19:47, Feng Tang wrote:
> [...]
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index d66c1c0..00b19f7 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2205,9 +2205,13 @@ static struct page *alloc_pages_policy(struct mempolicy *pol, gfp_t gfp,
> >  	 * | MPOL_PREFERRED_MANY (round 2) | local         | NULL       |
> >  	 * +-------------------------------+---------------+------------+
> >  	 */
> > -	if (pol->mode == MPOL_PREFERRED_MANY)
> > +	if (pol->mode == MPOL_PREFERRED_MANY) {
> >  		gfp_mask |= __GFP_RETRY_MAYFAIL | __GFP_NOWARN;
> >  
> > +		/* Skip direct reclaim, as there will be a second try */
> > +		gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> 
> __GFP_RETRY_MAYFAIL is a reclaim modifier which doesn't make any sense
> without __GFP_DIRECT_RECLAIM. Also I think it would be better to have a
> proper allocation flags in the initial patch which implements the
> fallback.

OK, I will remove the __GFP_RETRY_MAYFAIL setting and fold this into the
previous patch (8/14).

Thanks,
Feng

> > +	}
> > +
> >  	page = __alloc_pages_nodemask(gfp_mask, order,
> >  				      policy_node(gfp, pol, preferred_nid),
> >  				      policy_nodemask(gfp, pol));
> > -- 
> > 2.7.4
> > 
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs
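
The folded hunk might then end up looking something like this (a guess at
the v4 shape, not the actual respin):

	if (pol->mode == MPOL_PREFERRED_MANY) {
		/* Wake kswapd but skip direct reclaim on the first try;
		 * __GFP_RETRY_MAYFAIL is dropped because it has no effect
		 * without __GFP_DIRECT_RECLAIM. */
		gfp_mask |= __GFP_NOWARN;
		gfp_mask &= ~__GFP_DIRECT_RECLAIM;
	}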

Patch

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6e479e9..81bacbe 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -39,8 +39,9 @@  struct vm_area_struct;
 #define ___GFP_HARDWALL		0x100000u
 #define ___GFP_THISNODE		0x200000u
 #define ___GFP_ACCOUNT		0x400000u
+#define ___GFP_NO_SLOWPATH	0x800000u
 #ifdef CONFIG_LOCKDEP
-#define ___GFP_NOLOCKDEP	0x800000u
+#define ___GFP_NOLOCKDEP	0x1000000u
 #else
 #define ___GFP_NOLOCKDEP	0
 #endif
@@ -220,11 +221,15 @@  struct vm_area_struct;
 #define __GFP_COMP	((__force gfp_t)___GFP_COMP)
 #define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)
 
+/* Do not go into the slowpath */
+#define __GFP_NO_SLOWPATH	((__force gfp_t)___GFP_NO_SLOWPATH)
+
 /* Disable lockdep for GFP context tracking */
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
 
+
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (24 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /**
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d66c1c0..e84b56d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2206,7 +2206,7 @@  static struct page *alloc_pages_policy(struct mempolicy *pol, gfp_t gfp,
 	 * +-------------------------------+---------------+------------+
 	 */
 	if (pol->mode == MPOL_PREFERRED_MANY)
-		gfp_mask |= __GFP_RETRY_MAYFAIL | __GFP_NOWARN;
+		gfp_mask |= __GFP_RETRY_MAYFAIL | __GFP_NOWARN | __GFP_NO_SLOWPATH;
 
 	page = __alloc_pages_nodemask(gfp_mask, order,
 				      policy_node(gfp, pol, preferred_nid),
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 519a60d..969e3a1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4993,7 +4993,7 @@  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 
 	/* First allocation attempt */
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
-	if (likely(page))
+	if (likely(page) || (gfp_mask & __GFP_NO_SLOWPATH))
 		goto out;
 
 	/*