[rfc] mm, hugetlb: allow hugepage allocations to excessively reclaim
diff mbox series

Message ID alpine.DEB.2.21.1910021556270.187014@chino.kir.corp.google.com
State New
Headers show
Series
  • [rfc] mm, hugetlb: allow hugepage allocations to excessively reclaim
Related show

Commit Message

David Rientjes Oct. 2, 2019, 11:03 p.m. UTC
Hugetlb allocations use __GFP_RETRY_MAYFAIL to aggressively attempt to get 
hugepages that the user needs.  Commit b39d0ee2632d ("mm, page_alloc: 
avoid expensive reclaim when compaction may not succeed") intends to 
improve allocator behind for thp allocations to prevent excessive amounts 
of reclaim especially when constrained to a single node.

Since hugetlb allocations have explicitly preferred to loop and do reclaim 
and compaction, exempt them from this new behavior at least for the time 
being.  It is not shown that hugetlb allocation success rate has been 
impacted by commit b39d0ee2632d but hugetlb allocations are admittedly 
beyond the scope of what the patch is intended to address (thp 
allocations).

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Mike, you eluded that you may want to opt hugetlbfs out of this for the
 time being in https://marc.info/?l=linux-kernel&m=156771690024533 --
 not sure if you want to allow this excessive amount of reclaim for 
 hugetlb allocations or not given the swap storms Andrea has shown is
 possible (and nr_hugepages_mempolicy does exist), but hugetlbfs was not
 part of the problem we are trying to address here so no objection to
 opting it out.  

 You might want to consider how expensive hugetlb allocations can become
 and disruptive to the system if it does not yield additional hugepages,
 but that can be done at any time later as a general improvement rather
 than part of a series aimed at thp.

 mm/page_alloc.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Linus Torvalds Oct. 2, 2019, 11:37 p.m. UTC | #1
On Wed, Oct 2, 2019 at 4:03 PM David Rientjes <rientjes@google.com> wrote:
>
> Since hugetlb allocations have explicitly preferred to loop and do reclaim
> and compaction, exempt them from this new behavior at least for the time
> being.  It is not shown that hugetlb allocation success rate has been
> impacted by commit b39d0ee2632d but hugetlb allocations are admittedly
> beyond the scope of what the patch is intended to address (thp
> allocations).

I'd like to see some numbers to show that this special case makes sense.

I understand the "this is what it used to do, and hugetlbfs wasn't the
intended recipient of the new semantics", and I don't think the patch
is wrong.

But at the same time, we do know that swap storms happen for other
loads, and if we say "hugetlbfs is different" then there should at
least be some rationale for why it's different other than "history".
Some actual "yes, we _want_ the possibile swap storms, because load
XYZ".

And I don't mean microbenchmark numbers for "look, behavior changed".
I mean "look, this is a real load, and now it runs X% slower because
it relied on this hugetlbfs behavior".

               Linus
Michal Hocko Oct. 3, 2019, 5 a.m. UTC | #2
On Wed 02-10-19 16:37:57, Linus Torvalds wrote:
> On Wed, Oct 2, 2019 at 4:03 PM David Rientjes <rientjes@google.com> wrote:
> >
> > Since hugetlb allocations have explicitly preferred to loop and do reclaim
> > and compaction, exempt them from this new behavior at least for the time
> > being.  It is not shown that hugetlb allocation success rate has been
> > impacted by commit b39d0ee2632d but hugetlb allocations are admittedly
> > beyond the scope of what the patch is intended to address (thp
> > allocations).
> 
> I'd like to see some numbers to show that this special case makes sense.

http://lkml.kernel.org/r/20191001054343.GA15624@dhcp22.suse.cz
While the test is somehow artificial it is not too much different from
real workloads which do preallocate a non trivial (50% in my case) of
memory for hugetlb pages. Having a moderately utilized memory (by page
cache in my case) is not really unexpected.

> I understand the "this is what it used to do, and hugetlbfs wasn't the
> intended recipient of the new semantics", and I don't think the patch
> is wrong.

This is not only about this used to work. It is an expected and
documented semantic of __GFP_RETRY_MAYFAIL

 * %__GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim
 * procedures that have previously failed if there is some indication
 * that progress has been made else where.  It can wait for other
 * tasks to attempt high level approaches to freeing memory such as
 * compaction (which removes fragmentation) and page-out.
 * There is still a definite limit to the number of retries, but it is
 * a larger limit than with %__GFP_NORETRY.
 * Allocations with this flag may fail, but only when there is
 * genuinely little unused memory. While these allocations do not
 * directly trigger the OOM killer, their failure indicates that
 * the system is likely to need to use the OOM killer soon.  The
 * caller must handle failure, but can reasonably do so by failing
 * a higher-level request, or completing it only in a much less
 * efficient manner.
 
> But at the same time, we do know that swap storms happen for other
> loads, and if we say "hugetlbfs is different" then there should at
> least be some rationale for why it's different other than "history".
> Some actual "yes, we _want_ the possibile swap storms, because load
> XYZ".
> 
> And I don't mean microbenchmark numbers for "look, behavior changed".
> I mean "look, this is a real load, and now it runs X% slower because
> it relied on this hugetlbfs behavior".

It is not about running slower. It is about not getting the expected
amount of hugetlb pages requested by admin who knows that that size is
needed.
Michal Hocko Oct. 3, 2019, 5:27 a.m. UTC | #3
On Wed 02-10-19 16:03:03, David Rientjes wrote:
> Hugetlb allocations use __GFP_RETRY_MAYFAIL to aggressively attempt to get 
> hugepages that the user needs.  Commit b39d0ee2632d ("mm, page_alloc: 
> avoid expensive reclaim when compaction may not succeed") intends to 
> improve allocator behind for thp allocations to prevent excessive amounts 
> of reclaim especially when constrained to a single node.
> 
> Since hugetlb allocations have explicitly preferred to loop and do reclaim 
> and compaction, exempt them from this new behavior at least for the time 
> being.  It is not shown that hugetlb allocation success rate has been 
> impacted by commit b39d0ee2632d but hugetlb allocations are admittedly 
> beyond the scope of what the patch is intended to address (thp 
> allocations).

It has become pretty clear that b39d0ee2632d has regressed hugetlb
allocation success rate for any non-trivial case (complately free
memory) http://lkml.kernel.org/r/20191001054343.GA15624@dhcp22.suse.cz.
And this really is not just about hugetlb requests, really. They are
likely the most obvious example but __GFP_RETRY_MAYFAIL in general is
supposed to try as hard as feasible to success the allocation. The
decision to bail out is done at a different spot and b39d0ee2632d is
effectively bypassing that logic.

Now to the patch itself. I didn't get to test it on my testing
workload but hey steps are clearly documented and easily to set up and
reproduce. I am at a training for today and unlikely to get to test by
the end of the week infortunatelly. Anyway the patch should be fixing
the problem because it explicitly opts out for __GFP_RETRY_MAYFAIL.

I am pretty sure we will need more follow ups because the bail out logic
is simply behaving quite randomly as my measurements show (I would really
appreciate a feedback there). We need a more systematic solution because
the current logic has been rushed through without a proper analysis and
without any actual workloads to verify the effect.

> Cc: Mike Kravetz <mike.kravetz@oracle.com>
Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")

> Signed-off-by: David Rientjes <rientjes@google.com>

I am willing to give my ack by considering that this is a clear
regression and this is probably the simplest fix but the changelog
should be explicit about the effect (feel free to borrow my numbers and
explanation in this thread).

> ---
>  Mike, you eluded that you may want to opt hugetlbfs out of this for the
>  time being in https://marc.info/?l=linux-kernel&m=156771690024533 --
>  not sure if you want to allow this excessive amount of reclaim for 
>  hugetlb allocations or not given the swap storms Andrea has shown is
>  possible (and nr_hugepages_mempolicy does exist), but hugetlbfs was not
>  part of the problem we are trying to address here so no objection to
>  opting it out.  
> 
>  You might want to consider how expensive hugetlb allocations can become
>  and disruptive to the system if it does not yield additional hugepages,
>  but that can be done at any time later as a general improvement rather
>  than part of a series aimed at thp.
> 
>  mm/page_alloc.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4467,12 +4467,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		if (page)
>  			goto got_pg;
>  
> -		 if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
> +		 if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
> +		     !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
>  			/*
>  			 * If allocating entire pageblock(s) and compaction
>  			 * failed because all zones are below low watermarks
>  			 * or is prohibited because it recently failed at this
> -			 * order, fail immediately.
> +			 * order, fail immediately unless the allocator has
> +			 * requested compaction and reclaim retry.
>  			 *
>  			 * Reclaim is
>  			 *  - potentially very expensive because zones are far
Vlastimil Babka Oct. 3, 2019, 8:14 a.m. UTC | #4
On 10/3/19 1:03 AM, David Rientjes wrote:
> Hugetlb allocations use __GFP_RETRY_MAYFAIL to aggressively attempt to get 
> hugepages that the user needs.  Commit b39d0ee2632d ("mm, page_alloc: 
> avoid expensive reclaim when compaction may not succeed") intends to 
> improve allocator behind for thp allocations to prevent excessive amounts 
> of reclaim especially when constrained to a single node.
> 
> Since hugetlb allocations have explicitly preferred to loop and do reclaim 
> and compaction, exempt them from this new behavior at least for the time 
> being.  It is not shown that hugetlb allocation success rate has been 
> impacted by commit b39d0ee2632d but hugetlb allocations are admittedly 
> beyond the scope of what the patch is intended to address (thp 
> allocations).
> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Mike, you eluded that you may want to opt hugetlbfs out of this for the
>  time being in https://marc.info/?l=linux-kernel&m=156771690024533 --

I think the key differences between Mike's tests and Michal's is this part
from Mike's mail linked above:

"I 'tested' by simply creating some background activity and then seeing
how many hugetlb pages could be allocated. Of course, many tries over
time in a loop."

- "some background activity" might be different than Michal's pre-filling
  of the memory with (clean) page cache
- "many tries over time in a loop" could mean that kswapd has time to 
  reclaim and eventually the new condition for pageblock order will pass
  every few retries, because there's enough memory for compaction and it
  won't return COMPACT_SKIPPED

>  not sure if you want to allow this excessive amount of reclaim for 
>  hugetlb allocations or not given the swap storms Andrea has shown is

More precisely this is about hugetlb reservations by admin, not allocations
by the program. It's when admin uses the appropriate sysctl to say how many
hugetlb pages to reserve. In that case they expect that memory will be
reclaimed as needed. I don't think we should complicate the admin action
by requiring e.g. a sync+drop_caches before that, or retrying in the loop.
It's a one time action, not a continuous swap storm by a stream of THP
allocations.

>  possible (and nr_hugepages_mempolicy does exist), but hugetlbfs was not
>  part of the problem we are trying to address here so no objection to
>  opting it out.  
> 
>  You might want to consider how expensive hugetlb allocations can become
>  and disruptive to the system if it does not yield additional hugepages,

Yes, there have been recent issues with the action not terminating properly
in the case there's nothing more to reclaim (i.e. admin asking for an unrealistic
number of hugetlb pages), but that has been addressed (IIRC already merged
from mmotm to 5.4-rc1). It was actually an improvement to the reclaim/compaction
feedback that everybody asks for, although the result is obviously still
not perfect.
David Rientjes Oct. 3, 2019, 7:52 p.m. UTC | #5
On Thu, 3 Oct 2019, Vlastimil Babka wrote:

> I think the key differences between Mike's tests and Michal's is this part
> from Mike's mail linked above:
> 
> "I 'tested' by simply creating some background activity and then seeing
> how many hugetlb pages could be allocated. Of course, many tries over
> time in a loop."
> 
> - "some background activity" might be different than Michal's pre-filling
>   of the memory with (clean) page cache
> - "many tries over time in a loop" could mean that kswapd has time to 
>   reclaim and eventually the new condition for pageblock order will pass
>   every few retries, because there's enough memory for compaction and it
>   won't return COMPACT_SKIPPED
> 

I'll rely on Mike, the hugetlb maintainer, to assess the trade-off between 
the potential for encountering very expensive reclaim as Andrea did and 
the possibility of being able to allocate additional hugetlb pages at 
runtime if we did that expensive reclaim.

For parity with previous kernels it seems reasonable to ask that this 
remains unchanged since allocating large amounts of hugetlb pages has 
different latency expectations than during page fault.  This patch is 
available if he'd prefer to go that route.

On the other hand, userspace could achieve similar results if it were to 
use vm.drop_caches and explicitly triggered compaction through either 
procfs or sysfs before writing to vm.nr_hugepages, and that would be much 
faster because it would be done in one go.  Users who allocate through the 
kernel command line would obviously be unaffected.

Commit b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when 
compaction may not succeed") was written with the latter in mind.  Mike 
subsequently requested that hugetlb not be impacted at least provisionally 
until it could be further assessed.

I'd suggest that latter: let the user initiate expensive reclaim and/or 
compaction when tuning vm.nr_hugepages and leave no surprises for users 
using hugetlb overcommit, but I wouldn't argue against either approach, he 
knows the users and expectations of hugetlb far better than I do.
Michal Hocko Oct. 4, 2019, 9:28 a.m. UTC | #6
On Thu 03-10-19 12:52:33, David Rientjes wrote:
> On Thu, 3 Oct 2019, Vlastimil Babka wrote:
> 
> > I think the key differences between Mike's tests and Michal's is this part
> > from Mike's mail linked above:
> > 
> > "I 'tested' by simply creating some background activity and then seeing
> > how many hugetlb pages could be allocated. Of course, many tries over
> > time in a loop."
> > 
> > - "some background activity" might be different than Michal's pre-filling
> >   of the memory with (clean) page cache
> > - "many tries over time in a loop" could mean that kswapd has time to 
> >   reclaim and eventually the new condition for pageblock order will pass
> >   every few retries, because there's enough memory for compaction and it
> >   won't return COMPACT_SKIPPED
> > 
> 
> I'll rely on Mike, the hugetlb maintainer, to assess the trade-off between 
> the potential for encountering very expensive reclaim as Andrea did and 
> the possibility of being able to allocate additional hugetlb pages at 
> runtime if we did that expensive reclaim.

That tradeoff has been expressed by __GFP_RETRY_MAYFAIL which got broken
by b39d0ee2632d.

> For parity with previous kernels it seems reasonable to ask that this 
> remains unchanged since allocating large amounts of hugetlb pages has 
> different latency expectations than during page fault.  This patch is 
> available if he'd prefer to go that route.
> 
> On the other hand, userspace could achieve similar results if it were to 
> use vm.drop_caches and explicitly triggered compaction through either 
> procfs or sysfs before writing to vm.nr_hugepages, and that would be much 
> faster because it would be done in one go.  Users who allocate through the 
> kernel command line would obviously be unaffected.

Requesting the userspace to drop _all_ page cache in order allocate a
number of hugetlb pages or any other affected __GFP_RETRY_MAYFAIL
requests is simply not reasonable IMHO.
David Rientjes Oct. 4, 2019, 6:02 p.m. UTC | #7
On Fri, 4 Oct 2019, Michal Hocko wrote:

> Requesting the userspace to drop _all_ page cache in order allocate a
> number of hugetlb pages or any other affected __GFP_RETRY_MAYFAIL
> requests is simply not reasonable IMHO.

It can be used as a fallback when writing to nr_hugepages and the amount 
allocated did not match expectation.  Again, I'll defer all of this to 
Mike when he returns: he expressed his preference, I suggested an 
alternative to consider, and he can make the decision to ack or nack this 
patch because he has a better understanding of that expectation from users 
who use hugetlb pages.
Mike Kravetz Oct. 7, 2019, 10:15 p.m. UTC | #8
On 10/4/19 11:02 AM, David Rientjes wrote:
> On Fri, 4 Oct 2019, Michal Hocko wrote:
> 
>> Requesting the userspace to drop _all_ page cache in order allocate a
>> number of hugetlb pages or any other affected __GFP_RETRY_MAYFAIL
>> requests is simply not reasonable IMHO.
> 
> It can be used as a fallback when writing to nr_hugepages and the amount 
> allocated did not match expectation.  Again, I'll defer all of this to 
> Mike when he returns: he expressed his preference, I suggested an 
> alternative to consider, and he can make the decision to ack or nack this 
> patch because he has a better understanding of that expectation from users 
> who use hugetlb pages.

I believe these modifications to commit b39d0ee2632d are absolutely necessary
to maintain expected hugetlbfs functionality.  Michal's simple test in the
rewritten commit message shows the type of regressions that I expect some
hugetlbfs users to experience.  The expectation today is that the kernel will
try hard to allocate the requested number of hugetlb pages.  These pages are
often used for very long running processes.  Therefore, the tradeoff of more
reclaim (and compaction) activity up front to create the pages is generally
acceptable.

My apologies if the 'testing' I did in [1] was taken as an endorsement of
b39d0ee2632d working well with hugetlbfs.  It was a limited test that I knew
did not cover all cases.  Therefore, I suggested that if b39d0ee2632d went
forward there should be an exception for __GFP_RETRY_MAYFAIL requests.

[1] https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@oracle.com

Patch
diff mbox series

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4467,12 +4467,14 @@  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		if (page)
 			goto got_pg;
 
-		 if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
+		 if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
+		     !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
 			/*
 			 * If allocating entire pageblock(s) and compaction
 			 * failed because all zones are below low watermarks
 			 * or is prohibited because it recently failed at this
-			 * order, fail immediately.
+			 * order, fail immediately unless the allocator has
+			 * requested compaction and reclaim retry.
 			 *
 			 * Reclaim is
 			 *  - potentially very expensive because zones are far