[RESEND,0/8] hugetlb: add demote/split page functionality

Message ID 20210816224953.157796-1-mike.kravetz@oracle.com (mailing list archive)
Series hugetlb: add demote/split page functionality

Message

Mike Kravetz Aug. 16, 2021, 10:49 p.m. UTC
This is a resend of PATCHes sent here [4].  There was some discussion
and interest when the RFC [5] was sent, but little after that.  The
resend is just a rebase of [4] to next-20210816 with a few typos in
commit messages fixed.

Original Cover Letter
---------------------
The concurrent use of multiple hugetlb page sizes on a single system
is becoming more common.  One of the reasons is better TLB support for
gigantic page sizes on x86 hardware.  In addition, hugetlb pages are
being used to back VMs in hosting environments.

When using hugetlb pages to back VMs in such environments, it is
sometimes desirable to preallocate hugetlb pools.  This avoids the delay
and uncertainty of allocating hugetlb pages at VM startup.  In addition,
preallocating huge pages minimizes the issue of memory fragmentation that
increases the longer the system is up and running.

In such environments, a combination of larger and smaller hugetlb pages
are preallocated in anticipation of backing VMs of various sizes.  Over
time, the preallocated pool of smaller hugetlb pages may become
depleted while larger hugetlb pages still remain.  In such situations,
it may be desirable to convert larger hugetlb pages to smaller hugetlb
pages.

Converting larger to smaller hugetlb pages can be accomplished today by
first freeing the larger page to the buddy allocator and then allocating
the smaller pages.  However, there are two issues with this approach:
1) This process can take quite some time, especially if allocation of
   the smaller pages is not immediate and requires migration/compaction.
2) There is no guarantee that the total size of smaller pages allocated
   will match the size of the larger page which was freed.  This is
   because the area freed by the larger page could quickly be
   fragmented.

To address these issues, introduce the concept of hugetlb page demotion.
Demotion provides a means of 'in place' splitting a hugetlb page to
pages of a smaller size.  For example, on x86 one 1G page can be
demoted to 512 2M pages.  Page demotion is controlled via sysfs files.
- demote_size   Read-only: the target page size for demotion
- demote        Writable: the number of hugetlb pages to be demoted

Only hugetlb pages which are free at the time of the request can be demoted.
Demotion does not add to the complexity of surplus pages.  Demotion also honors
reserved huge pages.  Therefore, when a value is written to the sysfs demote
file, that value is only the maximum number of pages which will be demoted.
It is possible fewer will actually be demoted.

If demote_size is PAGESIZE, demote will simply free pages to the buddy
allocator.
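
For illustration, demoting pages is then just a write to the demote file of
the source pool.  A minimal userspace sketch (the path and count are only
examples; error handling kept to a minimum):

/* Sketch only: ask the kernel to demote up to 5 free 1GB hugetlb pages
 * to that pool's demote_size (2MB on x86).  The value written is a
 * maximum; fewer pages may actually be demoted.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *path =
		"/sys/kernel/mm/hugepages/hugepages-1048576kB/demote";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open demote");
		return 1;
	}
	if (write(fd, "5", 1) != 1)
		perror("write demote");
	close(fd);
	return 0;
}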

Real world use cases
--------------------
There are groups today using hugetlb pages to back VMs on x86.  Their
use case is as described above.  They have experienced the issues with
performance and with not necessarily getting the expected number of smaller
huge pages after a free/allocate cycle.

Note to reviewers
-----------------
Patches 1-5 provide the basic demote functionality.  They are built on
	next-20210721.
Patch 3 deals with the issue of speculative page references as
	discussed in [1] and [2].  It builds on the ideas used in
	patches currently in mmotm.  There have been few comments on
	those patches in mmotm, so I do not feel the approach has been
	well vetted.
Patches 6-8 are an optimization to deal with vmemmap optimized pages.
	This was discussed in the RFC.  IMO, the code may not be worth
	the benefit.  They could be dropped with no loss of
	functionality.  In addition, Muchun has recently sent patches to
	further optimize hugetlb vmemmap reduction by only requiring one
	vmemmap page per huge page [3].  These patches do not take Muchun's
	new patches into account.

[1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/20210710002441.167759-1-mike.kravetz@oracle.com/
[3] https://lore.kernel.org/linux-mm/20210714091800.42645-1-songmuchun@bytedance.com/
[4] https://lore.kernel.org/linux-mm/20210721230511.201823-1-mike.kravetz@oracle.com/
[5] https://lore.kernel.org/linux-mm/20210309001855.142453-1-mike.kravetz@oracle.com/

v1 -> RESEND
  - Rebase on next-20210816
  - Fix a few typos in commit messages
RFC -> v1
  - Provides basic support for vmemmap optimized pages
  - Takes speculative page references into account
  - Updated Documentation file
  - Added optimizations for vmemmap optimized pages

Mike Kravetz (8):
  hugetlb: add demote hugetlb page sysfs interfaces
  hugetlb: add HPageCma flag and code to free non-gigantic pages in CMA
  hugetlb: add demote bool to gigantic page routines
  hugetlb: add hugetlb demote page support
  hugetlb: document the demote sysfs interfaces
  hugetlb: vmemmap optimizations when demoting hugetlb pages
  hugetlb: prepare destroy and prep routines for vmemmap optimized pages
  hugetlb: Optimized demote vmemmap optimizatized pages

 Documentation/admin-guide/mm/hugetlbpage.rst |  29 +-
 include/linux/hugetlb.h                      |   8 +
 include/linux/mm.h                           |   4 +
 mm/hugetlb.c                                 | 327 +++++++++++++++++--
 mm/hugetlb_vmemmap.c                         |  72 +++-
 mm/hugetlb_vmemmap.h                         |  16 +
 mm/sparse-vmemmap.c                          | 123 ++++++-
 7 files changed, 538 insertions(+), 41 deletions(-)

Comments

Andrew Morton Aug. 16, 2021, 11:23 p.m. UTC | #1
On Mon, 16 Aug 2021 15:49:45 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> This is a resend of PATCHes sent here [4].  There was some discussion
> and interest when the RFC [5] was sent, but little after that.  The
> resend is just a rebase of [4] to next-20210816 with a few typos in
> commmit messages fixed.
> 
> Original Cover Letter
> ---------------------
> The concurrent use of multiple hugetlb page sizes on a single system
> is becoming more common.  One of the reasons is better TLB support for
> gigantic page sizes on x86 hardware.  In addition, hugetlb pages are
> being used to back VMs in hosting environments.
> 
> When using hugetlb pages to back VMs in such environments, it is
> sometimes desirable to preallocate hugetlb pools.  This avoids the delay
> and uncertainty of allocating hugetlb pages at VM startup.  In addition,
> preallocating huge pages minimizes the issue of memory fragmentation that
> increases the longer the system is up and running.
> 
> In such environments, a combination of larger and smaller hugetlb pages
> are preallocated in anticipation of backing VMs of various sizes.  Over
> time, the preallocated pool of smaller hugetlb pages may become
> depleted while larger hugetlb pages still remain.  In such situations,
> it may be desirable to convert larger hugetlb pages to smaller hugetlb
> pages.
> 
> Converting larger to smaller hugetlb pages can be accomplished today by
> first freeing the larger page to the buddy allocator and then allocating
> the smaller pages.  However, there are two issues with this approach:
> 1) This process can take quite some time, especially if allocation of
>    the smaller pages is not immediate and requires migration/compaction.
> 2) There is no guarantee that the total size of smaller pages allocated
>    will match the size of the larger page which was freed.  This is
>    because the area freed by the larger page could quickly be
>    fragmented.
> 
> To address these issues, introduce the concept of hugetlb page demotion.
> Demotion provides a means of 'in place' splitting a hugetlb page to
> pages of a smaller size.  For example, on x86 one 1G page can be
> demoted to 512 2M pages.  Page demotion is controlled via sysfs files.
> - demote_size   Read only target page size for demotion

Should this be "write only"?  If not, I'm confused.

If "yes" then "write only" would be a misnomer - clearly this file is
readable (looks at demote_size_show()).

> - demote        Writable number of hugetlb pages to be demoted

So how does this interface work?  Write the target size to
`demote_size', write the number of to-be-demoted larger pages to
`demote' and then the operation happens?

If so, how does one select which size pages should be selected for
the demotion?

And how does one know the operation has completed so the sysfs files
can be reloaded for another operation?

> Only hugetlb pages which are free at the time of the request can be demoted.
> Demotion does not add to the complexity surplus pages.  Demotion also honors
> reserved huge pages.  Therefore, when a value is written to the sysfs demote
> file, that value is only the maximum number of pages which will be demoted.
> It is possible fewer will actually be demoted.
> 
> If demote_size is PAGESIZE, demote will simply free pages to the buddy
> allocator.
> 
> Real world use cases
> --------------------
> There are groups today using hugetlb pages to back VMs on x86.  Their
> use case is as described above.  They have experienced the issues with
> performance and not necessarily getting the excepted number smaller huge

("expected")

> pages after free/allocate cycle.
> 

It seems odd to add the interfaces in patch 1 then document them in
patch 5.  Why not add-and-document in a single patch?
Andrew Morton Aug. 16, 2021, 11:27 p.m. UTC | #2
Also, pushback...

On Mon, 16 Aug 2021 15:49:45 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> 
> Real world use cases
> --------------------
> There are groups today using hugetlb pages to back VMs on x86.  Their
> use case is as described above.  They have experienced the issues with
> performance and not necessarily getting the excepted number smaller huge

("number of")

> pages after free/allocate cycle.
> 

It really is a ton of new code.  I think we're owed much more detail
about the problem than the above.  To be confident that all this
material is truly justified?

Also, some selftests and benchmarking code in tools/testing/selftests
would be helpful?
Mike Kravetz Aug. 17, 2021, 12:17 a.m. UTC | #3
On 8/16/21 4:23 PM, Andrew Morton wrote:
> On Mon, 16 Aug 2021 15:49:45 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>> This is a resend of PATCHes sent here [4].  There was some discussion
>> and interest when the RFC [5] was sent, but little after that.  The
>> resend is just a rebase of [4] to next-20210816 with a few typos in
>> commmit messages fixed.
>>
>> Original Cover Letter
>> ---------------------
>> The concurrent use of multiple hugetlb page sizes on a single system
>> is becoming more common.  One of the reasons is better TLB support for
>> gigantic page sizes on x86 hardware.  In addition, hugetlb pages are
>> being used to back VMs in hosting environments.
>>
>> When using hugetlb pages to back VMs in such environments, it is
>> sometimes desirable to preallocate hugetlb pools.  This avoids the delay
>> and uncertainty of allocating hugetlb pages at VM startup.  In addition,
>> preallocating huge pages minimizes the issue of memory fragmentation that
>> increases the longer the system is up and running.
>>
>> In such environments, a combination of larger and smaller hugetlb pages
>> are preallocated in anticipation of backing VMs of various sizes.  Over
>> time, the preallocated pool of smaller hugetlb pages may become
>> depleted while larger hugetlb pages still remain.  In such situations,
>> it may be desirable to convert larger hugetlb pages to smaller hugetlb
>> pages.
>>
>> Converting larger to smaller hugetlb pages can be accomplished today by
>> first freeing the larger page to the buddy allocator and then allocating
>> the smaller pages.  However, there are two issues with this approach:
>> 1) This process can take quite some time, especially if allocation of
>>    the smaller pages is not immediate and requires migration/compaction.
>> 2) There is no guarantee that the total size of smaller pages allocated
>>    will match the size of the larger page which was freed.  This is
>>    because the area freed by the larger page could quickly be
>>    fragmented.
>>
>> To address these issues, introduce the concept of hugetlb page demotion.
>> Demotion provides a means of 'in place' splitting a hugetlb page to
>> pages of a smaller size.  For example, on x86 one 1G page can be
>> demoted to 512 2M pages.  Page demotion is controlled via sysfs files.
>> - demote_size   Read only target page size for demotion
> 
> Should this be "write only"?  If not, I'm confused.
> 
> If "yes" then "write only" would be a misnomer - clearly this file is
> readable (looks at demote_size_show()).
> 

It is read only and is there mostly as information for the user.  When
they demote a page, this is the size to which the page will be demoted.

For example,
# pwd
/sys/kernel/mm/hugepages/hugepages-1048576kB
# cat demote_size
2048kB
# pwd
/sys/kernel/mm/hugepages/hugepages-2048kB
# cat demote_size
4kB

The "demote size" is not user configurable.  Although, that is
something brought up by Oscar previously.  I did not directly address
this in the RFC.  My bad.  However, I do not like the idea of making
demote_size writable/selectable.  My concern would be someone changing
the value and not resetting it.  It certainly is something that can be done
with minor code changes.

>> - demote        Writable number of hugetlb pages to be demoted
> 
> So how does this interface work?  Write the target size to
> `demote_size', write the number of to-be-demoted larger pages to
> `demote' and then the operation happens?
> 
> If so, how does one select which size pages should be selected for
> the demotion?

The location in the sysfs directory tells you what size pages will be
demoted.  For example,

echo 5 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote

says to demote 5 1GB pages.

demote files are also in node specific directories so you can even pick
huge pages from a specific node.

echo 5 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/demote

> 
> And how does one know the operation has completed so the sysfs files
> can be reloaded for another operation?
> 

When the write to the file is complete, the operation has completed.
Not exactly sure what you mean by reloading the sysfs files for
another operation?

>> Only hugetlb pages which are free at the time of the request can be demoted.
>> Demotion does not add to the complexity surplus pages.  Demotion also honors
>> reserved huge pages.  Therefore, when a value is written to the sysfs demote
>> file, that value is only the maximum number of pages which will be demoted.
>> It is possible fewer will actually be demoted.
>>
>> If demote_size is PAGESIZE, demote will simply free pages to the buddy
>> allocator.
>>
>> Real world use cases
>> --------------------
>> There are groups today using hugetlb pages to back VMs on x86.  Their
>> use case is as described above.  They have experienced the issues with
>> performance and not necessarily getting the excepted number smaller huge
> 
> ("expected")

yes, will fix typo

> 
>> pages after free/allocate cycle.
>>
> 
> It seems odd to add the interfaces in patch 1 then document them in
> patch 5.  Why not add-and-document in a single patch?
> 

Yes, makes sense.  Will combine these.
Andrew Morton Aug. 17, 2021, 12:39 a.m. UTC | #4
On Mon, 16 Aug 2021 17:17:50 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> > 
> > And how does one know the operation has completed so the sysfs files
> > can be reloaded for another operation?
> > 
> 
> When the write to the file is complete, the operation has completed.
> Not exactly sure what you mean by reloading the sysfs files for
> another operation?

If userspace wishes to perform another demote operation, it must wait
for the preceding one to complete.

Presumably if thread A is blocked in a write to `demote' and thread B
concurrently tries to perform a demotion, thread B will be blocked as
well?  Was this tested?

Lots of things are to be added to changelogs & documentation, methinks.
Mike Kravetz Aug. 17, 2021, 12:46 a.m. UTC | #5
On 8/16/21 4:27 PM, Andrew Morton wrote:
> Also, pushback...

That is welcome.  I only have the one specific use case mentioned here.

> 
> On Mon, 16 Aug 2021 15:49:45 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>>
>> Real world use cases
>> --------------------
>> There are groups today using hugetlb pages to back VMs on x86.  Their
>> use case is as described above.  They have experienced the issues with
>> performance and not necessarily getting the excepted number smaller huge
> 
> ("number of")

thanks, another typo to fix.

> 
>> pages after free/allocate cycle.
>>
> 
> It really is a ton of new code.  I think we're owed much more detail
> about the problem than the above.  To be confident that all this
> material is truly justified?

The desired functionality for this specific use case is to simply
convert a 1G hugetlb page to 512 2MB hugetlb pages.  As mentioned

"Converting larger to smaller hugetlb pages can be accomplished today by
 first freeing the larger page to the buddy allocator and then allocating
 the smaller pages.  However, there are two issues with this approach:
 1) This process can take quite some time, especially if allocation of
    the smaller pages is not immediate and requires migration/compaction.
 2) There is no guarantee that the total size of smaller pages allocated
    will match the size of the larger page which was freed.  This is
    because the area freed by the larger page could quickly be
    fragmented."

These two issues have been experienced in practice.

A big chunk of the code changes (approx 50%) is for the vmemmap
optimizations.  This is also the most complex part of the changes.
I added the code because interaction with vmemmap reduction was discussed
during the RFC.  It is only a performance enhancement and honestly
may not be worth the cost/risk.  I will get some numbers to measure
the actual benefit.

> 
> Also, some selftests and benchmarking code in tools/testing/selftests
> would be helpful?
> 

Will do.
Mike Kravetz Aug. 17, 2021, 12:58 a.m. UTC | #6
On 8/16/21 5:39 PM, Andrew Morton wrote:
> On Mon, 16 Aug 2021 17:17:50 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>>>
>>> And how does one know the operation has completed so the sysfs files
>>> can be reloaded for another operation?
>>>
>>
>> When the write to the file is complete, the operation has completed.
>> Not exactly sure what you mean by reloading the sysfs files for
>> another operation?
> 
> If userspace wishes to perform another demote operation, it must wait
> for the preceding one to complete.
> 
> Presumably if thread A is blocked in a write to `demote' and thread B
> concurrently tries to perform a demotion, thread B will be blocked as
> well?  Was this tested?

I must admit that I did not specifically test this.  However, the
patch series to make freeing of hugetlb pages IRQ safe added a
(per-hstate) mutex that is taken when sysfs files modify the number of
huge pages.  demote writes take this mutex (patch 1).  That synchronizes
not only concurrent writes to demote but also concurrent modifications
of nr_hugepages.
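
As a rough sketch of the pattern described (illustrative only, not the
actual patch 1 code; resize_lock is the per-hstate mutex added by that
series):

static ssize_t demote_store(struct kobject *kobj,
			    struct kobj_attribute *attr,
			    const char *buf, size_t len)
{
	struct hstate *h = kobj_to_hstate(kobj, NULL);
	unsigned long nr_demote;
	int err;

	err = kstrtoul(buf, 10, &nr_demote);
	if (err)
		return err;

	/*
	 * Same per-hstate mutex taken by the nr_hugepages write paths,
	 * so concurrent demote writes and pool resizes are serialized.
	 */
	mutex_lock(&h->resize_lock);
	/* ... demote up to nr_demote free huge pages here ... */
	mutex_unlock(&h->resize_lock);

	return len;
}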

> 
> Lots of things are to be added to changelogs & documentation, methinks.
> 

OK.

Thank you for taking a look at this!
Andrew Morton Aug. 17, 2021, 1:46 a.m. UTC | #7
On Mon, 16 Aug 2021 17:46:58 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> > It really is a ton of new code.  I think we're owed much more detail
> > about the problem than the above.  To be confident that all this
> > material is truly justified?
> 
> The desired functionality for this specific use case is to simply
> convert a 1G huegtlb page to 512 2MB hugetlb pages.  As mentioned
> 
> "Converting larger to smaller hugetlb pages can be accomplished today by
>  first freeing the larger page to the buddy allocator and then allocating
>  the smaller pages.  However, there are two issues with this approach:
>  1) This process can take quite some time, especially if allocation of
>     the smaller pages is not immediate and requires migration/compaction.
>  2) There is no guarantee that the total size of smaller pages allocated
>     will match the size of the larger page which was freed.  This is
>     because the area freed by the larger page could quickly be
>     fragmented."
> 
> These two issues have been experienced in practice.

Well the first issue is quantifiable.  What is "some time"?  If it's
people trying to get a 5% speedup on a rare operation because hey,
bugging the kernel developers doesn't cost me anything then perhaps we
have better things to be doing.

And the second problem would benefit from some words to help us
understand how much real-world hurt this causes, and how frequently.
And let's understand what the userspace workarounds look like, etc.

> A big chunk of the code changes (aprox 50%) is for the vmemmap
> optimizations.  This is also the most complex part of the changes.
> I added the code as interaction with vmemmap reduction was discussed
> during the RFC.  It is only a performance enhancement and honestly
> may not be worth the cost/risk.  I will get some numbers to measure
> the actual benefit.

Cool.
David Hildenbrand Aug. 17, 2021, 7:30 a.m. UTC | #8
On 17.08.21 03:46, Andrew Morton wrote:
> On Mon, 16 Aug 2021 17:46:58 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>>> It really is a ton of new code.  I think we're owed much more detail
>>> about the problem than the above.  To be confident that all this
>>> material is truly justified?
>>
>> The desired functionality for this specific use case is to simply
>> convert a 1G huegtlb page to 512 2MB hugetlb pages.  As mentioned
>>
>> "Converting larger to smaller hugetlb pages can be accomplished today by
>>   first freeing the larger page to the buddy allocator and then allocating
>>   the smaller pages.  However, there are two issues with this approach:
>>   1) This process can take quite some time, especially if allocation of
>>      the smaller pages is not immediate and requires migration/compaction.
>>   2) There is no guarantee that the total size of smaller pages allocated
>>      will match the size of the larger page which was freed.  This is
>>      because the area freed by the larger page could quickly be
>>      fragmented."
>>
>> These two issues have been experienced in practice.
> 
> Well the first issue is quantifiable.  What is "some time"?  If it's
> people trying to get a 5% speedup on a rare operation because hey,
> bugging the kernel developers doesn't cost me anything then perhaps we
> have better things to be doing.
> 
> And the second problem would benefit from some words to help us
> understand how much real-world hurt this causes, and how frequently.
> And let's understand what the userspace workarounds look like, etc.
> 
>> A big chunk of the code changes (aprox 50%) is for the vmemmap
>> optimizations.  This is also the most complex part of the changes.
>> I added the code as interaction with vmemmap reduction was discussed
>> during the RFC.  It is only a performance enhancement and honestly
>> may not be worth the cost/risk.  I will get some numbers to measure
>> the actual benefit.

If it really makes that much of a difference code/complexity-wise, would
it make sense to just limit demote functionality to the !vmemmap case
for now?
Mike Kravetz Aug. 17, 2021, 4:19 p.m. UTC | #9
On 8/17/21 12:30 AM, David Hildenbrand wrote:
> On 17.08.21 03:46, Andrew Morton wrote:
>> On Mon, 16 Aug 2021 17:46:58 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>>>> It really is a ton of new code.  I think we're owed much more detail
>>>> about the problem than the above.  To be confident that all this
>>>> material is truly justified?
>>>
>>> The desired functionality for this specific use case is to simply
>>> convert a 1G huegtlb page to 512 2MB hugetlb pages.  As mentioned
>>>
>>> "Converting larger to smaller hugetlb pages can be accomplished today by
>>>   first freeing the larger page to the buddy allocator and then allocating
>>>   the smaller pages.  However, there are two issues with this approach:
>>>   1) This process can take quite some time, especially if allocation of
>>>      the smaller pages is not immediate and requires migration/compaction.
>>>   2) There is no guarantee that the total size of smaller pages allocated
>>>      will match the size of the larger page which was freed.  This is
>>>      because the area freed by the larger page could quickly be
>>>      fragmented."
>>>
>>> These two issues have been experienced in practice.
>>
>> Well the first issue is quantifiable.  What is "some time"?  If it's
>> people trying to get a 5% speedup on a rare operation because hey,
>> bugging the kernel developers doesn't cost me anything then perhaps we
>> have better things to be doing.
>>
>> And the second problem would benefit from some words to help us
>> understand how much real-world hurt this causes, and how frequently.
>> And let's understand what the userspace workarounds look like, etc.
>>
>>> A big chunk of the code changes (aprox 50%) is for the vmemmap
>>> optimizations.  This is also the most complex part of the changes.
>>> I added the code as interaction with vmemmap reduction was discussed
>>> during the RFC.  It is only a performance enhancement and honestly
>>> may not be worth the cost/risk.  I will get some numbers to measure
>>> the actual benefit.
> 
> If it really makes that much of a difference code/complexity wise, would it make sense to just limit denote functionality to the !vmemmap case for now?
> 

Handling vmemmap optimized huge pages is not that big of a deal.  We
just use the existing functionality to populate vmemmap for the page
being demoted, and free vmemmap for resulting pages of demoted size.

This obviously is not 'optimal' for demote as we will allocate more
vmemmap pages than needed and then free the excess pages.  The complex
part is not over-allocating vmemmap in the first place and only sparsely populating vmemmap
for the target pages of demote size.  This is all done in patches 6-8.
I am happy to drop these patches for now.  They are the most complex (and
ugly) of this series.  As mentioned, they do not provide any additional
functionality.
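
Roughly, the simple approach amounts to something like the following
sketch (conceptual only, not the actual patches 6-8; the split step is
elided and the helper names are those of the existing vmemmap code):

/* Conceptual sketch of the simple (non-optimal) approach: restore the
 * full vmemmap of the page being demoted, split it, then re-apply the
 * optimization to each resulting page of the destination size.
 */
static int demote_vmemmap_simple(struct hstate *src, struct hstate *dst,
				 struct page *page)
{
	unsigned long i;
	unsigned long nr_dst = pages_per_huge_page(src) /
			       pages_per_huge_page(dst);

	/* repopulate the vmemmap pages freed when 'page' was optimized */
	if (alloc_huge_page_vmemmap(src, page))
		return -ENOMEM;

	/* ... split 'page' into nr_dst pages of the destination size ... */

	/* free the excess vmemmap again, one destination page at a time */
	for (i = 0; i < nr_dst; i++)
		free_huge_page_vmemmap(dst,
				       page + i * pages_per_huge_page(dst));

	return 0;
}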
David Hildenbrand Aug. 17, 2021, 6:49 p.m. UTC | #10
On 17.08.21 18:19, Mike Kravetz wrote:
> On 8/17/21 12:30 AM, David Hildenbrand wrote:
>> On 17.08.21 03:46, Andrew Morton wrote:
>>> On Mon, 16 Aug 2021 17:46:58 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>
>>>>> It really is a ton of new code.  I think we're owed much more detail
>>>>> about the problem than the above.  To be confident that all this
>>>>> material is truly justified?
>>>>
>>>> The desired functionality for this specific use case is to simply
>>>> convert a 1G huegtlb page to 512 2MB hugetlb pages.  As mentioned
>>>>
>>>> "Converting larger to smaller hugetlb pages can be accomplished today by
>>>>    first freeing the larger page to the buddy allocator and then allocating
>>>>    the smaller pages.  However, there are two issues with this approach:
>>>>    1) This process can take quite some time, especially if allocation of
>>>>       the smaller pages is not immediate and requires migration/compaction.
>>>>    2) There is no guarantee that the total size of smaller pages allocated
>>>>       will match the size of the larger page which was freed.  This is
>>>>       because the area freed by the larger page could quickly be
>>>>       fragmented."
>>>>
>>>> These two issues have been experienced in practice.
>>>
>>> Well the first issue is quantifiable.  What is "some time"?  If it's
>>> people trying to get a 5% speedup on a rare operation because hey,
>>> bugging the kernel developers doesn't cost me anything then perhaps we
>>> have better things to be doing.
>>>
>>> And the second problem would benefit from some words to help us
>>> understand how much real-world hurt this causes, and how frequently.
>>> And let's understand what the userspace workarounds look like, etc.
>>>
>>>> A big chunk of the code changes (aprox 50%) is for the vmemmap
>>>> optimizations.  This is also the most complex part of the changes.
>>>> I added the code as interaction with vmemmap reduction was discussed
>>>> during the RFC.  It is only a performance enhancement and honestly
>>>> may not be worth the cost/risk.  I will get some numbers to measure
>>>> the actual benefit.
>>
>> If it really makes that much of a difference code/complexity wise, would it make sense to just limit denote functionality to the !vmemmap case for now?
>>
> 
> Handling vmemmap optimized huge pages is not that big of a deal.  We
> just use the existing functionality to populate vmemmap for the page
> being demoted, and free vmemmap for resulting pages of demoted size.
> 
> This obviously is not 'optimal' for demote as we will allocate more
> vmemmap pages than needed and then free the excess pages.  The complex
> part is not over allocating vmemmap and only sparsely populating vmemmap
> for the target pages of demote size.  This is all done in patches 6-8.
> I am happy to drop these patches for now.  The are the most complex (and
> ugly) of this series.  As mentioned, they do not provide any additional
> functionality.
> 

Just looking at the diffstat, that looks like a good idea to me :)
Mike Kravetz Aug. 24, 2021, 10:08 p.m. UTC | #11
Add Vlastimil and Hillf,

On 8/16/21 6:46 PM, Andrew Morton wrote:
> On Mon, 16 Aug 2021 17:46:58 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>>> It really is a ton of new code.  I think we're owed much more detail
>>> about the problem than the above.  To be confident that all this
>>> material is truly justified?
>>
>> The desired functionality for this specific use case is to simply
>> convert a 1G huegtlb page to 512 2MB hugetlb pages.  As mentioned
>>
>> "Converting larger to smaller hugetlb pages can be accomplished today by
>>  first freeing the larger page to the buddy allocator and then allocating
>>  the smaller pages.  However, there are two issues with this approach:
>>  1) This process can take quite some time, especially if allocation of
>>     the smaller pages is not immediate and requires migration/compaction.
>>  2) There is no guarantee that the total size of smaller pages allocated
>>     will match the size of the larger page which was freed.  This is
>>     because the area freed by the larger page could quickly be
>>     fragmented."
>>
>> These two issues have been experienced in practice.
> 
> Well the first issue is quantifiable.  What is "some time"?  If it's
> people trying to get a 5% speedup on a rare operation because hey,
> bugging the kernel developers doesn't cost me anything then perhaps we
> have better things to be doing.

Well, I set up a test environment on a larger system to get some
numbers.  My 'load' on the system was filling the page cache with
clean pages.  The thought is that these pages could easily be reclaimed.

When trying to get numbers I hit a hugetlb page allocation stall where
__alloc_pages(__GFP_RETRY_MAYFAIL, order 9) would stall forever (or at
least an hour).  It was very much like the symptoms addressed here:
https://lore.kernel.org/linux-mm/20190806014744.15446-1-mike.kravetz@oracle.com/

This was on 5.14.0-rc6-next-20210820.

I'll do some more digging as this appears to be some dark corner case of
reclaim and/or compaction.  The 'good news' is that I can reproduce
this.

> And the second problem would benefit from some words to help us
> understand how much real-world hurt this causes, and how frequently.
> And let's understand what the userspace workarounds look like, etc.

The stall above was from doing a simple 'free 1GB page' followed by
'allocate 512 2MB pages' from userspace.

Getting out another version of this series will be delayed, as I think
we need to address or understand this issue first.
Hillf Danton Aug. 26, 2021, 7:32 a.m. UTC | #12
On Tue, 24 Aug 2021 15:08:46 -0700 Mike Kravetz wrote:
>On 8/16/21 6:46 PM, Andrew Morton wrote:
>> On Mon, 16 Aug 2021 17:46:58 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
>> 
>>>> It really is a ton of new code.  I think we're owed much more detail
>>>> about the problem than the above.  To be confident that all this
>>>> material is truly justified?
>>>
>>> The desired functionality for this specific use case is to simply
>>> convert a 1G huegtlb page to 512 2MB hugetlb pages.  As mentioned
>>>
>>> "Converting larger to smaller hugetlb pages can be accomplished today by
>>>  first freeing the larger page to the buddy allocator and then allocating
>>>  the smaller pages.  However, there are two issues with this approach:
>>>  1) This process can take quite some time, especially if allocation of
>>>     the smaller pages is not immediate and requires migration/compaction.
>>>  2) There is no guarantee that the total size of smaller pages allocated
>>>     will match the size of the larger page which was freed.  This is
>>>     because the area freed by the larger page could quickly be
>>>     fragmented."
>>>
>>> These two issues have been experienced in practice.
>> 
>> Well the first issue is quantifiable.  What is "some time"?  If it's
>> people trying to get a 5% speedup on a rare operation because hey,
>> bugging the kernel developers doesn't cost me anything then perhaps we
>> have better things to be doing.
>
>Well, I set up a test environment on a larger system to get some
>numbers.  My 'load' on the system was filling the page cache with
>clean pages.  The thought is that these pages could easily be reclaimed.
>
>When trying to get numbers I hit a hugetlb page allocation stall where
>__alloc_pages(__GFP_RETRY_MAYFAIL, order 9) would stall forever (or at
>least an hour).  It was very much like the symptoms addressed here:
>https://lore.kernel.org/linux-mm/20190806014744.15446-1-mike.kravetz@oracle.com/
>
>This was on 5.14.0-rc6-next-20210820.
>
>I'll do some more digging as this appears to be some dark corner case of
>reclaim and/or compaction.  The 'good news' is that I can reproduce
>this.

I am on vacation until 1 Sep, with an ear on any light on the corner
cases.

Hillf
>
>> And the second problem would benefit from some words to help us
>> understand how much real-world hurt this causes, and how frequently.
>> And let's understand what the userspace workarounds look like, etc.
>
>The stall above was from doing a simple 'free 1GB page' followed by
>'allocate 512 MB pages' from userspace.
>
>Getting out another version of this series will be delayed, as I think
>we need to address or understand this issue first.
>-- 
>Mike Kravetz
Vlastimil Babka Aug. 27, 2021, 5:22 p.m. UTC | #13
On 8/25/21 00:08, Mike Kravetz wrote:
> Add Vlastimil and Hillf,
> 
> Well, I set up a test environment on a larger system to get some
> numbers.  My 'load' on the system was filling the page cache with
> clean pages.  The thought is that these pages could easily be reclaimed.
> 
> When trying to get numbers I hit a hugetlb page allocation stall where
> __alloc_pages(__GFP_RETRY_MAYFAIL, order 9) would stall forever (or at
> least an hour).  It was very much like the symptoms addressed here:
> https://lore.kernel.org/linux-mm/20190806014744.15446-1-mike.kravetz@oracle.com/
> 
> This was on 5.14.0-rc6-next-20210820.
> 
> I'll do some more digging as this appears to be some dark corner case of
> reclaim and/or compaction.  The 'good news' is that I can reproduce
> this.

Interesting, let's see if that's some kind of new regression.

>> And the second problem would benefit from some words to help us
>> understand how much real-world hurt this causes, and how frequently.
>> And let's understand what the userspace workarounds look like, etc.
> 
> The stall above was from doing a simple 'free 1GB page' followed by
> 'allocate 512 MB pages' from userspace.

Is the allocation different in any way from the usual hugepage allocation
possible today?

> Getting out another version of this series will be delayed, as I think
> we need to address or understand this issue first.
>
Mike Kravetz Aug. 27, 2021, 11:04 p.m. UTC | #14
On 8/27/21 10:22 AM, Vlastimil Babka wrote:
> On 8/25/21 00:08, Mike Kravetz wrote:
>> Add Vlastimil and Hillf,
>>
>> Well, I set up a test environment on a larger system to get some
>> numbers.  My 'load' on the system was filling the page cache with
>> clean pages.  The thought is that these pages could easily be reclaimed.
>>
>> When trying to get numbers I hit a hugetlb page allocation stall where
>> __alloc_pages(__GFP_RETRY_MAYFAIL, order 9) would stall forever (or at
>> least an hour).  It was very much like the symptoms addressed here:
>> https://lore.kernel.org/linux-mm/20190806014744.15446-1-mike.kravetz@oracle.com/
>>
>> This was on 5.14.0-rc6-next-20210820.
>>
>> I'll do some more digging as this appears to be some dark corner case of
>> reclaim and/or compaction.  The 'good news' is that I can reproduce
>> this.
> 
> Interesting, let's see if that's some kind of new regression.
> 
>>> And the second problem would benefit from some words to help us
>>> understand how much real-world hurt this causes, and how frequently.
>>> And let's understand what the userspace workarounds look like, etc.
>>
>> The stall above was from doing a simple 'free 1GB page' followed by
>> 'allocate 512 MB pages' from userspace.
> 
> Is the allocation different in any way than the usual hugepage allocation
> possible today?

No, it is the same.  I was just following the demote use case of free
1GB page and allocate 512 2MB pages.  Of course, you would mostly expect
that to succeed.  The exception is if something else is running and
grabs some of those 1GB worth of contiguous pages such that you can not
allocate 512 2MB pages.  My test case was to have those freed pages be
used for file I/O but kept clean so they could easily be reclaimed.

I 'may' have been over stressing the system with all CPUs doing file
reads to fill the page cache with clean pages.  I certainly need to
spend some more debug/analysis time on this.
Vlastimil Babka Aug. 30, 2021, 10:11 a.m. UTC | #15
On 8/28/21 01:04, Mike Kravetz wrote:
> On 8/27/21 10:22 AM, Vlastimil Babka wrote:
> I 'may' have been over stressing the system with all CPUs doing file
> reads to fill the page cache with clean pages.  I certainly need to
> spend some more debug/analysis time on this.

Hm that *could* play a role, as these will allow reclaim to make progress, but
also the reclaimed pages might be stolen immediately and compaction will return
COMPACT_SKIPPED and in should_compact_retry() we might go through this code path:

        /*      
         * compaction was skipped because there are not enough order-0 pages
         * to work with, so we retry only if it looks like reclaim can help.
         */
        if (compaction_needs_reclaim(compact_result)) {
                ret = compaction_zonelist_suitable(ac, order, alloc_flags);
                goto out;
        }       

where compaction_zonelist_suitable() will return true because it appears
reclaim can free pages to allow progress. And there are no max retries
applied for this case.
With the reclaim and compaction tracepoints it should be possible to
confirm this scenario.
Mike Kravetz Sept. 2, 2021, 6:17 p.m. UTC | #16
On 8/30/21 3:11 AM, Vlastimil Babka wrote:
> On 8/28/21 01:04, Mike Kravetz wrote:
>> On 8/27/21 10:22 AM, Vlastimil Babka wrote:
>> I 'may' have been over stressing the system with all CPUs doing file
>> reads to fill the page cache with clean pages.  I certainly need to
>> spend some more debug/analysis time on this.
> 
> Hm that *could* play a role, as these will allow reclaim to make progress, but
> also the reclaimed pages might be stolen immediately and compaction will return
> COMPACT_SKIPPED and in should_compact_retry() we might go through this code path:
> 
>         /*      
>          * compaction was skipped because there are not enough order-0 pages
>          * to work with, so we retry only if it looks like reclaim can help.
>          */
>         if (compaction_needs_reclaim(compact_result)) {
>                 ret = compaction_zonelist_suitable(ac, order, alloc_flags);
>                 goto out;
>         }       
> 
> where compaction_zonelist_suitable() will return true because it appears
> reclaim can free pages to allow progress. And there are no max retries
> applied for this case.
> With the reclaim and compaction tracepoints it should be possible to
> confirm this scenario.

Here is some very high level information from a long stall that was
interrupted.  This was an order 9 allocation from alloc_buddy_huge_page().

55269.530564] __alloc_pages_slowpath: jiffies 47329325 tries 609673 cpu_tries 1   node 0 FAIL
[55269.539893]     r_tries 25       c_tries 609647   reclaim 47325161 compact 607     

Yes, in __alloc_pages_slowpath for 47329325 jiffies before being interrupted.
should_reclaim_retry returned true 25 times and should_compact_retry returned
true 609647 times.
Almost all time (47325161 jiffies) spent in __alloc_pages_direct_reclaim, and
607 jiffies spent in __alloc_pages_direct_compact.

Looks like both
reclaim retries > MAX_RECLAIM_RETRIES
and
compaction retries > MAX_COMPACT_RETRIES
Vlastimil Babka Sept. 6, 2021, 2:40 p.m. UTC | #17
On 9/2/21 20:17, Mike Kravetz wrote:
> On 8/30/21 3:11 AM, Vlastimil Babka wrote:
>> On 8/28/21 01:04, Mike Kravetz wrote:
>>> On 8/27/21 10:22 AM, Vlastimil Babka wrote:
>>> I 'may' have been over stressing the system with all CPUs doing file
>>> reads to fill the page cache with clean pages.  I certainly need to
>>> spend some more debug/analysis time on this.
>>
>> Hm that *could* play a role, as these will allow reclaim to make progress, but
>> also the reclaimed pages might be stolen immediately and compaction will return
>> COMPACT_SKIPPED and in should_compact_retry() we might go through this code path:
>>
>>         /*      
>>          * compaction was skipped because there are not enough order-0 pages
>>          * to work with, so we retry only if it looks like reclaim can help.
>>          */
>>         if (compaction_needs_reclaim(compact_result)) {
>>                 ret = compaction_zonelist_suitable(ac, order, alloc_flags);
>>                 goto out;
>>         }       
>>
>> where compaction_zonelist_suitable() will return true because it appears
>> reclaim can free pages to allow progress. And there are no max retries
>> applied for this case.
>> With the reclaim and compaction tracepoints it should be possible to
>> confirm this scenario.
> 
> Here is some very high level information from a long stall that was
> interrupted.  This was an order 9 allocation from alloc_buddy_huge_page().
> 
> 55269.530564] __alloc_pages_slowpath: jiffies 47329325 tries 609673 cpu_tries 1   node 0 FAIL
> [55269.539893]     r_tries 25       c_tries 609647   reclaim 47325161 compact 607     
> 
> Yes, in __alloc_pages_slowpath for 47329325 jiffies before being interrupted.
> should_reclaim_retry returned true 25 times and should_compact_retry returned
> true 609647 times.
> Almost all time (47325161 jiffies) spent in __alloc_pages_direct_reclaim, and
> 607 jiffies spent in __alloc_pages_direct_compact.
> 
> Looks like both
> reclaim retries > MAX_RECLAIM_RETRIES
> and
> compaction retries > MAX_COMPACT_RETRIES
> 

Yeah AFAICS that's only possible with the scenario I suspected. I guess
we should put a limit on compact retries (maybe some multiple of
MAX_COMPACT_RETRIES) even if it thinks that reclaim could help, while
clearly it doesn't (i.e. because somebody else is stealing the page like
in your test case).
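
For example, something along these lines in the snippet quoted earlier
(purely an illustrative sketch, not a tested patch; the retry accounting
and the 4x multiplier are arbitrary):

	/*
	 * compaction was skipped because there are not enough order-0 pages
	 * to work with, so we retry only if it looks like reclaim can help --
	 * but only up to a point, in case the reclaimed pages keep being
	 * stolen before compaction can use them.
	 */
	if (compaction_needs_reclaim(compact_result)) {
		if ((*compaction_retries)++ > 4 * MAX_COMPACT_RETRIES) {
			ret = false;
			goto out;
		}
		ret = compaction_zonelist_suitable(ac, order, alloc_flags);
		goto out;
	}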
Hillf Danton Sept. 7, 2021, 8:50 a.m. UTC | #18
On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote:
>On 9/2/21 20:17, Mike Kravetz wrote:
>> 
>> Here is some very high level information from a long stall that was
>> interrupted.  This was an order 9 allocation from alloc_buddy_huge_page().
>> 
>> 55269.530564] __alloc_pages_slowpath: jiffies 47329325 tries 609673 cpu_tries 1   node 0 FAIL
>> [55269.539893]     r_tries 25       c_tries 609647   reclaim 47325161 compact 607     
>> 
>> Yes, in __alloc_pages_slowpath for 47329325 jiffies before being interrupted.
>> should_reclaim_retry returned true 25 times and should_compact_retry returned
>> true 609647 times.
>> Almost all time (47325161 jiffies) spent in __alloc_pages_direct_reclaim, and
>> 607 jiffies spent in __alloc_pages_direct_compact.
>> 
>> Looks like both
>> reclaim retries > MAX_RECLAIM_RETRIES
>> and
>> compaction retries > MAX_COMPACT_RETRIES
>> 
>Yeah AFAICS that's only possible with the scenario I suspected. I guess
>we should put a limit on compact retries (maybe some multiple of
>MAX_COMPACT_RETRIES) even if it thinks that reclaim could help, while
>clearly it doesn't (i.e. because somebody else is stealing the page like
>in your test case).

And/or clamp reclaim retries for costly orders

	reclaim retries = MAX_RECLAIM_RETRIES - order;

to pull down the chance for stall as low as possible.
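
For instance, roughly like this inside should_reclaim_retry()
(illustrative sketch only, not actual code):

	/*
	 * Shrink the reclaim retry budget as the order grows, so costly
	 * order allocations give up sooner instead of stalling.
	 */
	int max_reclaim_retries = MAX_RECLAIM_RETRIES;

	if (order > PAGE_ALLOC_COSTLY_ORDER)
		max_reclaim_retries = max_t(int, MAX_RECLAIM_RETRIES - order, 1);

	if (*no_progress_loops > max_reclaim_retries)
		return false;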
Mike Kravetz Sept. 8, 2021, 9 p.m. UTC | #19
On 9/7/21 1:50 AM, Hillf Danton wrote:
> On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote:
>> On 9/2/21 20:17, Mike Kravetz wrote:
>>>
>>> Here is some very high level information from a long stall that was
>>> interrupted.  This was an order 9 allocation from alloc_buddy_huge_page().
>>>
>>> 55269.530564] __alloc_pages_slowpath: jiffies 47329325 tries 609673 cpu_tries 1   node 0 FAIL
>>> [55269.539893]     r_tries 25       c_tries 609647   reclaim 47325161 compact 607     
>>>
>>> Yes, in __alloc_pages_slowpath for 47329325 jiffies before being interrupted.
>>> should_reclaim_retry returned true 25 times and should_compact_retry returned
>>> true 609647 times.
>>> Almost all time (47325161 jiffies) spent in __alloc_pages_direct_reclaim, and
>>> 607 jiffies spent in __alloc_pages_direct_compact.
>>>
>>> Looks like both
>>> reclaim retries > MAX_RECLAIM_RETRIES
>>> and
>>> compaction retries > MAX_COMPACT_RETRIES
>>>
>> Yeah AFAICS that's only possible with the scenario I suspected. I guess
>> we should put a limit on compact retries (maybe some multiple of
>> MAX_COMPACT_RETRIES) even if it thinks that reclaim could help, while
>> clearly it doesn't (i.e. because somebody else is stealing the page like
>> in your test case).
> 
> And/or clamp reclaim retries for costly orders
> 
> 	reclaim retries = MAX_RECLAIM_RETRIES - order;
> 
> to pull down the chance for stall as low as possible.

Thanks, and sorry for not replying quickly.  I only get back to this as
time allows.

We could clamp the number of compaction and reclaim retries in
__alloc_pages_slowpath as suggested.  However, I noticed that a single
reclaim call could take a bunch of time.  As a result, I instrumented
shrink_node to see what might be happening.  Here is some information
from a long stall.  Note that I only dump stats when jiffies > 100000.

[ 8136.874706] shrink_node: 507654 total jiffies,  3557110 tries
[ 8136.881130]              130596341 reclaimed, 32 nr_to_reclaim
[ 8136.887643]              compaction_suitable results:
[ 8136.893276]     idx COMPACT_SKIPPED, 3557109
[ 8672.399839] shrink_node: 522076 total jiffies,  3466228 tries
[ 8672.406268]              124427720 reclaimed, 32 nr_to_reclaim
[ 8672.412782]              compaction_suitable results:
[ 8672.418421]     idx COMPACT_SKIPPED, 3466227
[ 8908.099592] __alloc_pages_slowpath: jiffies 2939938  tries 17068 cpu_tries 1   node 0 success
[ 8908.109120]     r_tries 11       c_tries 17056    reclaim 2939865  compact 9

In this case, clamping the number of retries from should_compact_retry
and should_reclaim_retry could help.  Mostly because we will not be
calling back into the reclaim code?  Notice the long amount of time spent
in shrink_node.  The 'tries' in shrink_node come from this loop:

	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
				    sc))
		goto again;

The compaction_suitable results are the values returned from calls to
should_continue_reclaim -> compaction_suitable.

Trying to think if there might be an intelligent way to quit early.
Hillf Danton Sept. 9, 2021, 4:07 a.m. UTC | #20
On Wed, 8 Sep 2021 14:00:19 -0700 Mike Kravetz wrote:
>On 9/7/21 1:50 AM, Hillf Danton wrote:
>> On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote:
>>> On 9/2/21 20:17, Mike Kravetz wrote:
>>>>
>>>> Here is some very high level information from a long stall that was
>>>> interrupted.  This was an order 9 allocation from alloc_buddy_huge_page().
>>>>
>>>> 55269.530564] __alloc_pages_slowpath: jiffies 47329325 tries 609673 cpu_tries 1   node 0 FAIL
>>>> [55269.539893]     r_tries 25       c_tries 609647   reclaim 47325161 compact 607     
>>>>
>>>> Yes, in __alloc_pages_slowpath for 47329325 jiffies before being interrupted.
>>>> should_reclaim_retry returned true 25 times and should_compact_retry returned
>>>> true 609647 times.
>>>> Almost all time (47325161 jiffies) spent in __alloc_pages_direct_reclaim, and
>>>> 607 jiffies spent in __alloc_pages_direct_compact.
>>>>
>>>> Looks like both
>>>> reclaim retries > MAX_RECLAIM_RETRIES
>>>> and
>>>> compaction retries > MAX_COMPACT_RETRIES
>>>>
>>> Yeah AFAICS that's only possible with the scenario I suspected. I guess
>>> we should put a limit on compact retries (maybe some multiple of
>>> MAX_COMPACT_RETRIES) even if it thinks that reclaim could help, while
>>> clearly it doesn't (i.e. because somebody else is stealing the page like
>>> in your test case).
>> 
>> And/or clamp reclaim retries for costly orders
>> 
>> 	reclaim retries = MAX_RECLAIM_RETRIES - order;
>> 
>> to pull down the chance for stall as low as possible.
>
>Thanks, and sorry for not replying quickly.  I only get back to this as
>time allows.
>
>We could clamp the number of compaction and reclaim retries in
>__alloc_pages_slowpath as suggested.  However, I noticed that a single
>reclaim call could take a bunch of time.  As a result, I instrumented
>shrink_node to see what might be happening.  Here is some information
>from a long stall.  Note that I only dump stats when jiffies > 100000.
>
>[ 8136.874706] shrink_node: 507654 total jiffies,  3557110 tries
>[ 8136.881130]              130596341 reclaimed, 32 nr_to_reclaim
>[ 8136.887643]              compaction_suitable results:
>[ 8136.893276]     idx COMPACT_SKIPPED, 3557109
>[ 8672.399839] shrink_node: 522076 total jiffies,  3466228 tries
>[ 8672.406268]              124427720 reclaimed, 32 nr_to_reclaim
>[ 8672.412782]              compaction_suitable results:
>[ 8672.418421]     idx COMPACT_SKIPPED, 3466227
>[ 8908.099592] __alloc_pages_slowpath: jiffies 2939938  tries 17068 cpu_tries 1   node 0 success
>[ 8908.109120]     r_tries 11       c_tries 17056    reclaim 2939865  compact 9
>
>In this case, clamping the number of retries from should_compact_retry
>and should_reclaim_retry could help.  Mostly because we will not be
>calling back into the reclaim code?  Notice the long amount of time spent
>in shrink_node.  The 'tries' in shrink_node come about from that:
>
>	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
>				    sc))
>		goto again;
>
>compaction_suitable results is the values returned from calls to
>should_continue_reclaim -> compaction_suitable.
>
>Trying to think if there might be an intelligent way to quit early.

Given the downgrade of costly order to zero on the kswapd side, what you
found suggests the need to bridge the gap between sc->nr_to_reclaim and
compact_gap(sc->order) for direct reclaims.

If nr_to_reclaim is the primary target for direct reclaim, one option is
to ask kswapd to do costly order reclaims, with stalls handled by r_tries and
c_tries in the slowpath.

+++ x/mm/vmscan.c
@@ -3220,7 +3220,11 @@ again:
 
 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 				    sc))
-		goto again;
+		if (!current_is_kswapd() && sc->nr_reclaimed >=
+					    sc->nr_to_reclaim)
+			/* job done */ ;
+		else
+			goto again;
 
 	/*
 	 * Kswapd gives up on balancing particular nodes after too
Michal Hocko Sept. 9, 2021, 11:54 a.m. UTC | #21
On Wed 08-09-21 14:00:19, Mike Kravetz wrote:
> On 9/7/21 1:50 AM, Hillf Danton wrote:
> > On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote:
> >> On 9/2/21 20:17, Mike Kravetz wrote:
> >>>
> >>> Here is some very high level information from a long stall that was
> >>> interrupted.  This was an order 9 allocation from alloc_buddy_huge_page().
> >>>
> >>> 55269.530564] __alloc_pages_slowpath: jiffies 47329325 tries 609673 cpu_tries 1   node 0 FAIL
> >>> [55269.539893]     r_tries 25       c_tries 609647   reclaim 47325161 compact 607     
> >>>
> >>> Yes, in __alloc_pages_slowpath for 47329325 jiffies before being interrupted.
> >>> should_reclaim_retry returned true 25 times and should_compact_retry returned
> >>> true 609647 times.
> >>> Almost all time (47325161 jiffies) spent in __alloc_pages_direct_reclaim, and
> >>> 607 jiffies spent in __alloc_pages_direct_compact.
> >>>
> >>> Looks like both
> >>> reclaim retries > MAX_RECLAIM_RETRIES
> >>> and
> >>> compaction retries > MAX_COMPACT_RETRIES
> >>>
> >> Yeah AFAICS that's only possible with the scenario I suspected. I guess
> >> we should put a limit on compact retries (maybe some multiple of
> >> MAX_COMPACT_RETRIES) even if it thinks that reclaim could help, while
> >> clearly it doesn't (i.e. because somebody else is stealing the page like
> >> in your test case).
> > 
> > And/or clamp reclaim retries for costly orders
> > 
> > 	reclaim retries = MAX_RECLAIM_RETRIES - order;
> > 
> > to pull down the chance for stall as low as possible.
> 
> Thanks, and sorry for not replying quickly.  I only get back to this as
> time allows.
> 
> We could clamp the number of compaction and reclaim retries in
> __alloc_pages_slowpath as suggested.  However, I noticed that a single
> reclaim call could take a bunch of time.  As a result, I instrumented
> shrink_node to see what might be happening.  Here is some information
> from a long stall.  Note that I only dump stats when jiffies > 100000.
> 
> [ 8136.874706] shrink_node: 507654 total jiffies,  3557110 tries
> [ 8136.881130]              130596341 reclaimed, 32 nr_to_reclaim
> [ 8136.887643]              compaction_suitable results:
> [ 8136.893276]     idx COMPACT_SKIPPED, 3557109

Can you get a more detailed breakdown of where the time is spent?  Also,
how come the number of reclaimed pages is so excessive compared to the
reclaim target? There is something fishy going on here.
Vlastimil Babka Sept. 9, 2021, 1:45 p.m. UTC | #22
On 9/9/21 13:54, Michal Hocko wrote:
> On Wed 08-09-21 14:00:19, Mike Kravetz wrote:
>> On 9/7/21 1:50 AM, Hillf Danton wrote:
>> > On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote:
>> > 
>> > And/or clamp reclaim retries for costly orders
>> > 
>> > 	reclaim retries = MAX_RECLAIM_RETRIES - order;
>> > 
>> > to pull down the chance for stall as low as possible.
>> 
>> Thanks, and sorry for not replying quickly.  I only get back to this as
>> time allows.
>> 
>> We could clamp the number of compaction and reclaim retries in
>> __alloc_pages_slowpath as suggested.  However, I noticed that a single
>> reclaim call could take a bunch of time.  As a result, I instrumented
>> shrink_node to see what might be happening.  Here is some information
>> from a long stall.  Note that I only dump stats when jiffies > 100000.
>> 
>> [ 8136.874706] shrink_node: 507654 total jiffies,  3557110 tries
>> [ 8136.881130]              130596341 reclaimed, 32 nr_to_reclaim
>> [ 8136.887643]              compaction_suitable results:
>> [ 8136.893276]     idx COMPACT_SKIPPED, 3557109
> 
> Can you get a more detailed break down of where the time is spent. Also
> How come the number of reclaimed pages is so excessive comparing to the
> reclaim target? There is something fishy going on here.

I would say it's simply should_continue_reclaim() behaving similarly to
should_compact_retry(). We'll get compaction_suitable() returning
COMPACT_SKIPPED because the reclaimed pages have been immediately stolen,
and compaction indicates there are not enough base pages to begin with to form
a high-order page. Since the stolen pages will appear on the inactive lru, it
seems to be worth continuing reclaim to make enough free base pages for
compaction to no longer be skipped, because "inactive_lru_pages >
pages_for_compaction" is true.

So, both should_continue_reclaim() and should_compact_retry() are unable to
recognize that reclaimed pages are being stolen and limit the retries in
that case. The scenario seems to be uncommon, otherwise we'd be getting more
reports of that.
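
As a toy illustration of that feedback loop (a user-space model with
made-up numbers, not kernel code), the following program mimics reclaim
rounds whose pages are stolen immediately; it only stops once the
inactive LRU itself is nearly drained, after reclaiming vastly more than
the target:

#include <stdio.h>

/*
 * Toy model only: each round "reclaims" a few pages, a concurrent
 * allocator steals them right away, so free pages never reach the
 * compact_gap()-style threshold and the loop ends only when the
 * inactive LRU runs low.
 */
int main(void)
{
	unsigned int order = 9;				/* 2MB pages, 4KB base pages */
	unsigned long pages_for_compaction = 2UL << order;	/* a la compact_gap() */
	unsigned long inactive_lru = 4UL << 20;		/* ~16GB worth of base pages */
	unsigned long free_pages = 0, reclaimed = 0, rounds = 0;

	while (free_pages < pages_for_compaction &&
	       inactive_lru > pages_for_compaction) {
		unsigned long got = 32;			/* nr_to_reclaim per round */

		inactive_lru -= got;
		reclaimed += got;
		free_pages += got;
		free_pages = 0;				/* stolen immediately */
		rounds++;
	}
	printf("%lu rounds, %lu pages reclaimed for a %lu page target\n",
	       rounds, reclaimed, pages_for_compaction);
	return 0;
}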
Mike Kravetz Sept. 9, 2021, 9:31 p.m. UTC | #23
On 9/9/21 6:45 AM, Vlastimil Babka wrote:
> On 9/9/21 13:54, Michal Hocko wrote:
>> On Wed 08-09-21 14:00:19, Mike Kravetz wrote:
>>> On 9/7/21 1:50 AM, Hillf Danton wrote:
>>>> On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote:
>>>>
>>>> And/or clamp reclaim retries for costly orders
>>>>
>>>> 	reclaim retries = MAX_RECLAIM_RETRIES - order;
>>>>
>>>> to pull down the chance for stall as low as possible.
>>>
>>> Thanks, and sorry for not replying quickly.  I only get back to this as
>>> time allows.
>>>
>>> We could clamp the number of compaction and reclaim retries in
>>> __alloc_pages_slowpath as suggested.  However, I noticed that a single
>>> reclaim call could take a bunch of time.  As a result, I instrumented
>>> shrink_node to see what might be happening.  Here is some information
>>> from a long stall.  Note that I only dump stats when jiffies > 100000.
>>>
>>> [ 8136.874706] shrink_node: 507654 total jiffies,  3557110 tries
>>> [ 8136.881130]              130596341 reclaimed, 32 nr_to_reclaim
>>> [ 8136.887643]              compaction_suitable results:
>>> [ 8136.893276]     idx COMPACT_SKIPPED, 3557109
>>
>> Can you get a more detailed break down of where the time is spent. Also
>> How come the number of reclaimed pages is so excessive comparing to the
>> reclaim target? There is something fishy going on here.
> 
> I would say it's simply should_continue_reclaim() behaving similarly to
> should_compact_retry(). We'll get compaction_suitable() returning
> COMPACT_SKIPPED because the reclaimed pages have been immediately stolen,
> and compaction indicates there's not enough base pages to begin with to form
> a high-order pages. Since the stolen pages will appear on inactive lru, it
> seems to be worth continuing reclaim to make enough free base pages for
> compaction to no longer be skipped, because "inactive_lru_pages >
> pages_for_compaction" is true.
> 
> So, both should_continue_reclaim() and should_compact_retry() are unable to
> recognize that reclaimed pages are being stolen and limit the retries in
> that case. The scenario seems to be uncommon, otherwise we'd be getting more
> reports of that.

Yes, I believe this is what is happening.

I honestly do not know if my test/recreation scenario is realistic.

I do know that our DB team has had issues with allocating a number of
hugetlb pages (after much uptime) taking forever or a REALLY long time.
These are all 2MB huge page allocations, so going through the normal
page allocation code path.  No idea what else is running on the system
at the time of the allocation stalls.  Unfortunately, this cannot be
reproduced at will in their environment.  As a result, I have no data
and just this brief description of the issue.  When I stumbled on an
easy way to recreate it, I thought it would be worth investigating/fixing.

It certainly does not seem to be a common scenario.
Michal Hocko Sept. 10, 2021, 8:20 a.m. UTC | #24
On Thu 09-09-21 15:45:45, Vlastimil Babka wrote:
> On 9/9/21 13:54, Michal Hocko wrote:
> > On Wed 08-09-21 14:00:19, Mike Kravetz wrote:
> >> On 9/7/21 1:50 AM, Hillf Danton wrote:
> >> > On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote:
> >> > 
> >> > And/or clamp reclaim retries for costly orders
> >> > 
> >> > 	reclaim retries = MAX_RECLAIM_RETRIES - order;
> >> > 
> >> > to pull down the chance for stall as low as possible.
> >> 
> >> Thanks, and sorry for not replying quickly.  I only get back to this as
> >> time allows.
> >> 
> >> We could clamp the number of compaction and reclaim retries in
> >> __alloc_pages_slowpath as suggested.  However, I noticed that a single
> >> reclaim call could take a bunch of time.  As a result, I instrumented
> >> shrink_node to see what might be happening.  Here is some information
> >> from a long stall.  Note that I only dump stats when jiffies > 100000.
> >> 
> >> [ 8136.874706] shrink_node: 507654 total jiffies,  3557110 tries
> >> [ 8136.881130]              130596341 reclaimed, 32 nr_to_reclaim
> >> [ 8136.887643]              compaction_suitable results:
> >> [ 8136.893276]     idx COMPACT_SKIPPED, 3557109
> > 
> > Can you get a more detailed break down of where the time is spent. Also
> > How come the number of reclaimed pages is so excessive comparing to the
> > reclaim target? There is something fishy going on here.
> 
> I would say it's simply should_continue_reclaim() behaving similarly to
> should_compact_retry(). We'll get compaction_suitable() returning
> COMPACT_SKIPPED because the reclaimed pages have been immediately stolen,
> and compaction indicates there's not enough base pages to begin with to form
> a high-order pages. Since the stolen pages will appear on inactive lru, it
> seems to be worth continuing reclaim to make enough free base pages for
> compaction to no longer be skipped, because "inactive_lru_pages >
> pages_for_compaction" is true.
> 
> So, both should_continue_reclaim() and should_compact_retry() are unable to
> recognize that reclaimed pages are being stolen and limit the retries in
> that case. The scenario seems to be uncommon, otherwise we'd be getting more
> reports of that.

Thanks, this sounds like a viable explanation. For some reason I had
thought that a single reclaim round could take that much time, but reading
again it seems like a cumulative thing.

I do agree that we should consider the above scenario when bailing out.
It is simply unreasonable to reclaim so much memory without any forward
progress.

A more robust way to address this problem (which is not really new)
would be to privatize reclaimed pages in the direct reclaim context and
reuse them in the compaction so that it doesn't have to just
pro-actively reclaim to keep watermarks good enough. Normal low-order
requests would benefit from that as well because the pages they reclaim
couldn't be stolen from them either.

Another approach would be to detect the situation and treat reclaim
success which immediately leads to compaction deferral due to
watermarks as a failure.
Mike Kravetz Sept. 11, 2021, 12:11 a.m. UTC | #25
On 9/10/21 1:20 AM, Michal Hocko wrote:
> On Thu 09-09-21 15:45:45, Vlastimil Babka wrote:
>>
>> I would say it's simply should_continue_reclaim() behaving similarly to
>> should_compact_retry(). We'll get compaction_suitable() returning
>> COMPACT_SKIPPED because the reclaimed pages have been immediately stolen,
>> and compaction indicates there's not enough base pages to begin with to form
>> a high-order pages. Since the stolen pages will appear on inactive lru, it
>> seems to be worth continuing reclaim to make enough free base pages for
>> compaction to no longer be skipped, because "inactive_lru_pages >
>> pages_for_compaction" is true.
>>
>> So, both should_continue_reclaim() and should_compact_retry() are unable to
>> recognize that reclaimed pages are being stolen and limit the retries in
>> that case. The scenario seems to be uncommon, otherwise we'd be getting more
>> reports of that.
> 
> Thanks this sounds like a viable explanation. For some reason I have
> thought that a single reclaim round can take that much time but reading
> again it seems like a cumulative thing.
> 
> I do agree that we should consider the above scenario when bailing out.
> It is simply unreasonable to reclaim so much memory without any forward
> progress.

A very simple patch to bail early eliminated the LONG stalls and
decreased the time of a bulk allocation request.

Recall, the use case is converting 1G huge pages to the corresponding
amount of 2MB huge pages.  I have been told that 50GB is not an uncommon
amount of pages to 'convert'.  So, the test case is to free 50GB of 1GB
hugetlb pages and then allocate 25600 2MB hugetlb pages.  I have not done enough
runs to get anything statistically valid, but here are some ballpark
numbers.

Unmodified baseline:
-------------------
Unexpected number of 2MB pages after demote
   Expected 25600, have 19944

real    0m42.092s
user    0m0.008s
sys     0m41.467s

Retries capped by patch below:
------------------------------
Unexpected number of 2MB pages after demote
   Expected 25600, have 19395

real    0m12.951s
user    0m0.010s
sys     0m12.840s


The time of the operation is certainly better, but do note that we
allocated 549 fewer pages.  So, bailing early might cause some failures
that would not happen if we kept trying.  It is all a tradeoff.  One of
the reasons for considering demote functionality is that the conversion
would be done in the hugetlb code without getting the page allocator,
reclaim, compaction, etc. involved.
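
For reference, the test itself is nothing more than resizing the two
hugetlb pools through sysfs.  A minimal user-space driver (a sketch only;
it assumes the standard per-size nr_hugepages sysfs files, uses the counts
from the example above, and must be run as root) could look like this:

#include <stdio.h>
#include <time.h>

/* Sketch of the 1GB -> 2MB pool conversion test described above. */
static void set_pool(const char *path, long count)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return;
	}
	fprintf(f, "%ld\n", count);
	fclose(f);
}

static long get_pool(const char *path)
{
	long val = -1;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%ld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

int main(void)
{
	const char *pool_1g =
		"/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages";
	const char *pool_2m =
		"/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages";
	struct timespec t0, t1;

	set_pool(pool_1g, 0);		/* free the 50 preallocated 1GB pages */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	set_pool(pool_2m, 25600);	/* request 50GB worth of 2MB pages */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("got %ld of 25600 2MB pages in %.2fs\n", get_pool(pool_2m),
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}

Writing nr_hugepages sets the target pool size, so the second write is
what performs (and times) the 25600 order-9 allocations.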

> A more robust way to address this problem (which is not really new)
> would be to privatize reclaimed pages in the direct reclaim context and
> reuse them in the compaction so that it doesn't have to just
> pro-actively reclaim to keep watermarks good enough. A normal low order
> requests would benefit from that as well because the reclaim couldn't be
> stolen from them as well.

That does sound interesting.  Perhaps a longer term project.

> Another approach would be to detect the situation and treat reclaim
> success which immediatelly leads to compaction deferral due to
> watermarks as a failure.

I'll look into detecting and responding to this.

Simple limit retries patch:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eeb3a9c..16f3055 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4896,6 +4896,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	int no_progress_loops;
 	unsigned int cpuset_mems_cookie;
 	int reserve_flags;
+	unsigned long tries = 0, max_tries = 0;
 
 	/*
 	 * We also sanity check to catch abuse of atomic reserves being used by
@@ -4904,6 +4905,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
 				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
 		gfp_mask &= ~__GFP_ATOMIC;
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		max_tries = 1UL << order;
 
 retry_cpuset:
 	compaction_retries = 0;
@@ -4996,6 +4999,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	}
 
 retry:
+	tries++;
 	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
@@ -5064,8 +5068,18 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	if (did_some_progress > 0 &&
 			should_compact_retry(ac, order, alloc_flags,
 				compact_result, &compact_priority,
-				&compaction_retries))
+				&compaction_retries)) {
+		/*
+		 * In one pathological case, pages can be stolen immediately
+		 * after reclaimed.  It looks like we are making progress, and
+		 * compaction_retries is not incremented.  This could cause
+		 * an indefinite number of retries.  Cap the number of retries
+		 * for costly orders.
+		 */
+		if (max_tries && tries > max_tries)
+			goto nopage;
 		goto retry;
+	}
 
 
 	/* Deal with possible cpuset update races before we start OOM killing */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeae2f6..519e16e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2894,10 +2894,14 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	struct lruvec *target_lruvec;
 	bool reclaimable = false;
 	unsigned long file;
+	unsigned long tries = 0, max_tries = 0;
 
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+		max_tries = 1UL << sc->order;
 
 again:
+	tries++;
 	memset(&sc->nr, 0, sizeof(sc->nr));
 
 	nr_reclaimed = sc->nr_reclaimed;
@@ -3066,9 +3070,20 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		wait_iff_congested(BLK_RW_ASYNC, HZ/10);
 
 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
-				    sc))
+				    sc)) {
+		/*
+		 * In one pathological case, pages can be stolen immediately
+		 * after being reclaimed.  This can cause an indefinite number
+		 * of retries.  Limit the number of retries for costly orders.
+		 */
+		if (!current_is_kswapd() &&
+		    max_tries && tries > max_tries &&
+		    sc->nr_reclaimed > sc->nr_to_reclaim)
+			goto done;
 		goto again;
+	}
 
+done:
 	/*
 	 * Kswapd gives up on balancing particular nodes after too
 	 * many failures to reclaim anything from them and goes to
Hillf Danton Sept. 11, 2021, 3:11 a.m. UTC | #26
On Fri, 10 Sep 2021 17:11:05 -0700 Mike Kravetz wrote:
>On 9/10/21 1:20 AM, Michal Hocko wrote:
>> On Thu 09-09-21 15:45:45, Vlastimil Babka wrote:
>>>
>>> I would say it's simply should_continue_reclaim() behaving similarly to
>>> should_compact_retry(). We'll get compaction_suitable() returning
>>> COMPACT_SKIPPED because the reclaimed pages have been immediately stolen,
>>> and compaction indicates there's not enough base pages to begin with to form
>>> a high-order pages. Since the stolen pages will appear on inactive lru, it
>>> seems to be worth continuing reclaim to make enough free base pages for
>>> compaction to no longer be skipped, because "inactive_lru_pages >
>>> pages_for_compaction" is true.
>>>
>>> So, both should_continue_reclaim() and should_compact_retry() are unable to
>>> recognize that reclaimed pages are being stolen and limit the retries in
>>> that case. The scenario seems to be uncommon, otherwise we'd be getting more
>>> reports of that.
>> 
>> Thanks this sounds like a viable explanation. For some reason I have
>> thought that a single reclaim round can take that much time but reading
>> again it seems like a cumulative thing.
>> 
>> I do agree that we should consider the above scenario when bailing out.
>> It is simply unreasonable to reclaim so much memory without any forward
>> progress.
>
>A very simple patch to bail early eliminated the LONG stalls and
>decreased the time of a bulk allocation request.
>
>Recall, the use case is converting 1G huge pages to the corresponding
>amount of 2MB huge pages.  I have been told that 50GB is not an uncommon
>amount of pages to 'convert'.  So, test case is free 50GB hugetlb pages
>followed by allocate 25600 2MB hugetlb pages.  I have not done enough
>runs to get anything statistically valid, but here are some ballpark
>numbers.
>
>Unmodified baseline:
>-------------------
>Unexpected number of 2MB pages after demote
>   Expected 25600, have 19944
>
>real    0m42.092s
>user    0m0.008s
>sys     0m41.467s
>
>Retries capped by patch below:
>------------------------------
>Unexpected number of 2MB pages after demote
>   Expected 25600, have 19395
>
>real    0m12.951s
>user    0m0.010s
>sys     0m12.840s
>
>
>The time of the operation is certainly better, but do note that we
>allocated 549 fewer pages.  So, bailing early might cause some failures
>if we continue trying.  It is all a tradeoff.

A hard one to strike, really, with all parties' considerations met.

>One of the reasons for
>considering demote functionality is that the conversion would be done in
>the hugetlb code without getting the page allocator, recalim,
>compaction, etc involved.

Particularly when this involvement is a must-have.

Given the role of stolen pages in the stall, one way to bisect the problem
is to throttle the concurrent bulk allocators before taking a hammer to
the stall itself.

static int hugetlb_2M_page_allocators;
static DECLARE_WAIT_QUEUE_HEAD(hugetlb_2M_page_allocators_wq);
static DEFINE_MUTEX(hugetlb_2M_page_allocators_mutex);

static bool inc_hugetlb_2M_page_allocators(void)
{
	bool rc = false;

	mutex_lock(&hugetlb_2M_page_allocators_mutex);
	/* admit at most two concurrent 2MB allocators (a limit picked for this sketch) */
	if (hugetlb_2M_page_allocators < 2) {
		rc = true;
		++hugetlb_2M_page_allocators;
	}
	mutex_unlock(&hugetlb_2M_page_allocators_mutex);

	return rc;
}

static void dec_hugetlb_2M_page_allocators(void)
{
	mutex_lock(&hugetlb_2M_page_allocators_mutex);
	--hugetlb_2M_page_allocators;
	wake_up(&hugetlb_2M_page_allocators_wq);
	mutex_unlock(&hugetlb_2M_page_allocators_mutex);
}

static struct page *throttle_hugetlb_2M_page_allocators(void)
{
	struct page *page;

	wait_event(hugetlb_2M_page_allocators_wq,
		   true == inc_hugetlb_2M_page_allocators());

	/* gfp_mask, order2M, nid and nmask stand for the caller's usual
	 * allocation arguments in this sketch. */
	page = __alloc_pages_nodemask(gfp_mask, order2M, nid, nmask);

	dec_hugetlb_2M_page_allocators();
	return page;
}

>
>> A more robust way to address this problem (which is not really new)
>> would be to privatize reclaimed pages in the direct reclaim context and
>> reuse them in the compaction so that it doesn't have to just
>> pro-actively reclaim to keep watermarks good enough. A normal low order
>> requests would benefit from that as well because the reclaim couldn't be
>> stolen from them as well.
>
>That does sound interesting.  Perhaps a longer term project.
>
>> Another approach would be to detect the situation and treat reclaim
>> success which immediatelly leads to compaction deferral due to
>> watermarks as a failure.
>
>I'll look into detecting and responding to this.
>
>Simple limit retries patch:
>
>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>index eeb3a9c..16f3055 100644
>--- a/mm/page_alloc.c
>+++ b/mm/page_alloc.c
>@@ -4896,6 +4896,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> 	int no_progress_loops;
> 	unsigned int cpuset_mems_cookie;
> 	int reserve_flags;
>+	unsigned long tries = 0, max_tries = 0;
> 
> 	/*
> 	 * We also sanity check to catch abuse of atomic reserves being used by
>@@ -4904,6 +4905,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> 	if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
> 				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
> 		gfp_mask &= ~__GFP_ATOMIC;
>+	if (order > PAGE_ALLOC_COSTLY_ORDER)
>+		max_tries = 1Ul << order;
> 
> retry_cpuset:
> 	compaction_retries = 0;
>@@ -4996,6 +4999,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> 	}
> 
> retry:
>+	tries++;
> 	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
> 	if (alloc_flags & ALLOC_KSWAPD)
> 		wake_all_kswapds(order, gfp_mask, ac);
>@@ -5064,8 +5068,18 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> 	if (did_some_progress > 0 &&
> 			should_compact_retry(ac, order, alloc_flags,
> 				compact_result, &compact_priority,
>-				&compaction_retries))
>+				&compaction_retries)) {
>+		/*
>+		 * In one pathological case, pages can be stolen immediately
>+		 * after reclaimed.  It looks like we are making progress, and
>+		 * compaction_retries is not incremented.  This could cause
>+		 * an indefinite number of retries.  Cap the number of retries
>+		 * for costly orders.
>+		 */
>+		if (max_tries && tries > max_tries)
>+			goto nopage;
> 		goto retry;
>+	}
> 
> 
> 	/* Deal with possible cpuset update races before we start OOM killing */
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index eeae2f6..519e16e 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -2894,10 +2894,14 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> 	struct lruvec *target_lruvec;
> 	bool reclaimable = false;
> 	unsigned long file;
>+	unsigned long tries = 0, max_tries = 0;
> 
> 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
>+	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
>+		max_tries = 1UL << sc->order;
> 
> again:
>+	tries++;
> 	memset(&sc->nr, 0, sizeof(sc->nr));
> 
> 	nr_reclaimed = sc->nr_reclaimed;
>@@ -3066,9 +3070,20 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> 		wait_iff_congested(BLK_RW_ASYNC, HZ/10);
> 
> 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
>-				    sc))
>+				    sc)) {
>+		/*
>+		 * In one pathological case, pages can be stolen immediately
>+		 * after being reclaimed.  This can cause an indefinite number
>+		 * of retries.  Limit the number of retries for costly orders.
>+		 */
>+		if (!current_is_kswapd() &&
>+		    max_tries && tries > max_tries &&
>+		    sc->nr_reclaimed > sc->nr_to_reclaim)
>+			goto done;
> 		goto again;
>+	}
> 
>+done:
> 	/*
> 	 * Kswapd gives up on balancing particular nodes after too
> 	 * many failures to reclaim anything from them and goes to
>
>
>-- 
>Mike Kravetz
Michal Hocko Sept. 13, 2021, 3:50 p.m. UTC | #27
On Fri 10-09-21 17:11:05, Mike Kravetz wrote:
[...]
> @@ -5064,8 +5068,18 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	if (did_some_progress > 0 &&
>  			should_compact_retry(ac, order, alloc_flags,
>  				compact_result, &compact_priority,
> -				&compaction_retries))
> +				&compaction_retries)) {
> +		/*
> +		 * In one pathological case, pages can be stolen immediately
> +		 * after reclaimed.  It looks like we are making progress, and
> +		 * compaction_retries is not incremented.  This could cause
> +		 * an indefinite number of retries.  Cap the number of retries
> +		 * for costly orders.
> +		 */
> +		if (max_tries && tries > max_tries)
> +			goto nopage;
>  		goto retry;
> +	}

I do not think this is a good approach. We do not want to play with
retry numbers. If we want to go with a minimal change for now, then the
compaction feedback mechanism should track the number of pages reclaimed
to satisfy watermarks and, if that grows beyond what is reasonable
(proportionally to the request size), simply give up rather than keep
trying again and again.
Mike Kravetz Sept. 15, 2021, 4:57 p.m. UTC | #28
On 9/13/21 8:50 AM, Michal Hocko wrote:
> On Fri 10-09-21 17:11:05, Mike Kravetz wrote:
> [...]
>> @@ -5064,8 +5068,18 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>>  	if (did_some_progress > 0 &&
>>  			should_compact_retry(ac, order, alloc_flags,
>>  				compact_result, &compact_priority,
>> -				&compaction_retries))
>> +				&compaction_retries)) {
>> +		/*
>> +		 * In one pathological case, pages can be stolen immediately
>> +		 * after reclaimed.  It looks like we are making progress, and
>> +		 * compaction_retries is not incremented.  This could cause
>> +		 * an indefinite number of retries.  Cap the number of retries
>> +		 * for costly orders.
>> +		 */
>> +		if (max_tries && tries > max_tries)
>> +			goto nopage;
>>  		goto retry;
>> +	}
> 
> I do not think this is a good approach. We do not want to play with
> retries numbers. If we want to go with a minimal change for now then the
> compaction feedback mechanism should track the number of reclaimed pages
> to satisfy watermarks and if that grows beyond reasonable (proportionaly
> to the request size) then simply give up rather than keep trying again
> and again.

I am experimenting with code similar to the patch below.  The idea is to
key off 'COMPACT_SKIPPED' being returned.  This implies that not enough
pages are available for compaction.  One of the values in this calculation
is compact_gap(), which is scaled based on allocation order.  The simple
patch is to give up if the number of reclaimed pages is greater than
compact_gap().  The thought is that this implies pages are being stolen
before compaction can use them.

Such a patch does help in my specific test.  However, the concern is
that giving up earlier may regress some workloads.  One could of course
come up with a scenario where one more retry would result in success.
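
For scale, with 4KB base pages a 2MB request is order 9 and compact_gap()
works out to 2UL << order, so the cutoff used by the patch below is a
fairly small number of pages (illustrative arithmetic only):

#include <stdio.h>

/* Illustrative arithmetic only, assuming compact_gap(order) == 2UL << order. */
int main(void)
{
	unsigned int order = 9;			/* 2MB page, 4KB base pages */
	unsigned long gap = 2UL << order;	/* compact_gap(order) */

	printf("compact_gap(%u) = %lu pages = %lu MB\n",
	       order, gap, gap * 4096 / (1024 * 1024));	/* 1024 pages, 4 MB */
	return 0;
}

That bounds reclaim at roughly twice the size of the request, instead of
the ~130 million pages seen in the instrumentation above.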

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eeb3a9c..f8e3b0e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4413,6 +4413,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 
 static inline bool
 should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
+		     unsigned long nr_reclaimed,
 		     enum compact_result compact_result,
 		     enum compact_priority *compact_priority,
 		     int *compaction_retries)
@@ -4445,7 +4446,16 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	 * to work with, so we retry only if it looks like reclaim can help.
 	 */
 	if (compaction_needs_reclaim(compact_result)) {
-		ret = compaction_zonelist_suitable(ac, order, alloc_flags);
+		/*
+		 * If we reclaimed enough pages, but they are not available
+		 * now, this implies they were stolen.  Do not reclaim again
+		 * as this could go on indefinitely.
+		 */
+		if (nr_reclaimed > compact_gap(order))
+			ret = false;
+		else
+			ret = compaction_zonelist_suitable(ac, order,
+								alloc_flags);
 		goto out;
 	}
 
@@ -4504,6 +4514,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 static inline bool
 should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
+		     unsigned long nr_reclaimed,
 		     enum compact_result compact_result,
 		     enum compact_priority *compact_priority,
 		     int *compaction_retries)
 {
@@ -4890,6 +4901,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	struct page *page = NULL;
 	unsigned int alloc_flags;
 	unsigned long did_some_progress;
+	unsigned long nr_reclaimed;
 	enum compact_priority compact_priority;
 	enum compact_result compact_result;
 	int compaction_retries;
@@ -5033,6 +5045,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 							&did_some_progress);
 	if (page)
 		goto got_pg;
+	nr_reclaimed = did_some_progress;
 
 	/* Try direct compaction and then allocating */
 	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
@@ -5063,7 +5076,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	 */
 	if (did_some_progress > 0 &&
 			should_compact_retry(ac, order, alloc_flags,
-				compact_result, &compact_priority,
+				nr_reclaimed, compact_result, &compact_priority,
 				&compaction_retries))
 		goto retry;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeae2f6..d40eca5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2819,10 +2819,18 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 	}
 
 	/*
+	 * If we have reclaimed enough pages for compaction, and compaction
+	 * is not ready, this implies pages are being stolen as they are being
+	 * reclaimed.  Quit now instead of continual retrying.
+	 */
+	pages_for_compaction = compact_gap(sc->order);
+	if (!current_is_kswapd() && sc->nr_reclaimed > pages_for_compaction)
+		return false;
+
+	/*
 	 * If we have not reclaimed enough pages for compaction and the
 	 * inactive lists are large enough, continue reclaiming
 	 */
-	pages_for_compaction = compact_gap(sc->order);
 	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
 	if (get_nr_swap_pages() > 0)
 		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
Mike Kravetz Sept. 17, 2021, 8:44 p.m. UTC | #29
On 9/15/21 9:57 AM, Mike Kravetz wrote:
> On 9/13/21 8:50 AM, Michal Hocko wrote:
>> I do not think this is a good approach. We do not want to play with
>> retries numbers. If we want to go with a minimal change for now then the
>> compaction feedback mechanism should track the number of reclaimed pages
>> to satisfy watermarks and if that grows beyond reasonable (proportionaly
>> to the request size) then simply give up rather than keep trying again
>> and again.
> 
> I am experimenting with code similar to the patch below.  The idea is to
> key off of 'COMPACT_SKIPPED' being returned.  This implies that not enough
> pages are available for compaction.  One of the values in this calculation
> is compact_gap() which is scaled based on allocation order.  The simple
> patch is to give up if number of reclaimed pages is greater than
> compact_gap().  The thought is that this implies pages are being stolen
> before compaction.
> 
> Such a patch does help in my specific test.  However, the concern is
> that giving up earlier may regress some workloads.  One could of course
> come up with a scenario where one more retry would result in success.

With the patch below, I instrumented the number of times shrink_node and
__alloc_pages_slowpath would 'quit earlier than before'.  In my test (free
1GB pages, allocate 2MB pages), this early exit was exercised a significant
number of times:

shrink_node:
        931124 quick exits

__alloc_pages_slowpath:
        bail early 32 times

For comparison, I ran the LTP mm tests mostly because they include a few
OOM test scenarios.  Test results were unchanged, and the numbers for
exiting early were much lower than in my hugetlb test:

shrink_node:
        57 quick exits

__alloc_pages_slowpath:
        bail early 0 times

This at least 'looks' like the changes target the page stealing corner
case in my test, and minimally impact other scenarios.

Any suggestions for other tests to run, or tweaks to the code?