
[RFC,0/3] hugetlb: add demote/split page functionality

Message ID 20210309001855.142453-1-mike.kravetz@oracle.com (mailing list archive)
Series hugetlb: add demote/split page functionality

Message

Mike Kravetz March 9, 2021, 12:18 a.m. UTC
The concurrent use of multiple hugetlb page sizes on a single system
is becoming more common.  One of the reasons is better TLB support for
gigantic page sizes on x86 hardware.  In addition, hugetlb pages are
being used to back VMs in hosting environments.

When using hugetlb pages to back VMs in such environments, it is
sometimes desirable to preallocate hugetlb pools.  This avoids the delay
and uncertainty of allocating hugetlb pages at VM startup.  In addition,
preallocating huge pages minimizes the issue of memory fragmentation that
increases the longer the system is up and running.

In such environments, a combination of larger and smaller hugetlb pages
are preallocated in anticipation of backing VMs of various sizes.  Over
time, the preallocated pool of smaller hugetlb pages may become
depleted while larger hugetlb pages still remain.  In such situations,
it may be desirable to convert larger hugetlb pages to smaller hugetlb
pages.

Converting larger to smaller hugetlb pages can be accomplished today by
first freeing the larger page to the buddy allocator and then allocating
the smaller pages.  However, there are two issues with this approach:
1) This process can take quite some time, especially if allocation of
   the smaller pages is not immediate and requires migration/compaction.
2) There is no guarantee that the total size of smaller pages allocated
   will match the size of the larger page which was freed.  This is
   because the area freed by the larger page could quickly be
   fragmented.
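
For concreteness, this two-step reconfiguration is driven entirely through the
per-size nr_hugepages sysfs files.  Below is a minimal userspace sketch of the
process; the paths follow the standard /sys/kernel/mm/hugepages layout, the
absolute pool sizes written are only an example (a starting pool of 16 1G and
10240 2M pages is assumed), and error handling is trimmed:

#include <stdio.h>

/* Sketch of today's two-step pool reconfiguration via nr_hugepages. */
static int write_pool(const char *path, unsigned long nr)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fprintf(f, "%lu\n", nr);
        return fclose(f);
}

int main(void)
{
        /* Step 1: shrink the 1G pool by one page; it is freed to buddy. */
        write_pool("/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages",
                   15);
        /*
         * Step 2: grow the 2M pool by 512 pages.  This may be slow
         * (migration/compaction) and may fall short if the freed area is
         * already fragmented -- the two issues listed above.
         */
        write_pool("/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages",
                   10752);
        return 0;
}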

To address these issues, introduce the concept of hugetlb page demotion.
Demotion provides a means of 'in place' splitting a hugetlb page to
pages of a smaller size.  For example, on x86 one 1G page can be
demoted to 512 2M pages.  Page demotion is controlled via sysfs files.
- demote_size	Read only target page size for demotion
- demote	Writable number of hugetlb pages to be demoted

Only hugetlb pages which are free at the time of the request can be demoted.
Demotion does not add to the complexity of surplus pages.  Demotion also
honors reserved huge pages.  Therefore, when a value is written to the sysfs
demote file, that value is only the maximum number of pages which will be
demoted.  It is possible fewer will actually be demoted.

If demote_size is PAGESIZE, demote will simply free pages to the buddy
allocator.
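
As a usage sketch of the proposed interface: the demote_size and demote file
names come from this series, while the enclosing hugepages-1048576kB directory
is assumed from the existing per-size hugetlb sysfs layout.  Illustrative
only, with minimal error handling:

#include <stdio.h>

#define H1G "/sys/kernel/mm/hugepages/hugepages-1048576kB/"

int main(void)
{
        char size[64] = "(unknown)\n";
        FILE *f;

        /* demote_size: read-only size that demoted 1G pages will become. */
        f = fopen(H1G "demote_size", "r");
        if (f) {
                fgets(size, sizeof(size), f);
                fclose(f);
        }
        printf("1G pages demote to: %s", size);

        /* demote: request demotion of up to 4 free 1G pages, in place. */
        f = fopen(H1G "demote", "w");
        if (!f) {
                perror("demote");
                return 1;
        }
        fprintf(f, "4\n");
        fclose(f);
        return 0;
}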

Mike Kravetz (3):
  hugetlb: add demote hugetlb page sysfs interfaces
  hugetlb: add HPageCma flag and code to free non-gigantic pages in CMA
  hugetlb: add hugetlb demote page support

 include/linux/hugetlb.h |   8 ++
 mm/hugetlb.c            | 199 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 204 insertions(+), 3 deletions(-)

Comments

David Hildenbrand March 9, 2021, 9:01 a.m. UTC | #1
On 09.03.21 01:18, Mike Kravetz wrote:
> The concurrent use of multiple hugetlb page sizes on a single system
> is becoming more common.  One of the reasons is better TLB support for
> gigantic page sizes on x86 hardware.  In addition, hugetlb pages are
> being used to back VMs in hosting environments.
> 
> When using hugetlb pages to back VMs in such environments, it is
> sometimes desirable to preallocate hugetlb pools.  This avoids the delay
> and uncertainty of allocating hugetlb pages at VM startup.  In addition,
> preallocating huge pages minimizes the issue of memory fragmentation that
> increases the longer the system is up and running.
> 
> In such environments, a combination of larger and smaller hugetlb pages
> are preallocated in anticipation of backing VMs of various sizes.  Over
> time, the preallocated pool of smaller hugetlb pages may become
> depleted while larger hugetlb pages still remain.  In such situations,
> it may be desirable to convert larger hugetlb pages to smaller hugetlb
> pages.
> 
> Converting larger to smaller hugetlb pages can be accomplished today by
> first freeing the larger page to the buddy allocator and then allocating
> the smaller pages.  However, there are two issues with this approach:
> 1) This process can take quite some time, especially if allocation of
>     the smaller pages is not immediate and requires migration/compaction.
> 2) There is no guarantee that the total size of smaller pages allocated
>     will match the size of the larger page which was freed.  This is
>     because the area freed by the larger page could quickly be
>     fragmented.
> 
> To address these issues, introduce the concept of hugetlb page demotion.
> Demotion provides a means of 'in place' splitting a hugetlb page to
> pages of a smaller size.  For example, on x86 one 1G page can be
> demoted to 512 2M pages.  Page demotion is controlled via sysfs files.
> - demote_size	Read only target page size for demotion
> - demote	Writable number of hugetlb pages to be demoted
> 
> Only hugetlb pages which are free at the time of the request can be demoted.
> Demotion does not add to the complexity surplus pages.  Demotion also honors
> reserved huge pages.  Therefore, when a value is written to the sysfs demote
> file that value is only the maximum number of pages which will be demoted.
> It is possible fewer will actually be demoted.
> 
> If demote_size is PAGESIZE, demote will simply free pages to the buddy
> allocator.

With the vmemmap optimizations you will have to rework the vmemmap 
layout. How is that handled? Couldn't it happen that you are half-way 
through splitting a PUD into PMDs when you realize that you cannot 
allocate vmemmap pages for properly handling the remaining PMDs? What 
would happen then?

Or are you planning on making both features mutually exclusive?

Of course, one approach would be first completely restoring the vmemmap 
for the whole PUD (allocating more pages than necessary in the end) and 
then freeing individual pages again when optimizing the layout per PMD.
Mike Kravetz March 9, 2021, 5:11 p.m. UTC | #2
On 3/9/21 1:01 AM, David Hildenbrand wrote:
> On 09.03.21 01:18, Mike Kravetz wrote:
>> To address these issues, introduce the concept of hugetlb page demotion.
>> Demotion provides a means of 'in place' splitting a hugetlb page to
>> pages of a smaller size.  For example, on x86 one 1G page can be
>> demoted to 512 2M pages.  Page demotion is controlled via sysfs files.
>> - demote_size    Read only target page size for demotion
>> - demote    Writable number of hugetlb pages to be demoted
>>
>> Only hugetlb pages which are free at the time of the request can be demoted.
>> Demotion does not add to the complexity surplus pages.  Demotion also honors
>> reserved huge pages.  Therefore, when a value is written to the sysfs demote
>> file that value is only the maximum number of pages which will be demoted.
>> It is possible fewer will actually be demoted.
>>
>> If demote_size is PAGESIZE, demote will simply free pages to the buddy
>> allocator.
> 
> With the vmemmap optimizations you will have to rework the vmemmap layout. How is that handled? Couldn't it happen that you are half-way through splitting a PUD into PMDs when you realize that you cannot allocate vmemmap pages for properly handling the remaining PMDs? What would happen then?
> 
> Or are you planning on making both features mutually exclusive?
> 
> Of course, one approach would be first completely restoring the vmemmap for the whole PUD (allocating more pages than necessary in the end) and then freeing individual pages again when optimizing the layout per PMD.
> 

You are right about the need to address this issue.  Patch 3 has the
comment:

+	/*
+	 * Note for future:
+	 * When support for reducing vmemmap of huge pages is added, we
+	 * will need to allocate vmemmap pages here and could fail.
+	 */

The simplest approach would be to restore the entire vmemmap for the
larger page and then delete vmemmap pages for the smaller pages after the
split.  We could hook into the existing vmemmap reduction code with just a
few calls.
This would fail to demote/split, if the allocation fails.  However, this
is not optimal.

Ideally, the code would compute how many vmemmap pages are needed
after the split, allocate those, and then construct the vmemmap
appropriately when creating the smaller pages.

I think we would want to always do the allocation of vmemmap pages up
front and not even start the split process if the allocation fails.  No
sense starting something we may not be able to finish.
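
A userspace back-of-the-envelope sketch of that up-front computation,
assuming 4K base pages, a 64-byte struct page, and a placeholder of one
retained vmemmap page per optimized hugetlb page (the real reserve would come
from the vmemmap reduction series, not from anything in this RFC):

#include <stdio.h>

#define PAGE_SIZE       4096UL
#define STRUCT_PAGE     64UL
#define RESERVED        1UL     /* assumed vmemmap pages kept per hugepage */

/* Number of vmemmap pages backing one un-optimized huge page. */
static unsigned long vmemmap_pages(unsigned long hpage_size)
{
        return hpage_size / PAGE_SIZE * STRUCT_PAGE / PAGE_SIZE;
}

int main(void)
{
        unsigned long gig = 1UL << 30, huge = 2UL << 20;
        unsigned long nr_split = gig / huge;            /* 512 on x86 */
        unsigned long have = RESERVED;                  /* for the 1G page */
        unsigned long need = nr_split * RESERVED;       /* for 512 2M pages */

        printf("full vmemmap: 1G = %lu pages, 2M = %lu pages\n",
               vmemmap_pages(gig), vmemmap_pages(huge));
        printf("optimized: allocate %lu vmemmap pages before the split\n",
               need - have);
        return 0;
}

With those assumptions, demoting one optimized 1G page into 512 optimized 2M
pages needs roughly 511 additional vmemmap pages allocated before the split
starts.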

I purposely did not address that here, as I first wanted to get feedback
on the usefulness of the demote functionality.
David Hildenbrand March 9, 2021, 5:50 p.m. UTC | #3
On 09.03.21 18:11, Mike Kravetz wrote:
> On 3/9/21 1:01 AM, David Hildenbrand wrote:
>> On 09.03.21 01:18, Mike Kravetz wrote:
>>> To address these issues, introduce the concept of hugetlb page demotion.
>>> Demotion provides a means of 'in place' splitting a hugetlb page to
>>> pages of a smaller size.  For example, on x86 one 1G page can be
>>> demoted to 512 2M pages.  Page demotion is controlled via sysfs files.
>>> - demote_size    Read only target page size for demotion
>>> - demote    Writable number of hugetlb pages to be demoted
>>>
>>> Only hugetlb pages which are free at the time of the request can be demoted.
>>> Demotion does not add to the complexity surplus pages.  Demotion also honors
>>> reserved huge pages.  Therefore, when a value is written to the sysfs demote
>>> file that value is only the maximum number of pages which will be demoted.
>>> It is possible fewer will actually be demoted.
>>>
>>> If demote_size is PAGESIZE, demote will simply free pages to the buddy
>>> allocator.
>>
>> With the vmemmap optimizations you will have to rework the vmemmap layout. How is that handled? Couldn't it happen that you are half-way through splitting a PUD into PMDs when you realize that you cannot allocate vmemmap pages for properly handling the remaining PMDs? What would happen then?
>>
>> Or are you planning on making both features mutually exclusive?
>>
>> Of course, one approach would be first completely restoring the vmemmap for the whole PUD (allocating more pages than necessary in the end) and then freeing individual pages again when optimizing the layout per PMD.
>>
> 
> You are right about the need to address this issue.  Patch 3 has the
> comment:
> 
> +	/*
> +	 * Note for future:
> +	 * When support for reducing vmemmap of huge pages is added, we
> +	 * will need to allocate vmemmap pages here and could fail.
> +	 */
> 

I only skimmed over the cover letter so far. :)

> The simplest approach would be to restore the entire vmemmmap for the
> larger page and then delete for smaller pages after the split.  We could
> hook into the existing vmemmmap reduction code with just a few calls.
> This would fail to demote/split, if the allocation fails.  However, this
> is not optimal.
> 
> Ideally, the code would compute how many pages for vmemmmap are needed
> after the split, allocate those and then construct vmmemmap
> appropriately when creating the smaller pages.
> 
> I think we would want to always do the allocation of vmmemmap pages up
> front and not even start the split process if the allocation fails.  No
> sense starting something we may not be able to finish.
> 

Makes sense.

Another case might also be interesting: Assume you allocated a gigantic 
page via CMA and demoted it to huge pages. Theoretically (after Oscar's 
series!), we could come back later and re-allocate a gigantic page via 
CMA, migrating all now-hugepages out of the CMA region. Would require 
telling CMA that that area is effectively no longer allocated via CMA 
(adjusting accounting, bitmaps, etc).

That would actually be a neat use case to form new gigantic pages later 
on when necessary :)

But I assume your primary use case is demoting gigantic pages allocated 
during boot, not via CMA.

Maybe you addressed that already as well :)

> I purposely did not address that here as first I wanted to get feedback
> on the usefulness demote functionality.
> 

Makes sense. I think there could be some value in having this 
functionality. Gigantic pages are rare and we might want to keep them as 
long as possible (and as long as we have sufficient free memory). But 
once we need huge pages (e.g., smaller VMs, different granularity 
requirements), we could demote.

If we ever have pre-zeroing of huge/gigantic pages, your approach could 
also avoid having to zero huge pages again when the gigantic page was 
already zeroed.
Mike Kravetz March 9, 2021, 6:21 p.m. UTC | #4
On 3/9/21 9:50 AM, David Hildenbrand wrote:
> On 09.03.21 18:11, Mike Kravetz wrote:
>> On 3/9/21 1:01 AM, David Hildenbrand wrote:
>>> On 09.03.21 01:18, Mike Kravetz wrote:
>>>> To address these issues, introduce the concept of hugetlb page demotion.
>>>> Demotion provides a means of 'in place' splitting a hugetlb page to
>>>> pages of a smaller size.  For example, on x86 one 1G page can be
>>>> demoted to 512 2M pages.  Page demotion is controlled via sysfs files.
>>>> - demote_size    Read only target page size for demotion
>>>> - demote    Writable number of hugetlb pages to be demoted
>>>>
>>>> Only hugetlb pages which are free at the time of the request can be demoted.
>>>> Demotion does not add to the complexity surplus pages.  Demotion also honors
>>>> reserved huge pages.  Therefore, when a value is written to the sysfs demote
>>>> file that value is only the maximum number of pages which will be demoted.
>>>> It is possible fewer will actually be demoted.
>>>>
>>>> If demote_size is PAGESIZE, demote will simply free pages to the buddy
>>>> allocator.
>>>
>>> With the vmemmap optimizations you will have to rework the vmemmap layout. How is that handled? Couldn't it happen that you are half-way through splitting a PUD into PMDs when you realize that you cannot allocate vmemmap pages for properly handling the remaining PMDs? What would happen then?
>>>
>>> Or are you planning on making both features mutually exclusive?
>>>
>>> Of course, one approach would be first completely restoring the vmemmap for the whole PUD (allocating more pages than necessary in the end) and then freeing individual pages again when optimizing the layout per PMD.
>>>
>>
>> You are right about the need to address this issue.  Patch 3 has the
>> comment:
>>
>> +    /*
>> +     * Note for future:
>> +     * When support for reducing vmemmap of huge pages is added, we
>> +     * will need to allocate vmemmap pages here and could fail.
>> +     */
>>
> 
> I only skimmed over the cover letter so far. :)
> 
>> The simplest approach would be to restore the entire vmemmmap for the
>> larger page and then delete for smaller pages after the split.  We could
>> hook into the existing vmemmmap reduction code with just a few calls.
>> This would fail to demote/split, if the allocation fails.  However, this
>> is not optimal.
>>
>> Ideally, the code would compute how many pages for vmemmmap are needed
>> after the split, allocate those and then construct vmmemmap
>> appropriately when creating the smaller pages.
>>
>> I think we would want to always do the allocation of vmmemmap pages up
>> front and not even start the split process if the allocation fails.  No
>> sense starting something we may not be able to finish.
>>
> 
> Makes sense.
> 
> Another case might also be interesting: Assume you allocated a gigantic page via CMA and denoted it to huge pages. Theoretically (after Oscar's series!), we could come back later and re-allocate a gigantic page via CMA, migrating all now-hugepages out of the CMA region. Would require telling CMA that that area is effectively no longer allocated via CMA (adjusting accounting, bitmaps, etc).
> 

I need to take a close look at Oscar's patches.  Too many things to look
at/review :)

This series does take into account gigantic pages allocated in CMA.
Such pages can be demoted, and we need to track that they need to go
back to CMA.  Nothing super special for this, mostly a new hugetlb
specific flag to track such pages.

> That would actually be a neat use case to form new gigantic pages later on when necessary :)
> 
> But I assume your primary use case is denoting gigantic pages allocated during boot, not via CMA.
> 
> Maybe you addresses that already as well :)

Yup.

> 
>> I purposely did not address that here as first I wanted to get feedback
>> on the usefulness demote functionality.
>>
> 
> Makes sense. I think there could be some value in having this functionality. Gigantic pages are rare and we might want to keep them as long as possible (and as long as we have sufficient free memory). But once we need huge pages (e.g., smaller VMs, different granularity requiremets), we could denote.
> 

That is exactly the use case of one of our product groups.  Hoping to get
feedback from others who might be doing something similar.

> If we ever have pre-zeroing of huge/gigantic pages, your approach could also avoid having to zero huge pages again when the gigantic page was already zeroed.
> 

Having a user option to pre-zero hugetlb pages is also on my 'to do'
list.  We now have hugetlb specific page flags to help with tracking.
David Hildenbrand March 9, 2021, 7:01 p.m. UTC | #5
> I need to take a close look at Oscar's patches.  Too many thing to look
> at/review :)
> 
> This series does take into account gigantic pages allocated in CMA.
> Such pages can be demoted, and we need to track that they need to go
> back to CMA.  Nothing super special for this, mostly a new hugetlb
> specific flag to track such pages.

Ah, just spotted it - patch #2 :)

Took me a while to figure out that we end up calling 
cma_declare_contiguous_nid() with order_per_bit=0 - would have thought 
we would be using the actual smallest allocation order we end up using 
for huge/gigantic pages via CMA. Well, this way it "simply works".
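
For anyone else puzzling over that detail: order_per_bit only sets the
granularity of CMA's accounting bitmap, one bit per (1 << order_per_bit) base
pages.  A quick sketch of what that means for a 1G area with 4K base pages;
order_per_bit=0 is the case described above, order_per_bit=9 is shown purely
for comparison:

#include <stdio.h>

#define PAGE_SHIFT 12

/* One bitmap bit covers (1 << order_per_bit) base pages. */
static unsigned long cma_bitmap_bits(unsigned long area_bytes,
                                     unsigned int order_per_bit)
{
        return (area_bytes >> PAGE_SHIFT) >> order_per_bit;
}

int main(void)
{
        unsigned long area = 1UL << 30;         /* a 1G CMA area */

        /* order_per_bit=0: page granularity, as described above. */
        printf("order_per_bit=0: %lu bitmap bits\n", cma_bitmap_bits(area, 0));
        /* order_per_bit=9: 2M granularity, one bit per 2M huge page. */
        printf("order_per_bit=9: %lu bitmap bits\n", cma_bitmap_bits(area, 9));
        return 0;
}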
Oscar Salvador March 10, 2021, 3:58 p.m. UTC | #6
On Mon, Mar 08, 2021 at 04:18:52PM -0800, Mike Kravetz wrote:
> The concurrent use of multiple hugetlb page sizes on a single system
> is becoming more common.  One of the reasons is better TLB support for
> gigantic page sizes on x86 hardware.  In addition, hugetlb pages are
> being used to back VMs in hosting environments.
> 
> When using hugetlb pages to back VMs in such environments, it is
> sometimes desirable to preallocate hugetlb pools.  This avoids the delay
> and uncertainty of allocating hugetlb pages at VM startup.  In addition,
> preallocating huge pages minimizes the issue of memory fragmentation that
> increases the longer the system is up and running.
> 
> In such environments, a combination of larger and smaller hugetlb pages
> are preallocated in anticipation of backing VMs of various sizes.  Over
> time, the preallocated pool of smaller hugetlb pages may become
> depleted while larger hugetlb pages still remain.  In such situations,
> it may be desirable to convert larger hugetlb pages to smaller hugetlb
> pages.

Hi Mike,

The usecase sounds neat.

> 
> Converting larger to smaller hugetlb pages can be accomplished today by
> first freeing the larger page to the buddy allocator and then allocating
> the smaller pages.  However, there are two issues with this approach:
> 1) This process can take quite some time, especially if allocation of
>    the smaller pages is not immediate and requires migration/compaction.
> 2) There is no guarantee that the total size of smaller pages allocated
>    will match the size of the larger page which was freed.  This is
>    because the area freed by the larger page could quickly be
>    fragmented.
> 
> To address these issues, introduce the concept of hugetlb page demotion.
> Demotion provides a means of 'in place' splitting a hugetlb page to
> pages of a smaller size.  For example, on x86 one 1G page can be
> demoted to 512 2M pages.  Page demotion is controlled via sysfs files.
> - demote_size	Read only target page size for demotion

What about those archs where we have more than two hugetlb sizes?
IIRC, in powerpc you can have that, right?
If so, would it make sense for demote_size to be writable so you can pick
the size? 


> - demote	Writable number of hugetlb pages to be demoted

Below you mention that, due to reservations, the number of demoted pages can
be less than what the admin specified.
Would it make sense to have a place where someone can check how many pages
actually got demoted?
Or will this follow nr_hugepages' scheme and always reflect the number of
currently demoted pages?

> Only hugetlb pages which are free at the time of the request can be demoted.
> Demotion does not add to the complexity surplus pages.  Demotion also honors
> reserved huge pages.  Therefore, when a value is written to the sysfs demote
> file that value is only the maximum number of pages which will be demoted.
> It is possible fewer will actually be demoted.
> 
> If demote_size is PAGESIZE, demote will simply free pages to the buddy
> allocator.

Wrt. vmemmap discussion with David.
I also think we could compute how many vmemmap pages we are going to need to
re-shape the vmemmap layout and allocate those upfront.
And I think this approach would just be simpler.

I plan to have a look at the patches later today or tomorrow.

Thanks
Michal Hocko March 10, 2021, 4:23 p.m. UTC | #7
On Mon 08-03-21 16:18:52, Mike Kravetz wrote:
[...]
> Converting larger to smaller hugetlb pages can be accomplished today by
> first freeing the larger page to the buddy allocator and then allocating
> the smaller pages.  However, there are two issues with this approach:
> 1) This process can take quite some time, especially if allocation of
>    the smaller pages is not immediate and requires migration/compaction.
> 2) There is no guarantee that the total size of smaller pages allocated
>    will match the size of the larger page which was freed.  This is
>    because the area freed by the larger page could quickly be
>    fragmented.

It will likely not surprise you that I have some level of reservation. While
your concerns about reconfiguration via the existing interfaces are quite
real, is this really a problem in practice? How often do you need such a
reconfiguration?

Is this all really worth the additional code to something as tricky as the
hugetlb code base?

>  include/linux/hugetlb.h |   8 ++
>  mm/hugetlb.c            | 199 +++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 204 insertions(+), 3 deletions(-)
> 
> -- 
> 2.29.2
>
Zi Yan March 10, 2021, 4:46 p.m. UTC | #8
On 10 Mar 2021, at 11:23, Michal Hocko wrote:

> On Mon 08-03-21 16:18:52, Mike Kravetz wrote:
> [...]
>> Converting larger to smaller hugetlb pages can be accomplished today by
>> first freeing the larger page to the buddy allocator and then allocating
>> the smaller pages.  However, there are two issues with this approach:
>> 1) This process can take quite some time, especially if allocation of
>>    the smaller pages is not immediate and requires migration/compaction.
>> 2) There is no guarantee that the total size of smaller pages allocated
>>    will match the size of the larger page which was freed.  This is
>>    because the area freed by the larger page could quickly be
>>    fragmented.
>
> I will likely not surprise to show some level of reservation. While your
> concerns about reconfiguration by existing interfaces are quite real is
> this really a problem in practice? How often do you need such a
> reconfiguration?
>
> Is this all really worth the additional code to something as tricky as
> hugetlb code base?
>
>>  include/linux/hugetlb.h |   8 ++
>>  mm/hugetlb.c            | 199 +++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 204 insertions(+), 3 deletions(-)
>>
>> -- 
>> 2.29.2
>>

The high-level goal of this patchset seems to be enabling flexible huge page
allocation from a single pool when multiple huge page sizes are available
to use. The limitation of the existing mechanism is that the user has to
specify how many huge pages and how many gigantic pages he/she wants
before the actual use.

I just want to throw an idea here, please ignore if it is too crazy.
Could we have a variant buddy allocator for huge page allocations,
which only has available huge page orders in the free list? For example,
if user wants 2MB and 1GB pages, the allocator will only have order-9 and
order-19 pages; when order-9 pages run out, we can split order-19 pages;
if possible, adjacent order-9 pages can be merged back to order-19 pages.
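
To make the shape of that idea concrete, a toy userspace sketch follows: a
pool with free lists only for the configured huge page sizes, where a free
gigantic block is split lazily when the smaller list runs dry.  It ignores
locking, merging, and accounting, and the 512:1 split ratio simply assumes
x86's 2M and 1G sizes:

#include <stdio.h>
#include <stdlib.h>

#define PAGES_PER_2M    512UL   /* 2M / 4K base pages */
#define SPLIT_RATIO     512UL   /* 2M blocks per 1G block */

struct block { unsigned long pfn; struct block *next; };

struct hpage_pool { struct block *free_2m, *free_1g; };

static void push(struct block **list, unsigned long pfn)
{
        struct block *b = malloc(sizeof(*b));

        b->pfn = pfn;
        b->next = *list;
        *list = b;
}

static long pop(struct block **list)
{
        struct block *b = *list;
        long pfn;

        if (!b)
                return -1;
        pfn = b->pfn;
        *list = b->next;
        free(b);
        return pfn;
}

/* Take a 2M block; split one free 1G block if the 2M list is empty. */
static long alloc_2m(struct hpage_pool *pool)
{
        if (!pool->free_2m) {
                long gpfn = pop(&pool->free_1g);
                unsigned long i;

                if (gpfn < 0)
                        return -1;
                for (i = 0; i < SPLIT_RATIO; i++)
                        push(&pool->free_2m, gpfn + i * PAGES_PER_2M);
        }
        return pop(&pool->free_2m);
}

int main(void)
{
        struct hpage_pool pool = { NULL, NULL };

        push(&pool.free_1g, 0x100000);  /* one pre-reserved 1G block */
        printf("2M block at pfn %#lx (1G block split on demand)\n",
               (unsigned long)alloc_2m(&pool));
        printf("2M block at pfn %#lx (no further split needed)\n",
               (unsigned long)alloc_2m(&pool));
        return 0;
}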


—
Best Regards,
Yan Zi
Michal Hocko March 10, 2021, 5:05 p.m. UTC | #9
On Wed 10-03-21 11:46:57, Zi Yan wrote:
> On 10 Mar 2021, at 11:23, Michal Hocko wrote:
> 
> > On Mon 08-03-21 16:18:52, Mike Kravetz wrote:
> > [...]
> >> Converting larger to smaller hugetlb pages can be accomplished today by
> >> first freeing the larger page to the buddy allocator and then allocating
> >> the smaller pages.  However, there are two issues with this approach:
> >> 1) This process can take quite some time, especially if allocation of
> >>    the smaller pages is not immediate and requires migration/compaction.
> >> 2) There is no guarantee that the total size of smaller pages allocated
> >>    will match the size of the larger page which was freed.  This is
> >>    because the area freed by the larger page could quickly be
> >>    fragmented.
> >
> > I will likely not surprise to show some level of reservation. While your
> > concerns about reconfiguration by existing interfaces are quite real is
> > this really a problem in practice? How often do you need such a
> > reconfiguration?
> >
> > Is this all really worth the additional code to something as tricky as
> > hugetlb code base?
> >
> >>  include/linux/hugetlb.h |   8 ++
> >>  mm/hugetlb.c            | 199 +++++++++++++++++++++++++++++++++++++++-
> >>  2 files changed, 204 insertions(+), 3 deletions(-)
> >>
> >> -- 
> >> 2.29.2
> >>
> 
> The high level goal of this patchset seems to enable flexible huge page
> allocation from a single pool, when multiple huge page sizes are available
> to use. The limitation of existing mechanism is that user has to specify
> how many huge pages he/she wants and how many gigantic pages he/she wants
> before the actual use.

I believe I have understood this part. And I am not questioning that.
This seems useful. I am mostly asking whether we need such flexibility,
mostly because of the additional code and future maintenance complexity,
which has turned out to be a problem for a long time. Each new feature
tends to just add on top of the existing complexity.

> I just want to throw an idea here, please ignore if it is too crazy.
> Could we have a variant buddy allocator for huge page allocations,
> which only has available huge page orders in the free list? For example,
> if user wants 2MB and 1GB pages, the allocator will only have order-9 and
> order-19 pages; when order-9 pages run out, we can split order-19 pages;
> if possible, adjacent order-9 pages can be merged back to order-19 pages.

I assume you mean to remove those pages from the allocator when they
are reserved rather than when they are really used, right? I am not really
sure how you want to deal with lower orders consuming/splitting too much
from higher orders, which then makes those unusable even though they were
preallocated for a specific workload. Another worry is that the gap between
2MB and 1GB pages is just too big, so a single 2MB request from the 1GB
pool will make a whole 1GB page unusable even when the smaller pool needs
only a few pages.
Zi Yan March 10, 2021, 5:36 p.m. UTC | #10
On 10 Mar 2021, at 12:05, Michal Hocko wrote:

> On Wed 10-03-21 11:46:57, Zi Yan wrote:
>> On 10 Mar 2021, at 11:23, Michal Hocko wrote:
>>
>>> On Mon 08-03-21 16:18:52, Mike Kravetz wrote:
>>> [...]
>>>> Converting larger to smaller hugetlb pages can be accomplished today by
>>>> first freeing the larger page to the buddy allocator and then allocating
>>>> the smaller pages.  However, there are two issues with this approach:
>>>> 1) This process can take quite some time, especially if allocation of
>>>>    the smaller pages is not immediate and requires migration/compaction.
>>>> 2) There is no guarantee that the total size of smaller pages allocated
>>>>    will match the size of the larger page which was freed.  This is
>>>>    because the area freed by the larger page could quickly be
>>>>    fragmented.
>>>
>>> I will likely not surprise to show some level of reservation. While your
>>> concerns about reconfiguration by existing interfaces are quite real is
>>> this really a problem in practice? How often do you need such a
>>> reconfiguration?
>>>
>>> Is this all really worth the additional code to something as tricky as
>>> hugetlb code base?
>>>
>>>>  include/linux/hugetlb.h |   8 ++
>>>>  mm/hugetlb.c            | 199 +++++++++++++++++++++++++++++++++++++++-
>>>>  2 files changed, 204 insertions(+), 3 deletions(-)
>>>>
>>>> -- 
>>>> 2.29.2
>>>>
>>
>> The high level goal of this patchset seems to enable flexible huge page
>> allocation from a single pool, when multiple huge page sizes are available
>> to use. The limitation of existing mechanism is that user has to specify
>> how many huge pages he/she wants and how many gigantic pages he/she wants
>> before the actual use.
>
> I believe I have understood this part. And I am not questioning that.
> This seems useful. I am mostly asking whether we need such a
> flexibility. Mostly because of the additional code and future
> maintenance complexity which has turned to be a problem for a long time.
> Each new feature tends to just add on top of the existing complexity.

I totally agree. This patchset looks to me like a partial functional
replication of splitting high-order free pages into lower-order ones in the
buddy allocator. That is why I had the crazy idea below.

>
>> I just want to throw an idea here, please ignore if it is too crazy.
>> Could we have a variant buddy allocator for huge page allocations,
>> which only has available huge page orders in the free list? For example,
>> if user wants 2MB and 1GB pages, the allocator will only have order-9 and
>> order-19 pages; when order-9 pages run out, we can split order-19 pages;
>> if possible, adjacent order-9 pages can be merged back to order-19 pages.
>
> I assume you mean to remove those pages from the allocator when they
> are reserved rather than really used, right? I am not really sure how

No. The allocator maintains all the reserved pages for huge page allocations,
replacing the existing cma_alloc or alloc_contig_pages. The kernel builds
the free list when pages are reserved, either at boot time or at runtime.

> you want to deal with lower orders consuming/splitting too much from
> higher orders which then makes those unusable for the use even though
> they were preallocated for a specific workload. Another worry is that a
> gap between 2MB and 1GB pages is just too big so a single 2MB request
> from 1G pool will make the whole 1GB page unusable even when the smaller
> pool needs few pages.

Yeah, the gap between 2MB and 1GB is large. The fragmentation will be
a problem. Maybe we do not need it right now, since this patchset does not
propose promoting/merging pages. Or we can reuse the existing
anti-fragmentation mechanisms, but with the pageblock size set to the
gigantic page size in this pool.

I admit my idea is a much more intrusive change, but I feel that if more
functional replications of core mm are being added to the hugetlb code, then
why not reuse the core mm code.


—
Best Regards,
Yan Zi
Mike Kravetz March 10, 2021, 7:45 p.m. UTC | #11
On 3/10/21 8:23 AM, Michal Hocko wrote:
> On Mon 08-03-21 16:18:52, Mike Kravetz wrote:
> [...]
>> Converting larger to smaller hugetlb pages can be accomplished today by
>> first freeing the larger page to the buddy allocator and then allocating
>> the smaller pages.  However, there are two issues with this approach:
>> 1) This process can take quite some time, especially if allocation of
>>    the smaller pages is not immediate and requires migration/compaction.
>> 2) There is no guarantee that the total size of smaller pages allocated
>>    will match the size of the larger page which was freed.  This is
>>    because the area freed by the larger page could quickly be
>>    fragmented.
> 
> I will likely not surprise to show some level of reservation. While your
> concerns about reconfiguration by existing interfaces are quite real is
> this really a problem in practice? How often do you need such a
> reconfiguration?

In reply to one of David's comments, I mentioned that we have a product
group with this use case today.  They use hugetlb pages to back VMs,
and preallocate a 'best guess' number of pages of each order.  They
can only guess how many pages of each order are needed because they are
responding to dynamic requests for new VMs.

When they find themselves in this situation today, they free 1G pages to
buddy and try to allocate the corresponding number of 2M pages.  The
concerns above were mentioned/experienced by this group.

Part of the reason for the RFC was to see if others might have similar
use cases.  With newer x86 processors, I hear about more people using
1G hugetlb pages.  I also hear about people using hugetlb pages to back
VMs.  So, I was thinking others may have similar use cases.

> Is this all really worth the additional code to something as tricky as
> hugetlb code base?
> 

The 'good news' is that this does not involve much tricky code.  It only
demotes free hugetlb pages.  Of course, it is only worth it if the new code
is actually going to be used.  I know of at least one use case.
Mike Kravetz March 10, 2021, 7:56 p.m. UTC | #12
On 3/10/21 8:46 AM, Zi Yan wrote:
> The high level goal of this patchset seems to enable flexible huge page
> allocation from a single pool, when multiple huge page sizes are available
> to use. The limitation of existing mechanism is that user has to specify
> how many huge pages he/she wants and how many gigantic pages he/she wants
> before the actual use.
> 
> I just want to throw an idea here, please ignore if it is too crazy.
> Could we have a variant buddy allocator for huge page allocations,
> which only has available huge page orders in the free list? For example,
> if user wants 2MB and 1GB pages, the allocator will only have order-9 and
> order-19 pages; when order-9 pages run out, we can split order-19 pages;
> if possible, adjacent order-9 pages can be merged back to order-19 pages.

The idea is not crazy, but I think it is more functionality than we want
to throw at hugetlb.

IIRC, the default qemu huge page configuration uses THP.  Ideally, we
would have support for 1G THP pages and the user would not need to think
about any of this.  The kernel would back the VM with huge pages of the
appropriate size for best performance.

That may sound crazy, but I think it may be the looooong term goal.