Message ID: 20210309001855.142453-1-mike.kravetz@oracle.com
Series: hugetlb: add demote/split page functionality
On 09.03.21 01:18, Mike Kravetz wrote: > The concurrent use of multiple hugetlb page sizes on a single system > is becoming more common. One of the reasons is better TLB support for > gigantic page sizes on x86 hardware. In addition, hugetlb pages are > being used to back VMs in hosting environments. > > When using hugetlb pages to back VMs in such environments, it is > sometimes desirable to preallocate hugetlb pools. This avoids the delay > and uncertainty of allocating hugetlb pages at VM startup. In addition, > preallocating huge pages minimizes the issue of memory fragmentation that > increases the longer the system is up and running. > > In such environments, a combination of larger and smaller hugetlb pages > are preallocated in anticipation of backing VMs of various sizes. Over > time, the preallocated pool of smaller hugetlb pages may become > depleted while larger hugetlb pages still remain. In such situations, > it may be desirable to convert larger hugetlb pages to smaller hugetlb > pages. > > Converting larger to smaller hugetlb pages can be accomplished today by > first freeing the larger page to the buddy allocator and then allocating > the smaller pages. However, there are two issues with this approach: > 1) This process can take quite some time, especially if allocation of > the smaller pages is not immediate and requires migration/compaction. > 2) There is no guarantee that the total size of smaller pages allocated > will match the size of the larger page which was freed. This is > because the area freed by the larger page could quickly be > fragmented. > > To address these issues, introduce the concept of hugetlb page demotion. > Demotion provides a means of 'in place' splitting a hugetlb page to > pages of a smaller size. For example, on x86 one 1G page can be > demoted to 512 2M pages. Page demotion is controlled via sysfs files. > - demote_size Read only target page size for demotion > - demote Writable number of hugetlb pages to be demoted > > Only hugetlb pages which are free at the time of the request can be demoted. > Demotion does not add to the complexity surplus pages. Demotion also honors > reserved huge pages. Therefore, when a value is written to the sysfs demote > file that value is only the maximum number of pages which will be demoted. > It is possible fewer will actually be demoted. > > If demote_size is PAGESIZE, demote will simply free pages to the buddy > allocator. With the vmemmap optimizations you will have to rework the vmemmap layout. How is that handled? Couldn't it happen that you are half-way through splitting a PUD into PMDs when you realize that you cannot allocate vmemmap pages for properly handling the remaining PMDs? What would happen then? Or are you planning on making both features mutually exclusive? Of course, one approach would be first completely restoring the vmemmap for the whole PUD (allocating more pages than necessary in the end) and then freeing individual pages again when optimizing the layout per PMD.
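To make the proposed interface concrete, a minimal userspace sketch of how the two knobs named in the cover letter might be exercised. Only the file names demote_size and demote come from the cover letter; placing them in the existing per-size directories under /sys/kernel/mm/hugepages/ is an assumption, not something the patches state.

/*
 * Hedged sketch only: the paths below assume the proposed "demote_size" and
 * "demote" files would sit in the existing per-size hugetlb sysfs
 * directories.  That placement is an assumption, not taken from the patches.
 */
#include <stdio.h>
#include <stdlib.h>

#define HSTATE_DIR "/sys/kernel/mm/hugepages/hugepages-1048576kB"

int main(void)
{
	char buf[64];
	FILE *f;

	/* Read-only target size the 1G pool would demote to (e.g. 2048kB). */
	f = fopen(HSTATE_DIR "/demote_size", "r");
	if (!f || !fgets(buf, sizeof(buf), f)) {
		perror("demote_size");
		return EXIT_FAILURE;
	}
	printf("demote_size: %s", buf);
	fclose(f);

	/*
	 * Ask for up to 2 free 1G pages to be demoted in place.  Per the
	 * cover letter this is only a maximum; fewer (or zero) pages may be
	 * demoted if the pool lacks free, unreserved pages.
	 */
	f = fopen(HSTATE_DIR "/demote", "w");
	if (!f || fputs("2\n", f) == EOF) {
		perror("demote");
		return EXIT_FAILURE;
	}
	fclose(f);
	return EXIT_SUCCESS;
}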
On 3/9/21 1:01 AM, David Hildenbrand wrote: > On 09.03.21 01:18, Mike Kravetz wrote: >> To address these issues, introduce the concept of hugetlb page demotion. >> Demotion provides a means of 'in place' splitting a hugetlb page to >> pages of a smaller size. For example, on x86 one 1G page can be >> demoted to 512 2M pages. Page demotion is controlled via sysfs files. >> - demote_size Read only target page size for demotion >> - demote Writable number of hugetlb pages to be demoted >> >> Only hugetlb pages which are free at the time of the request can be demoted. >> Demotion does not add to the complexity surplus pages. Demotion also honors >> reserved huge pages. Therefore, when a value is written to the sysfs demote >> file that value is only the maximum number of pages which will be demoted. >> It is possible fewer will actually be demoted. >> >> If demote_size is PAGESIZE, demote will simply free pages to the buddy >> allocator. > > With the vmemmap optimizations you will have to rework the vmemmap layout. How is that handled? Couldn't it happen that you are half-way through splitting a PUD into PMDs when you realize that you cannot allocate vmemmap pages for properly handling the remaining PMDs? What would happen then? > > Or are you planning on making both features mutually exclusive? > > Of course, one approach would be first completely restoring the vmemmap for the whole PUD (allocating more pages than necessary in the end) and then freeing individual pages again when optimizing the layout per PMD. > You are right about the need to address this issue. Patch 3 has the comment: + /* + * Note for future: + * When support for reducing vmemmap of huge pages is added, we + * will need to allocate vmemmap pages here and could fail. + */ The simplest approach would be to restore the entire vmemmap for the larger page and then delete for smaller pages after the split. We could hook into the existing vmemmap reduction code with just a few calls. This would fail to demote/split if the allocation fails. However, this is not optimal. Ideally, the code would compute how many pages for vmemmap are needed after the split, allocate those and then construct vmemmap appropriately when creating the smaller pages. I think we would want to always do the allocation of vmemmap pages up front and not even start the split process if the allocation fails. No sense starting something we may not be able to finish. I purposely did not address that here, as first I wanted to get feedback on the usefulness of the demote functionality.
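The "compute up front and fail early" option Mike describes boils down to simple arithmetic over the vmemmap. The standalone sketch below shows that accounting under stated assumptions: 4 KiB base pages, a 64-byte struct page, and a placeholder for how many vmemmap pages the optimization keeps resident per hugetlb page (that constant belongs to the vmemmap-reduction series, not to these patches).

/*
 * Standalone sketch of the "allocate vmemmap pages up front, fail the demote
 * early" accounting.  Assumptions (not taken from the patches):
 *   - 4 KiB base pages and a 64-byte struct page;
 *   - the vmemmap optimization keeps VMEMMAP_KEPT_PER_HPAGE vmemmap pages
 *     resident per optimized hugetlb page, regardless of its size.  Treat
 *     the value used here as a placeholder.
 */
#include <stdio.h>

#define PAGE_SIZE              4096UL
#define STRUCT_PAGE_SIZE       64UL
#define VMEMMAP_KEPT_PER_HPAGE 1UL   /* placeholder assumption */

/* vmemmap pages needed to describe every base page of one huge page */
static unsigned long full_vmemmap_pages(unsigned long hpage_size)
{
	return (hpage_size / PAGE_SIZE) * STRUCT_PAGE_SIZE / PAGE_SIZE;
}

int main(void)
{
	unsigned long gig = 1024UL * 1024 * 1024;   /* 1G source page  */
	unsigned long tgt = 2UL * 1024 * 1024;      /* 2M target pages */
	unsigned long nr_target = gig / tgt;        /* 512 on x86      */

	/* Resident now: the optimized 1G page keeps only a few vmemmap pages. */
	unsigned long resident_before = VMEMMAP_KEPT_PER_HPAGE;
	/* Resident after: each optimized 2M page keeps its own reserved pages. */
	unsigned long resident_after = nr_target * VMEMMAP_KEPT_PER_HPAGE;

	printf("full (unoptimized) vmemmap for 1G: %lu pages\n",
	       full_vmemmap_pages(gig));
	printf("vmemmap pages to allocate before splitting: %lu\n",
	       resident_after - resident_before);
	return 0;
}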
On 09.03.21 18:11, Mike Kravetz wrote: > On 3/9/21 1:01 AM, David Hildenbrand wrote: >> On 09.03.21 01:18, Mike Kravetz wrote: >>> To address these issues, introduce the concept of hugetlb page demotion. >>> Demotion provides a means of 'in place' splitting a hugetlb page to >>> pages of a smaller size. For example, on x86 one 1G page can be >>> demoted to 512 2M pages. Page demotion is controlled via sysfs files. >>> - demote_size Read only target page size for demotion >>> - demote Writable number of hugetlb pages to be demoted >>> >>> Only hugetlb pages which are free at the time of the request can be demoted. >>> Demotion does not add to the complexity surplus pages. Demotion also honors >>> reserved huge pages. Therefore, when a value is written to the sysfs demote >>> file that value is only the maximum number of pages which will be demoted. >>> It is possible fewer will actually be demoted. >>> >>> If demote_size is PAGESIZE, demote will simply free pages to the buddy >>> allocator. >> >> With the vmemmap optimizations you will have to rework the vmemmap layout. How is that handled? Couldn't it happen that you are half-way through splitting a PUD into PMDs when you realize that you cannot allocate vmemmap pages for properly handling the remaining PMDs? What would happen then? >> >> Or are you planning on making both features mutually exclusive? >> >> Of course, one approach would be first completely restoring the vmemmap for the whole PUD (allocating more pages than necessary in the end) and then freeing individual pages again when optimizing the layout per PMD. >> > > You are right about the need to address this issue. Patch 3 has the > comment: > > + /* > + * Note for future: > + * When support for reducing vmemmap of huge pages is added, we > + * will need to allocate vmemmap pages here and could fail. > + */ > I only skimmed over the cover letter so far. :) > The simplest approach would be to restore the entire vmemmmap for the > larger page and then delete for smaller pages after the split. We could > hook into the existing vmemmmap reduction code with just a few calls. > This would fail to demote/split, if the allocation fails. However, this > is not optimal. > > Ideally, the code would compute how many pages for vmemmmap are needed > after the split, allocate those and then construct vmmemmap > appropriately when creating the smaller pages. > > I think we would want to always do the allocation of vmmemmap pages up > front and not even start the split process if the allocation fails. No > sense starting something we may not be able to finish. > Makes sense. Another case might also be interesting: Assume you allocated a gigantic page via CMA and demoted it to huge pages. Theoretically (after Oscar's series!), we could come back later and re-allocate a gigantic page via CMA, migrating all now-hugepages out of the CMA region. Would require telling CMA that that area is effectively no longer allocated via CMA (adjusting accounting, bitmaps, etc). That would actually be a neat use case to form new gigantic pages later on when necessary :) But I assume your primary use case is demoting gigantic pages allocated during boot, not via CMA. Maybe you addressed that already as well :) > I purposely did not address that here as first I wanted to get feedback > on the usefulness demote functionality. > Makes sense. I think there could be some value in having this functionality.
Gigantic pages are rare and we might want to keep them as long as possible (and as long as we have sufficient free memory). But once we need huge pages (e.g., smaller VMs, different granularity requirements), we could demote. If we ever have pre-zeroing of huge/gigantic pages, your approach could also avoid having to zero huge pages again when the gigantic page was already zeroed.
On 3/9/21 9:50 AM, David Hildenbrand wrote: > On 09.03.21 18:11, Mike Kravetz wrote: >> On 3/9/21 1:01 AM, David Hildenbrand wrote: >>> On 09.03.21 01:18, Mike Kravetz wrote: >>>> To address these issues, introduce the concept of hugetlb page demotion. >>>> Demotion provides a means of 'in place' splitting a hugetlb page to >>>> pages of a smaller size. For example, on x86 one 1G page can be >>>> demoted to 512 2M pages. Page demotion is controlled via sysfs files. >>>> - demote_size Read only target page size for demotion >>>> - demote Writable number of hugetlb pages to be demoted >>>> >>>> Only hugetlb pages which are free at the time of the request can be demoted. >>>> Demotion does not add to the complexity surplus pages. Demotion also honors >>>> reserved huge pages. Therefore, when a value is written to the sysfs demote >>>> file that value is only the maximum number of pages which will be demoted. >>>> It is possible fewer will actually be demoted. >>>> >>>> If demote_size is PAGESIZE, demote will simply free pages to the buddy >>>> allocator. >>> >>> With the vmemmap optimizations you will have to rework the vmemmap layout. How is that handled? Couldn't it happen that you are half-way through splitting a PUD into PMDs when you realize that you cannot allocate vmemmap pages for properly handling the remaining PMDs? What would happen then? >>> >>> Or are you planning on making both features mutually exclusive? >>> >>> Of course, one approach would be first completely restoring the vmemmap for the whole PUD (allocating more pages than necessary in the end) and then freeing individual pages again when optimizing the layout per PMD. >>> >> >> You are right about the need to address this issue. Patch 3 has the >> comment: >> >> + /* >> + * Note for future: >> + * When support for reducing vmemmap of huge pages is added, we >> + * will need to allocate vmemmap pages here and could fail. >> + */ >> > > I only skimmed over the cover letter so far. :) > >> The simplest approach would be to restore the entire vmemmmap for the >> larger page and then delete for smaller pages after the split. We could >> hook into the existing vmemmmap reduction code with just a few calls. >> This would fail to demote/split, if the allocation fails. However, this >> is not optimal. >> >> Ideally, the code would compute how many pages for vmemmmap are needed >> after the split, allocate those and then construct vmmemmap >> appropriately when creating the smaller pages. >> >> I think we would want to always do the allocation of vmmemmap pages up >> front and not even start the split process if the allocation fails. No >> sense starting something we may not be able to finish. >> > > Makes sense. > > Another case might also be interesting: Assume you allocated a gigantic page via CMA and denoted it to huge pages. Theoretically (after Oscar's series!), we could come back later and re-allocate a gigantic page via CMA, migrating all now-hugepages out of the CMA region. Would require telling CMA that that area is effectively no longer allocated via CMA (adjusting accounting, bitmaps, etc). > I need to take a close look at Oscar's patches. Too many thing to look at/review :) This series does take into account gigantic pages allocated in CMA. Such pages can be demoted, and we need to track that they need to go back to CMA. Nothing super special for this, mostly a new hugetlb specific flag to track such pages. 
> That would actually be a neat use case to form new gigantic pages later on when necessary :) > > But I assume your primary use case is denoting gigantic pages allocated during boot, not via CMA. > > Maybe you addresses that already as well :) Yup. > >> I purposely did not address that here as first I wanted to get feedback >> on the usefulness demote functionality. >> > > Makes sense. I think there could be some value in having this functionality. Gigantic pages are rare and we might want to keep them as long as possible (and as long as we have sufficient free memory). But once we need huge pages (e.g., smaller VMs, different granularity requiremets), we could denote. > That is exactly the use case of one of our product groups. Hoping to get feedback from others who might be doing something similar. > If we ever have pre-zeroing of huge/gigantic pages, your approach could also avoid having to zero huge pages again when the gigantic page was already zeroed. > Having a user option to pre-zero hugetlb pages is also on my 'to do' list. We now have hugetlb-specific page flags to help with tracking.
> I need to take a close look at Oscar's patches. Too many thing to look > at/review :) > > This series does take into account gigantic pages allocated in CMA. > Such pages can be demoted, and we need to track that they need to go > back to CMA. Nothing super special for this, mostly a new hugetlb > specific flag to track such pages. Ah, just spotted it - patch #2 :) Took me a while to figure out that we end up calling cma_declare_contiguous_nid() with order_per_bit=0 - would have thought we would be using the actual smallest allocation order we end up using for huge/gigantic pages via CMA. Well, this way it "simply works".
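For readers who have not looked at the CMA internals David refers to: order_per_bit sets the granularity of the CMA bitmap, with one bit covering 2^order_per_bit base pages. The toy model below (userspace only, not kernel code) illustrates why order_per_bit=0 "simply works" for demotion: with page-granularity bits, a region handed out as one gigantic page can later be returned to CMA piecemeal in 2M chunks, which a gigantic-order granularity could not express.

/*
 * Toy model of the CMA bitmap granularity selected by order_per_bit: one
 * bitmap bit covers (1 << order_per_bit) base pages.  Userspace-only
 * illustration; nothing here is taken from the kernel sources.
 */
#include <stdbool.h>
#include <stdio.h>

/* Can a range of nr_pages starting at pfn_offset be tracked exactly? */
static bool range_representable(unsigned long pfn_offset,
				unsigned long nr_pages,
				unsigned int order_per_bit)
{
	unsigned long granule = 1UL << order_per_bit;

	return (pfn_offset % granule == 0) && (nr_pages % granule == 0);
}

int main(void)
{
	unsigned long two_mb_pages = 512;      /* 2M in 4K pages */
	unsigned long one_gb_pages = 262144;   /* 1G in 4K pages */

	/* Freeing one 2M chunk out of a 1G CMA allocation: */
	printf("order_per_bit=0  -> %s\n",
	       range_representable(0, two_mb_pages, 0) ? "ok" : "not representable");
	printf("order_per_bit=18 -> %s\n",
	       range_representable(0, two_mb_pages, 18) ? "ok" : "not representable");
	/* The whole gigantic page is representable either way. */
	printf("whole 1G, order_per_bit=18 -> %s\n",
	       range_representable(0, one_gb_pages, 18) ? "ok" : "not representable");
	return 0;
}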
On Mon, Mar 08, 2021 at 04:18:52PM -0800, Mike Kravetz wrote: > The concurrent use of multiple hugetlb page sizes on a single system > is becoming more common. One of the reasons is better TLB support for > gigantic page sizes on x86 hardware. In addition, hugetlb pages are > being used to back VMs in hosting environments. > > When using hugetlb pages to back VMs in such environments, it is > sometimes desirable to preallocate hugetlb pools. This avoids the delay > and uncertainty of allocating hugetlb pages at VM startup. In addition, > preallocating huge pages minimizes the issue of memory fragmentation that > increases the longer the system is up and running. > > In such environments, a combination of larger and smaller hugetlb pages > are preallocated in anticipation of backing VMs of various sizes. Over > time, the preallocated pool of smaller hugetlb pages may become > depleted while larger hugetlb pages still remain. In such situations, > it may be desirable to convert larger hugetlb pages to smaller hugetlb > pages. Hi Mike, The usecase sounds neat. > > Converting larger to smaller hugetlb pages can be accomplished today by > first freeing the larger page to the buddy allocator and then allocating > the smaller pages. However, there are two issues with this approach: > 1) This process can take quite some time, especially if allocation of > the smaller pages is not immediate and requires migration/compaction. > 2) There is no guarantee that the total size of smaller pages allocated > will match the size of the larger page which was freed. This is > because the area freed by the larger page could quickly be > fragmented. > > To address these issues, introduce the concept of hugetlb page demotion. > Demotion provides a means of 'in place' splitting a hugetlb page to > pages of a smaller size. For example, on x86 one 1G page can be > demoted to 512 2M pages. Page demotion is controlled via sysfs files. > - demote_size Read only target page size for demotion What about those archs where we have more than two hugetlb sizes? IIRC, in powerpc you can have that, right? If so, would it make sense for demote_size to be writable so you can pick the size? > - demote Writable number of hugetlb pages to be demoted Below you mention that due to reservation, the amount of demoted pages can be less than what the admin specified. Would it make sense to have a place where someone can check how many pages got actually demoted? Or will this follow nr_hugepages' scheme and will always reflect the number of current demoted pages? > Only hugetlb pages which are free at the time of the request can be demoted. > Demotion does not add to the complexity surplus pages. Demotion also honors > reserved huge pages. Therefore, when a value is written to the sysfs demote > file that value is only the maximum number of pages which will be demoted. > It is possible fewer will actually be demoted. > > If demote_size is PAGESIZE, demote will simply free pages to the buddy > allocator. Wrt. vmemmap discussion with David. I also think we could compute how many vmemmap pages we are going to need to re-shape the vmemmap layout and allocate those upfront. And I think this approach would be just more simple. I plan to have a look at the patches later today or tomorrow. Thanks
On Mon 08-03-21 16:18:52, Mike Kravetz wrote: [...] > Converting larger to smaller hugetlb pages can be accomplished today by > first freeing the larger page to the buddy allocator and then allocating > the smaller pages. However, there are two issues with this approach: > 1) This process can take quite some time, especially if allocation of > the smaller pages is not immediate and requires migration/compaction. > 2) There is no guarantee that the total size of smaller pages allocated > will match the size of the larger page which was freed. This is > because the area freed by the larger page could quickly be > fragmented. It will likely not surprise anyone that I have some level of reservation. While your concerns about reconfiguration via the existing interfaces are quite real, is this really a problem in practice? How often do you need such a reconfiguration? Is this all really worth the additional code in something as tricky as the hugetlb code base? > include/linux/hugetlb.h | 8 ++ > mm/hugetlb.c | 199 +++++++++++++++++++++++++++++++++++++++- > 2 files changed, 204 insertions(+), 3 deletions(-) > > -- > 2.29.2 >
On 10 Mar 2021, at 11:23, Michal Hocko wrote: > On Mon 08-03-21 16:18:52, Mike Kravetz wrote: > [...] >> Converting larger to smaller hugetlb pages can be accomplished today by >> first freeing the larger page to the buddy allocator and then allocating >> the smaller pages. However, there are two issues with this approach: >> 1) This process can take quite some time, especially if allocation of >> the smaller pages is not immediate and requires migration/compaction. >> 2) There is no guarantee that the total size of smaller pages allocated >> will match the size of the larger page which was freed. This is >> because the area freed by the larger page could quickly be >> fragmented. > > I will likely not surprise to show some level of reservation. While your > concerns about reconfiguration by existing interfaces are quite real is > this really a problem in practice? How often do you need such a > reconfiguration? > > Is this all really worth the additional code to something as tricky as > hugetlb code base? > >> include/linux/hugetlb.h | 8 ++ >> mm/hugetlb.c | 199 +++++++++++++++++++++++++++++++++++++++- >> 2 files changed, 204 insertions(+), 3 deletions(-) >> >> -- >> 2.29.2 >> The high level goal of this patchset seems to enable flexible huge page allocation from a single pool, when multiple huge page sizes are available to use. The limitation of existing mechanism is that user has to specify how many huge pages he/she wants and how many gigantic pages he/she wants before the actual use. I just want to throw an idea here, please ignore if it is too crazy. Could we have a variant buddy allocator for huge page allocations, which only has available huge page orders in the free list? For example, if user wants 2MB and 1GB pages, the allocator will only have order-9 and order-19 pages; when order-9 pages run out, we can split order-19 pages; if possible, adjacent order-9 pages can be merged back to order-19 pages. — Best Regards, Yan Zi
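A toy, userspace-only model of the split half of Zi Yan's idea follows; all names and the data structure are illustrative and do not come from any posted patch. The pool keeps free counts only for the hugetlb sizes in use and refills the smaller size by splitting one larger block on demand; merging adjacent small blocks back into a large one (the other half of the idea) is left out.

/*
 * Toy sketch of a pool that only tracks the hugetlb orders in use and splits
 * a larger free block when the smaller size runs out.  Illustrative only.
 */
#include <stdio.h>

#define SMALL_SIZE  (2UL * 1024 * 1024)         /* 2M */
#define LARGE_SIZE  (1024UL * 1024 * 1024)      /* 1G */

struct hugepool {
	unsigned long nr_free_small;   /* free 2M blocks */
	unsigned long nr_free_large;   /* free 1G blocks */
};

/* Take one free large block and turn it into LARGE_SIZE/SMALL_SIZE small ones. */
static int split_one_large(struct hugepool *p)
{
	if (!p->nr_free_large)
		return -1;                             /* nothing left to split */
	p->nr_free_large--;
	p->nr_free_small += LARGE_SIZE / SMALL_SIZE;   /* 512 on x86 */
	return 0;
}

/* Allocate one small huge page, splitting a large one on demand. */
static int alloc_small(struct hugepool *p)
{
	if (!p->nr_free_small && split_one_large(p))
		return -1;
	p->nr_free_small--;
	return 0;
}

int main(void)
{
	struct hugepool pool = { .nr_free_small = 0, .nr_free_large = 2 };

	if (alloc_small(&pool) == 0)
		printf("got a 2M page; free: %lu x 2M, %lu x 1G\n",
		       pool.nr_free_small, pool.nr_free_large);
	return 0;
}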
On Wed 10-03-21 11:46:57, Zi Yan wrote: > On 10 Mar 2021, at 11:23, Michal Hocko wrote: > > > On Mon 08-03-21 16:18:52, Mike Kravetz wrote: > > [...] > >> Converting larger to smaller hugetlb pages can be accomplished today by > >> first freeing the larger page to the buddy allocator and then allocating > >> the smaller pages. However, there are two issues with this approach: > >> 1) This process can take quite some time, especially if allocation of > >> the smaller pages is not immediate and requires migration/compaction. > >> 2) There is no guarantee that the total size of smaller pages allocated > >> will match the size of the larger page which was freed. This is > >> because the area freed by the larger page could quickly be > >> fragmented. > > > > I will likely not surprise to show some level of reservation. While your > > concerns about reconfiguration by existing interfaces are quite real is > > this really a problem in practice? How often do you need such a > > reconfiguration? > > > > Is this all really worth the additional code to something as tricky as > > hugetlb code base? > > > >> include/linux/hugetlb.h | 8 ++ > >> mm/hugetlb.c | 199 +++++++++++++++++++++++++++++++++++++++- > >> 2 files changed, 204 insertions(+), 3 deletions(-) > >> > >> -- > >> 2.29.2 > >> > > The high level goal of this patchset seems to enable flexible huge page > allocation from a single pool, when multiple huge page sizes are available > to use. The limitation of existing mechanism is that user has to specify > how many huge pages he/she wants and how many gigantic pages he/she wants > before the actual use. I believe I have understood this part. And I am not questioning that. This seems useful. I am mostly asking whether we need such a flexibility. Mostly because of the additional code and future maintenance complexity which has turned to be a problem for a long time. Each new feature tends to just add on top of the existing complexity. > I just want to throw an idea here, please ignore if it is too crazy. > Could we have a variant buddy allocator for huge page allocations, > which only has available huge page orders in the free list? For example, > if user wants 2MB and 1GB pages, the allocator will only have order-9 and > order-19 pages; when order-9 pages run out, we can split order-19 pages; > if possible, adjacent order-9 pages can be merged back to order-19 pages. I assume you mean to remove those pages from the allocator when they are reserved rather than really used, right? I am not really sure how you want to deal with lower orders consuming/splitting too much from higher orders which then makes those unusable for the use even though they were preallocated for a specific workload. Another worry is that a gap between 2MB and 1GB pages is just too big so a single 2MB request from 1G pool will make the whole 1GB page unusable even when the smaller pool needs few pages.
On 10 Mar 2021, at 12:05, Michal Hocko wrote: > On Wed 10-03-21 11:46:57, Zi Yan wrote: >> On 10 Mar 2021, at 11:23, Michal Hocko wrote: >> >>> On Mon 08-03-21 16:18:52, Mike Kravetz wrote: >>> [...] >>>> Converting larger to smaller hugetlb pages can be accomplished today by >>>> first freeing the larger page to the buddy allocator and then allocating >>>> the smaller pages. However, there are two issues with this approach: >>>> 1) This process can take quite some time, especially if allocation of >>>> the smaller pages is not immediate and requires migration/compaction. >>>> 2) There is no guarantee that the total size of smaller pages allocated >>>> will match the size of the larger page which was freed. This is >>>> because the area freed by the larger page could quickly be >>>> fragmented. >>> >>> I will likely not surprise to show some level of reservation. While your >>> concerns about reconfiguration by existing interfaces are quite real is >>> this really a problem in practice? How often do you need such a >>> reconfiguration? >>> >>> Is this all really worth the additional code to something as tricky as >>> hugetlb code base? >>> >>>> include/linux/hugetlb.h | 8 ++ >>>> mm/hugetlb.c | 199 +++++++++++++++++++++++++++++++++++++++- >>>> 2 files changed, 204 insertions(+), 3 deletions(-) >>>> >>>> -- >>>> 2.29.2 >>>> >> >> The high level goal of this patchset seems to enable flexible huge page >> allocation from a single pool, when multiple huge page sizes are available >> to use. The limitation of existing mechanism is that user has to specify >> how many huge pages he/she wants and how many gigantic pages he/she wants >> before the actual use. > > I believe I have understood this part. And I am not questioning that. > This seems useful. I am mostly asking whether we need such a > flexibility. Mostly because of the additional code and future > maintenance complexity which has turned to be a problem for a long time. > Each new feature tends to just add on top of the existing complexity. I totally agree. This patchset looks to me like a partial functional replication of splitting high order free pages to lower order ones in buddy allocator. That is why I had the crazy idea below. > >> I just want to throw an idea here, please ignore if it is too crazy. >> Could we have a variant buddy allocator for huge page allocations, >> which only has available huge page orders in the free list? For example, >> if user wants 2MB and 1GB pages, the allocator will only have order-9 and >> order-19 pages; when order-9 pages run out, we can split order-19 pages; >> if possible, adjacent order-9 pages can be merged back to order-19 pages. > > I assume you mean to remove those pages from the allocator when they > are reserved rather than really used, right? I am not really sure how No. The allocator maintains all the reserved pages for huge page allocations, replacing existing cma_alloc or alloc_contig_pages. The kernel builds the free list when pages are reserved either during boot time or runtime. > you want to deal with lower orders consuming/splitting too much from > higher orders which then makes those unusable for the use even though > they were preallocated for a specific workload. Another worry is that a > gap between 2MB and 1GB pages is just too big so a single 2MB request > from 1G pool will make the whole 1GB page unusable even when the smaller > pool needs few pages. Yeah, the gap between 2MB and 1GB is large. The fragmentation will be a problem. 
Maybe we do not need it right now, since this patchset does not propose promoting/merging pages. Or we can reuse the existing anti-fragmentation mechanisms but with the pageblock size set to the gigantic page size in this pool. I admit my idea is a much more intrusive change, but I feel that as more core mm functionality gets replicated in the hugetlb code, why not reuse the core mm code instead. — Best Regards, Yan Zi
On 3/10/21 8:23 AM, Michal Hocko wrote: > On Mon 08-03-21 16:18:52, Mike Kravetz wrote: > [...] >> Converting larger to smaller hugetlb pages can be accomplished today by >> first freeing the larger page to the buddy allocator and then allocating >> the smaller pages. However, there are two issues with this approach: >> 1) This process can take quite some time, especially if allocation of >> the smaller pages is not immediate and requires migration/compaction. >> 2) There is no guarantee that the total size of smaller pages allocated >> will match the size of the larger page which was freed. This is >> because the area freed by the larger page could quickly be >> fragmented. > > I will likely not surprise to show some level of reservation. While your > concerns about reconfiguration by existing interfaces are quite real is > this really a problem in practice? How often do you need such a > reconfiguration? In reply to one of David's comments, I mentioned that we have a product group with this use case today. They use hugetlb pages to back VMs, and preallocate a 'best guess' number of pages of each order. They can only guess how many pages of each order are needed because they are responding to dynamic requests for new VMs. When they find themselves in this situation today, they free 1G pages to the buddy allocator and try to allocate the corresponding number of 2M pages. The concerns above were mentioned/experienced by this group. Part of the reason for the RFC was to see if others might have similar use cases. With newer x86 processors, I hear about more people using 1G hugetlb pages. I also hear about people using hugetlb pages to back VMs. So I was thinking others may have similar use cases. > Is this all really worth the additional code to something as tricky as > hugetlb code base? > The 'good news' is that this does not involve much tricky code. It only demotes free hugetlb pages. Of course, it is only worth it if the new code is actually going to be used. I know of at least one use case.
On 3/10/21 8:46 AM, Zi Yan wrote: > The high level goal of this patchset seems to enable flexible huge page > allocation from a single pool, when multiple huge page sizes are available > to use. The limitation of existing mechanism is that user has to specify > how many huge pages he/she wants and how many gigantic pages he/she wants > before the actual use. > > I just want to throw an idea here, please ignore if it is too crazy. > Could we have a variant buddy allocator for huge page allocations, > which only has available huge page orders in the free list? For example, > if user wants 2MB and 1GB pages, the allocator will only have order-9 and > order-19 pages; when order-9 pages run out, we can split order-19 pages; > if possible, adjacent order-9 pages can be merged back to order-19 pages. The idea is not crazy, but I think it is more functionality than we want to throw at hugetlb. IIRC, the default qemu huge page configuration uses THP. Ideally, we would have support for 1G THP pages and the user would not need to think about any of this. The kernel would back the VM with huge pages of the appropriate size for best performance. That may sound crazy, but I think it may be the looooong term goal.