Message ID: 20210816224953.157796-1-mike.kravetz@oracle.com (mailing list archive)
Series: hugetlb: add demote/split page functionality
On Mon, 16 Aug 2021 15:49:45 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote: > This is a resend of PATCHes sent here [4]. There was some discussion > and interest when the RFC [5] was sent, but little after that. The > resend is just a rebase of [4] to next-20210816 with a few typos in > commmit messages fixed. > > Original Cover Letter > --------------------- > The concurrent use of multiple hugetlb page sizes on a single system > is becoming more common. One of the reasons is better TLB support for > gigantic page sizes on x86 hardware. In addition, hugetlb pages are > being used to back VMs in hosting environments. > > When using hugetlb pages to back VMs in such environments, it is > sometimes desirable to preallocate hugetlb pools. This avoids the delay > and uncertainty of allocating hugetlb pages at VM startup. In addition, > preallocating huge pages minimizes the issue of memory fragmentation that > increases the longer the system is up and running. > > In such environments, a combination of larger and smaller hugetlb pages > are preallocated in anticipation of backing VMs of various sizes. Over > time, the preallocated pool of smaller hugetlb pages may become > depleted while larger hugetlb pages still remain. In such situations, > it may be desirable to convert larger hugetlb pages to smaller hugetlb > pages. > > Converting larger to smaller hugetlb pages can be accomplished today by > first freeing the larger page to the buddy allocator and then allocating > the smaller pages. However, there are two issues with this approach: > 1) This process can take quite some time, especially if allocation of > the smaller pages is not immediate and requires migration/compaction. > 2) There is no guarantee that the total size of smaller pages allocated > will match the size of the larger page which was freed. This is > because the area freed by the larger page could quickly be > fragmented. > > To address these issues, introduce the concept of hugetlb page demotion. > Demotion provides a means of 'in place' splitting a hugetlb page to > pages of a smaller size. For example, on x86 one 1G page can be > demoted to 512 2M pages. Page demotion is controlled via sysfs files. > - demote_size Read only target page size for demotion Should this be "write only"? If not, I'm confused. If "yes" then "write only" would be a misnomer - clearly this file is readable (looks at demote_size_show()). > - demote Writable number of hugetlb pages to be demoted So how does this interface work? Write the target size to `demote_size', write the number of to-be-demoted larger pages to `demote' and then the operation happens? If so, how does one select which size pages should be selected for the demotion? And how does one know the operation has completed so the sysfs files can be reloaded for another operation? > Only hugetlb pages which are free at the time of the request can be demoted. > Demotion does not add to the complexity surplus pages. Demotion also honors > reserved huge pages. Therefore, when a value is written to the sysfs demote > file, that value is only the maximum number of pages which will be demoted. > It is possible fewer will actually be demoted. > > If demote_size is PAGESIZE, demote will simply free pages to the buddy > allocator. > > Real world use cases > -------------------- > There are groups today using hugetlb pages to back VMs on x86. Their > use case is as described above. 
They have experienced the issues with > performance and not necessarily getting the excepted number smaller huge ("expected") > pages after free/allocate cycle. > It seems odd to add the interfaces in patch 1 then document them in patch 5. Why not add-and-document in a single patch?
Also, pushback... On Mon, 16 Aug 2021 15:49:45 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote: > > Real world use cases > -------------------- > There are groups today using hugetlb pages to back VMs on x86. Their > use case is as described above. They have experienced the issues with > performance and not necessarily getting the excepted number smaller huge ("number of") > pages after free/allocate cycle. > It really is a ton of new code. I think we're owed much more detail about the problem than the above. To be confident that all this material is truly justified? Also, some selftests and benchmarking code in tools/testing/selftests would be helpful?
On 8/16/21 4:23 PM, Andrew Morton wrote: > On Mon, 16 Aug 2021 15:49:45 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote: > >> This is a resend of PATCHes sent here [4]. There was some discussion >> and interest when the RFC [5] was sent, but little after that. The >> resend is just a rebase of [4] to next-20210816 with a few typos in >> commmit messages fixed. >> >> Original Cover Letter >> --------------------- >> The concurrent use of multiple hugetlb page sizes on a single system >> is becoming more common. One of the reasons is better TLB support for >> gigantic page sizes on x86 hardware. In addition, hugetlb pages are >> being used to back VMs in hosting environments. >> >> When using hugetlb pages to back VMs in such environments, it is >> sometimes desirable to preallocate hugetlb pools. This avoids the delay >> and uncertainty of allocating hugetlb pages at VM startup. In addition, >> preallocating huge pages minimizes the issue of memory fragmentation that >> increases the longer the system is up and running. >> >> In such environments, a combination of larger and smaller hugetlb pages >> are preallocated in anticipation of backing VMs of various sizes. Over >> time, the preallocated pool of smaller hugetlb pages may become >> depleted while larger hugetlb pages still remain. In such situations, >> it may be desirable to convert larger hugetlb pages to smaller hugetlb >> pages. >> >> Converting larger to smaller hugetlb pages can be accomplished today by >> first freeing the larger page to the buddy allocator and then allocating >> the smaller pages. However, there are two issues with this approach: >> 1) This process can take quite some time, especially if allocation of >> the smaller pages is not immediate and requires migration/compaction. >> 2) There is no guarantee that the total size of smaller pages allocated >> will match the size of the larger page which was freed. This is >> because the area freed by the larger page could quickly be >> fragmented. >> >> To address these issues, introduce the concept of hugetlb page demotion. >> Demotion provides a means of 'in place' splitting a hugetlb page to >> pages of a smaller size. For example, on x86 one 1G page can be >> demoted to 512 2M pages. Page demotion is controlled via sysfs files. >> - demote_size Read only target page size for demotion > > Should this be "write only"? If not, I'm confused. > > If "yes" then "write only" would be a misnomer - clearly this file is > readable (looks at demote_size_show()). > It is read only and is there mostly as information for the user. When they demote a page, this is the size to which the page will be demoted. For example, # pwd /sys/kernel/mm/hugepages/hugepages-1048576kB # cat demote_size 2048kB # pwd /sys/kernel/mm/hugepages/hugepages-2048kB # cat demote_size 4kB The "demote size" is not user configurable. Although, that is something brought up by Oscar previously. I did not directly address this in the RFC. My bad. However, I do not like the idea of making demote_size writable/selectable. My concern would be someone changing the value and not resetting. It certainly is something that can be done with minor code changes. >> - demote Writable number of hugetlb pages to be demoted > > So how does this interface work? Write the target size to > `demote_size', write the number of to-be-demoted larger pages to > `demote' and then the operation happens? > > If so, how does one select which size pages should be selected for > the demotion? 
The location in the sysfs directory tells you what size pages will be
demoted.  For example,

echo 5 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote

says to demote 5 1GB pages.

demote files are also in node specific directories so you can even pick
huge pages from a specific node.

echo 5 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/demote

>
> And how does one know the operation has completed so the sysfs files
> can be reloaded for another operation?
>

When the write to the file is complete, the operation has completed.
Not exactly sure what you mean by reloading the sysfs files for
another operation?

>> Only hugetlb pages which are free at the time of the request can be demoted.
>> Demotion does not add to the complexity surplus pages. Demotion also honors
>> reserved huge pages. Therefore, when a value is written to the sysfs demote
>> file, that value is only the maximum number of pages which will be demoted.
>> It is possible fewer will actually be demoted.
>>
>> If demote_size is PAGESIZE, demote will simply free pages to the buddy
>> allocator.
>>
>> Real world use cases
>> --------------------
>> There are groups today using hugetlb pages to back VMs on x86. Their
>> use case is as described above. They have experienced the issues with
>> performance and not necessarily getting the excepted number smaller huge
>
> ("expected")

yes, will fix typo

>
>> pages after free/allocate cycle.
>>
>
> It seems odd to add the interfaces in patch 1 then document them in
> patch 5. Why not add-and-document in a single patch?
>

Yes, makes sense.  Will combine these.
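For reference, the same operation can be driven from a program rather than
the shell.  The sketch below is only illustrative: it assumes the sysfs
layout and semantics described above (a per-size hugepages-<size>kB
directory with a writable demote file whose write returns once the attempt
finishes), and demote_pages() is a hypothetical helper, not part of the
series.

/*
 * Hypothetical helper for the interface proposed in this series: ask the
 * kernel to demote up to 'count' free huge pages of the given size.  The
 * sysfs path and semantics follow the description in this thread; this is
 * not a merged or stable ABI.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int demote_pages(const char *size_kb, unsigned long count)
{
	char path[256];
	FILE *f;
	int ret = 0;

	snprintf(path, sizeof(path),
		 "/sys/kernel/mm/hugepages/hugepages-%skB/demote", size_kb);

	f = fopen(path, "w");
	if (!f) {
		fprintf(stderr, "open %s: %s\n", path, strerror(errno));
		return -1;
	}
	/*
	 * The value written is only a maximum; per the thread, the write
	 * returns once the demote attempt has finished.
	 */
	if (fprintf(f, "%lu\n", count) < 0)
		ret = -1;
	if (fclose(f) != 0)
		ret = -1;
	return ret;
}

int main(void)
{
	/* Demote up to five free 1GB pages into 2MB pages (x86 example). */
	return demote_pages("1048576", 5) ? EXIT_FAILURE : EXIT_SUCCESS;
}

Because only free pages are demoted and the written value is just an upper
bound, a caller would still re-read nr_hugepages for both sizes to see how
many pages actually moved.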
On Mon, 16 Aug 2021 17:17:50 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > And how does one know the operation has completed so the sysfs files > > can be reloaded for another operation? > > > > When the write to the file is complete, the operation has completed. > Not exactly sure what you mean by reloading the sysfs files for > another operation? If userspace wishes to perform another demote operation, it must wait for the preceding one to complete. Presumably if thread A is blocked in a write to `demote' and thread B concurrently tries to perform a demotion, thread B will be blocked as well? Was this tested? Lots of things are to be added to changelogs & documentation, methinks.
On 8/16/21 4:27 PM, Andrew Morton wrote: > Also, pushback... That is welcome. I only have the one specific use case mentioned here. > > On Mon, 16 Aug 2021 15:49:45 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote: > >> >> Real world use cases >> -------------------- >> There are groups today using hugetlb pages to back VMs on x86. Their >> use case is as described above. They have experienced the issues with >> performance and not necessarily getting the excepted number smaller huge > > ("number of") thanks, another typo to fix. > >> pages after free/allocate cycle. >> > > It really is a ton of new code. I think we're owed much more detail > about the problem than the above. To be confident that all this > material is truly justified? The desired functionality for this specific use case is to simply convert a 1G huegtlb page to 512 2MB hugetlb pages. As mentioned "Converting larger to smaller hugetlb pages can be accomplished today by first freeing the larger page to the buddy allocator and then allocating the smaller pages. However, there are two issues with this approach: 1) This process can take quite some time, especially if allocation of the smaller pages is not immediate and requires migration/compaction. 2) There is no guarantee that the total size of smaller pages allocated will match the size of the larger page which was freed. This is because the area freed by the larger page could quickly be fragmented." These two issues have been experienced in practice. A big chunk of the code changes (aprox 50%) is for the vmemmap optimizations. This is also the most complex part of the changes. I added the code as interaction with vmemmap reduction was discussed during the RFC. It is only a performance enhancement and honestly may not be worth the cost/risk. I will get some numbers to measure the actual benefit. > > Also, some selftests and benchmarking code in tools/testing/selftests > would be helpful? > Will do.
On 8/16/21 5:39 PM, Andrew Morton wrote: > On Mon, 16 Aug 2021 17:17:50 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote: > >>> >>> And how does one know the operation has completed so the sysfs files >>> can be reloaded for another operation? >>> >> >> When the write to the file is complete, the operation has completed. >> Not exactly sure what you mean by reloading the sysfs files for >> another operation? > > If userspace wishes to perform another demote operation, it must wait > for the preceding one to complete. > > Presumably if thread A is blocked in a write to `demote' and thread B > concurrently tries to perform a demotion, thread B will be blocked as > well? Was this tested? I must admit that I did not specifically test this. However, the patch series to make freeing of hugetlb pages IRQ safe added a (per-hstate)mutex that is taken when sysfs files modify the number of huge pages. demote writes take this mutex (patch 1). That synchronizes not only concurrent writes to demote but also concurrent modifications of nr_hugepages. > > Lots of things are to be added to changelogs & documentation, methinks. > OK. Thank you for taking a look at this!
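Andrew's "was this tested?" question could be answered empirically with a
small userspace check along the lines of the sketch below.  It is
hypothetical test code against the proposed demote file (the 1GB path is
assumed from the examples earlier in the thread), not something taken from
the series: two threads write concurrently, and if the per-hstate mutex
serializes them as described, one write() should block for roughly the
duration of the other demote operation.

/*
 * Hypothetical concurrency check: do two concurrent writers to the
 * proposed demote file serialize on the per-hstate mutex?
 * Build with: cc -pthread.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define DEMOTE_PATH "/sys/kernel/mm/hugepages/hugepages-1048576kB/demote"

static void *demote_writer(void *arg)
{
	long id = (long)arg;
	const char buf[] = "8\n";	/* demote up to 8 free 1GB pages */
	struct timespec t0, t1;
	int fd = open(DEMOTE_PATH, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return NULL;
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (write(fd, buf, strlen(buf)) < 0)
		perror("write");
	clock_gettime(CLOCK_MONOTONIC, &t1);
	close(fd);

	printf("writer %ld blocked for %.3f ms\n", id,
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	/* Both threads ask for a demote at the same time. */
	pthread_create(&a, NULL, demote_writer, (void *)0L);
	pthread_create(&b, NULL, demote_writer, (void *)1L);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}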
On Mon, 16 Aug 2021 17:46:58 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote: > > It really is a ton of new code. I think we're owed much more detail > > about the problem than the above. To be confident that all this > > material is truly justified? > > The desired functionality for this specific use case is to simply > convert a 1G huegtlb page to 512 2MB hugetlb pages. As mentioned > > "Converting larger to smaller hugetlb pages can be accomplished today by > first freeing the larger page to the buddy allocator and then allocating > the smaller pages. However, there are two issues with this approach: > 1) This process can take quite some time, especially if allocation of > the smaller pages is not immediate and requires migration/compaction. > 2) There is no guarantee that the total size of smaller pages allocated > will match the size of the larger page which was freed. This is > because the area freed by the larger page could quickly be > fragmented." > > These two issues have been experienced in practice. Well the first issue is quantifiable. What is "some time"? If it's people trying to get a 5% speedup on a rare operation because hey, bugging the kernel developers doesn't cost me anything then perhaps we have better things to be doing. And the second problem would benefit from some words to help us understand how much real-world hurt this causes, and how frequently. And let's understand what the userspace workarounds look like, etc. > A big chunk of the code changes (aprox 50%) is for the vmemmap > optimizations. This is also the most complex part of the changes. > I added the code as interaction with vmemmap reduction was discussed > during the RFC. It is only a performance enhancement and honestly > may not be worth the cost/risk. I will get some numbers to measure > the actual benefit. Cool.
On 17.08.21 03:46, Andrew Morton wrote: > On Mon, 16 Aug 2021 17:46:58 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote: > >>> It really is a ton of new code. I think we're owed much more detail >>> about the problem than the above. To be confident that all this >>> material is truly justified? >> >> The desired functionality for this specific use case is to simply >> convert a 1G huegtlb page to 512 2MB hugetlb pages. As mentioned >> >> "Converting larger to smaller hugetlb pages can be accomplished today by >> first freeing the larger page to the buddy allocator and then allocating >> the smaller pages. However, there are two issues with this approach: >> 1) This process can take quite some time, especially if allocation of >> the smaller pages is not immediate and requires migration/compaction. >> 2) There is no guarantee that the total size of smaller pages allocated >> will match the size of the larger page which was freed. This is >> because the area freed by the larger page could quickly be >> fragmented." >> >> These two issues have been experienced in practice. > > Well the first issue is quantifiable. What is "some time"? If it's > people trying to get a 5% speedup on a rare operation because hey, > bugging the kernel developers doesn't cost me anything then perhaps we > have better things to be doing. > > And the second problem would benefit from some words to help us > understand how much real-world hurt this causes, and how frequently. > And let's understand what the userspace workarounds look like, etc. > >> A big chunk of the code changes (aprox 50%) is for the vmemmap >> optimizations. This is also the most complex part of the changes. >> I added the code as interaction with vmemmap reduction was discussed >> during the RFC. It is only a performance enhancement and honestly >> may not be worth the cost/risk. I will get some numbers to measure >> the actual benefit. If it really makes that much of a difference code/complexity wise, would it make sense to just limit denote functionality to the !vmemmap case for now?
On 8/17/21 12:30 AM, David Hildenbrand wrote: > On 17.08.21 03:46, Andrew Morton wrote: >> On Mon, 16 Aug 2021 17:46:58 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote: >> >>>> It really is a ton of new code. I think we're owed much more detail >>>> about the problem than the above. To be confident that all this >>>> material is truly justified? >>> >>> The desired functionality for this specific use case is to simply >>> convert a 1G huegtlb page to 512 2MB hugetlb pages. As mentioned >>> >>> "Converting larger to smaller hugetlb pages can be accomplished today by >>> first freeing the larger page to the buddy allocator and then allocating >>> the smaller pages. However, there are two issues with this approach: >>> 1) This process can take quite some time, especially if allocation of >>> the smaller pages is not immediate and requires migration/compaction. >>> 2) There is no guarantee that the total size of smaller pages allocated >>> will match the size of the larger page which was freed. This is >>> because the area freed by the larger page could quickly be >>> fragmented." >>> >>> These two issues have been experienced in practice. >> >> Well the first issue is quantifiable. What is "some time"? If it's >> people trying to get a 5% speedup on a rare operation because hey, >> bugging the kernel developers doesn't cost me anything then perhaps we >> have better things to be doing. >> >> And the second problem would benefit from some words to help us >> understand how much real-world hurt this causes, and how frequently. >> And let's understand what the userspace workarounds look like, etc. >> >>> A big chunk of the code changes (aprox 50%) is for the vmemmap >>> optimizations. This is also the most complex part of the changes. >>> I added the code as interaction with vmemmap reduction was discussed >>> during the RFC. It is only a performance enhancement and honestly >>> may not be worth the cost/risk. I will get some numbers to measure >>> the actual benefit. > > If it really makes that much of a difference code/complexity wise, would it make sense to just limit denote functionality to the !vmemmap case for now? > Handling vmemmap optimized huge pages is not that big of a deal. We just use the existing functionality to populate vmemmap for the page being demoted, and free vmemmap for resulting pages of demoted size. This obviously is not 'optimal' for demote as we will allocate more vmemmap pages than needed and then free the excess pages. The complex part is not over allocating vmemmap and only sparsely populating vmemmap for the target pages of demote size. This is all done in patches 6-8. I am happy to drop these patches for now. The are the most complex (and ugly) of this series. As mentioned, they do not provide any additional functionality.
On 17.08.21 18:19, Mike Kravetz wrote: > On 8/17/21 12:30 AM, David Hildenbrand wrote: >> On 17.08.21 03:46, Andrew Morton wrote: >>> On Mon, 16 Aug 2021 17:46:58 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote: >>> >>>>> It really is a ton of new code. I think we're owed much more detail >>>>> about the problem than the above. To be confident that all this >>>>> material is truly justified? >>>> >>>> The desired functionality for this specific use case is to simply >>>> convert a 1G huegtlb page to 512 2MB hugetlb pages. As mentioned >>>> >>>> "Converting larger to smaller hugetlb pages can be accomplished today by >>>> first freeing the larger page to the buddy allocator and then allocating >>>> the smaller pages. However, there are two issues with this approach: >>>> 1) This process can take quite some time, especially if allocation of >>>> the smaller pages is not immediate and requires migration/compaction. >>>> 2) There is no guarantee that the total size of smaller pages allocated >>>> will match the size of the larger page which was freed. This is >>>> because the area freed by the larger page could quickly be >>>> fragmented." >>>> >>>> These two issues have been experienced in practice. >>> >>> Well the first issue is quantifiable. What is "some time"? If it's >>> people trying to get a 5% speedup on a rare operation because hey, >>> bugging the kernel developers doesn't cost me anything then perhaps we >>> have better things to be doing. >>> >>> And the second problem would benefit from some words to help us >>> understand how much real-world hurt this causes, and how frequently. >>> And let's understand what the userspace workarounds look like, etc. >>> >>>> A big chunk of the code changes (aprox 50%) is for the vmemmap >>>> optimizations. This is also the most complex part of the changes. >>>> I added the code as interaction with vmemmap reduction was discussed >>>> during the RFC. It is only a performance enhancement and honestly >>>> may not be worth the cost/risk. I will get some numbers to measure >>>> the actual benefit. >> >> If it really makes that much of a difference code/complexity wise, would it make sense to just limit denote functionality to the !vmemmap case for now? >> > > Handling vmemmap optimized huge pages is not that big of a deal. We > just use the existing functionality to populate vmemmap for the page > being demoted, and free vmemmap for resulting pages of demoted size. > > This obviously is not 'optimal' for demote as we will allocate more > vmemmap pages than needed and then free the excess pages. The complex > part is not over allocating vmemmap and only sparsely populating vmemmap > for the target pages of demote size. This is all done in patches 6-8. > I am happy to drop these patches for now. The are the most complex (and > ugly) of this series. As mentioned, they do not provide any additional > functionality. > Just looking at the diffstat, that looks like a good idea to me :)
Add Vlastimil and Hillf, On 8/16/21 6:46 PM, Andrew Morton wrote: > On Mon, 16 Aug 2021 17:46:58 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote: > >>> It really is a ton of new code. I think we're owed much more detail >>> about the problem than the above. To be confident that all this >>> material is truly justified? >> >> The desired functionality for this specific use case is to simply >> convert a 1G huegtlb page to 512 2MB hugetlb pages. As mentioned >> >> "Converting larger to smaller hugetlb pages can be accomplished today by >> first freeing the larger page to the buddy allocator and then allocating >> the smaller pages. However, there are two issues with this approach: >> 1) This process can take quite some time, especially if allocation of >> the smaller pages is not immediate and requires migration/compaction. >> 2) There is no guarantee that the total size of smaller pages allocated >> will match the size of the larger page which was freed. This is >> because the area freed by the larger page could quickly be >> fragmented." >> >> These two issues have been experienced in practice. > > Well the first issue is quantifiable. What is "some time"? If it's > people trying to get a 5% speedup on a rare operation because hey, > bugging the kernel developers doesn't cost me anything then perhaps we > have better things to be doing. Well, I set up a test environment on a larger system to get some numbers. My 'load' on the system was filling the page cache with clean pages. The thought is that these pages could easily be reclaimed. When trying to get numbers I hit a hugetlb page allocation stall where __alloc_pages(__GFP_RETRY_MAYFAIL, order 9) would stall forever (or at least an hour). It was very much like the symptoms addressed here: https://lore.kernel.org/linux-mm/20190806014744.15446-1-mike.kravetz@oracle.com/ This was on 5.14.0-rc6-next-20210820. I'll do some more digging as this appears to be some dark corner case of reclaim and/or compaction. The 'good news' is that I can reproduce this. > And the second problem would benefit from some words to help us > understand how much real-world hurt this causes, and how frequently. > And let's understand what the userspace workarounds look like, etc. The stall above was from doing a simple 'free 1GB page' followed by 'allocate 512 MB pages' from userspace. Getting out another version of this series will be delayed, as I think we need to address or understand this issue first.
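For anyone trying to reproduce the setup, the "clean page cache" load
described here can be approximated with one buffered reader per CPU.  The
sketch below is an assumption about the test environment (BIGFILE is a
placeholder for a multi-GB file), not the harness actually used:

/*
 * Rough approximation of the described load: one buffered reader per
 * online CPU streaming a large file so the page cache fills with clean,
 * easily reclaimable pages.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define BIGFILE "/data/bigfile"		/* assumed multi-GB file */

static void reader_loop(void)
{
	char buf[1 << 20];
	int fd = open(BIGFILE, O_RDONLY);

	if (fd < 0) {
		perror("open");
		_exit(1);
	}
	for (;;) {			/* keep re-reading; pages stay clean */
		ssize_t n = read(fd, buf, sizeof(buf));

		if (n <= 0)
			lseek(fd, 0, SEEK_SET);
	}
}

int main(void)
{
	long cpus = sysconf(_SC_NPROCESSORS_ONLN);

	for (long i = 0; i < cpus; i++)
		if (fork() == 0)
			reader_loop();
	while (wait(NULL) > 0)		/* readers run until killed */
		;
	return 0;
}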
On Tue, 24 Aug 2021 15:08:46 -0700 Mike Kravetz wrote: >On 8/16/21 6:46 PM, Andrew Morton wrote: >> On Mon, 16 Aug 2021 17:46:58 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote: >> >>>> It really is a ton of new code. I think we're owed much more detail >>>> about the problem than the above. To be confident that all this >>>> material is truly justified? >>> >>> The desired functionality for this specific use case is to simply >>> convert a 1G huegtlb page to 512 2MB hugetlb pages. As mentioned >>> >>> "Converting larger to smaller hugetlb pages can be accomplished today by >>> first freeing the larger page to the buddy allocator and then allocating >>> the smaller pages. However, there are two issues with this approach: >>> 1) This process can take quite some time, especially if allocation of >>> the smaller pages is not immediate and requires migration/compaction. >>> 2) There is no guarantee that the total size of smaller pages allocated >>> will match the size of the larger page which was freed. This is >>> because the area freed by the larger page could quickly be >>> fragmented." >>> >>> These two issues have been experienced in practice. >> >> Well the first issue is quantifiable. What is "some time"? If it's >> people trying to get a 5% speedup on a rare operation because hey, >> bugging the kernel developers doesn't cost me anything then perhaps we >> have better things to be doing. > >Well, I set up a test environment on a larger system to get some >numbers. My 'load' on the system was filling the page cache with >clean pages. The thought is that these pages could easily be reclaimed. > >When trying to get numbers I hit a hugetlb page allocation stall where >__alloc_pages(__GFP_RETRY_MAYFAIL, order 9) would stall forever (or at >least an hour). It was very much like the symptoms addressed here: >https://lore.kernel.org/linux-mm/20190806014744.15446-1-mike.kravetz@oracle.com/ > >This was on 5.14.0-rc6-next-20210820. > >I'll do some more digging as this appears to be some dark corner case of >reclaim and/or compaction. The 'good news' is that I can reproduce >this. I am on vacation until 1 Sep, with an ear on any light on the corner cases. Hillf > >> And the second problem would benefit from some words to help us >> understand how much real-world hurt this causes, and how frequently. >> And let's understand what the userspace workarounds look like, etc. > >The stall above was from doing a simple 'free 1GB page' followed by >'allocate 512 MB pages' from userspace. > >Getting out another version of this series will be delayed, as I think >we need to address or understand this issue first. >-- >Mike Kravetz
On 8/25/21 00:08, Mike Kravetz wrote: > Add Vlastimil and Hillf, > > Well, I set up a test environment on a larger system to get some > numbers. My 'load' on the system was filling the page cache with > clean pages. The thought is that these pages could easily be reclaimed. > > When trying to get numbers I hit a hugetlb page allocation stall where > __alloc_pages(__GFP_RETRY_MAYFAIL, order 9) would stall forever (or at > least an hour). It was very much like the symptoms addressed here: > https://lore.kernel.org/linux-mm/20190806014744.15446-1-mike.kravetz@oracle.com/ > > This was on 5.14.0-rc6-next-20210820. > > I'll do some more digging as this appears to be some dark corner case of > reclaim and/or compaction. The 'good news' is that I can reproduce > this. Interesting, let's see if that's some kind of new regression. >> And the second problem would benefit from some words to help us >> understand how much real-world hurt this causes, and how frequently. >> And let's understand what the userspace workarounds look like, etc. > > The stall above was from doing a simple 'free 1GB page' followed by > 'allocate 512 MB pages' from userspace. Is the allocation different in any way than the usual hugepage allocation possible today? > Getting out another version of this series will be delayed, as I think > we need to address or understand this issue first. >
On 8/27/21 10:22 AM, Vlastimil Babka wrote: > On 8/25/21 00:08, Mike Kravetz wrote: >> Add Vlastimil and Hillf, >> >> Well, I set up a test environment on a larger system to get some >> numbers. My 'load' on the system was filling the page cache with >> clean pages. The thought is that these pages could easily be reclaimed. >> >> When trying to get numbers I hit a hugetlb page allocation stall where >> __alloc_pages(__GFP_RETRY_MAYFAIL, order 9) would stall forever (or at >> least an hour). It was very much like the symptoms addressed here: >> https://lore.kernel.org/linux-mm/20190806014744.15446-1-mike.kravetz@oracle.com/ >> >> This was on 5.14.0-rc6-next-20210820. >> >> I'll do some more digging as this appears to be some dark corner case of >> reclaim and/or compaction. The 'good news' is that I can reproduce >> this. > > Interesting, let's see if that's some kind of new regression. > >>> And the second problem would benefit from some words to help us >>> understand how much real-world hurt this causes, and how frequently. >>> And let's understand what the userspace workarounds look like, etc. >> >> The stall above was from doing a simple 'free 1GB page' followed by >> 'allocate 512 MB pages' from userspace. > > Is the allocation different in any way than the usual hugepage allocation > possible today? No, it is the same. I was just following the demote use case of free 1GB page and allocate 512 2MB pages. Of course, you would mostly expect that to succeed. The exception is if something else is running and grabs some of those 1GB worth of contiguous pages such that you can not allocate 512 2MB pages. My test case was to have those freed pages be used for file I/O but kept clean so that could easily be reclaimed. I 'may' have been over stressing the system with all CPUs doing file reads to fill the page cache with clean pages. I certainly need to spend some more debug/analysis time on this.
On 8/28/21 01:04, Mike Kravetz wrote: > On 8/27/21 10:22 AM, Vlastimil Babka wrote: > I 'may' have been over stressing the system with all CPUs doing file > reads to fill the page cache with clean pages. I certainly need to > spend some more debug/analysis time on this. Hm that *could* play a role, as these will allow reclaim to make progress, but also the reclaimed pages might be stolen immediately and compaction will return COMPACT_SKIPPED and in should_compact_retry() we might go through this code path: /* * compaction was skipped because there are not enough order-0 pages * to work with, so we retry only if it looks like reclaim can help. */ if (compaction_needs_reclaim(compact_result)) { ret = compaction_zonelist_suitable(ac, order, alloc_flags); goto out; } where compaction_zonelist_suitable() will return true because it appears reclaim can free pages to allow progress. And there are no max retries applied for this case. With the reclaim and compaction tracepoints it should be possible to confirm this scenario.
On 8/30/21 3:11 AM, Vlastimil Babka wrote: > On 8/28/21 01:04, Mike Kravetz wrote: >> On 8/27/21 10:22 AM, Vlastimil Babka wrote: >> I 'may' have been over stressing the system with all CPUs doing file >> reads to fill the page cache with clean pages. I certainly need to >> spend some more debug/analysis time on this. > > Hm that *could* play a role, as these will allow reclaim to make progress, but > also the reclaimed pages might be stolen immediately and compaction will return > COMPACT_SKIPPED and in should_compact_retry() we might go through this code path: > > /* > * compaction was skipped because there are not enough order-0 pages > * to work with, so we retry only if it looks like reclaim can help. > */ > if (compaction_needs_reclaim(compact_result)) { > ret = compaction_zonelist_suitable(ac, order, alloc_flags); > goto out; > } > > where compaction_zonelist_suitable() will return true because it appears > reclaim can free pages to allow progress. And there are no max retries > applied for this case. > With the reclaim and compaction tracepoints it should be possible to > confirm this scenario. Here is some very high level information from a long stall that was interrupted. This was an order 9 allocation from alloc_buddy_huge_page(). 55269.530564] __alloc_pages_slowpath: jiffies 47329325 tries 609673 cpu_tries 1 node 0 FAIL [55269.539893] r_tries 25 c_tries 609647 reclaim 47325161 compact 607 Yes, in __alloc_pages_slowpath for 47329325 jiffies before being interrupted. should_reclaim_retry returned true 25 times and should_compact_retry returned true 609647 times. Almost all time (47325161 jiffies) spent in __alloc_pages_direct_reclaim, and 607 jiffies spent in __alloc_pages_direct_compact. Looks like both reclaim retries > MAX_RECLAIM_RETRIES and compaction retries > MAX_COMPACT_RETRIES
On 9/2/21 20:17, Mike Kravetz wrote: > On 8/30/21 3:11 AM, Vlastimil Babka wrote: >> On 8/28/21 01:04, Mike Kravetz wrote: >>> On 8/27/21 10:22 AM, Vlastimil Babka wrote: >>> I 'may' have been over stressing the system with all CPUs doing file >>> reads to fill the page cache with clean pages. I certainly need to >>> spend some more debug/analysis time on this. >> >> Hm that *could* play a role, as these will allow reclaim to make progress, but >> also the reclaimed pages might be stolen immediately and compaction will return >> COMPACT_SKIPPED and in should_compact_retry() we might go through this code path: >> >> /* >> * compaction was skipped because there are not enough order-0 pages >> * to work with, so we retry only if it looks like reclaim can help. >> */ >> if (compaction_needs_reclaim(compact_result)) { >> ret = compaction_zonelist_suitable(ac, order, alloc_flags); >> goto out; >> } >> >> where compaction_zonelist_suitable() will return true because it appears >> reclaim can free pages to allow progress. And there are no max retries >> applied for this case. >> With the reclaim and compaction tracepoints it should be possible to >> confirm this scenario. > > Here is some very high level information from a long stall that was > interrupted. This was an order 9 allocation from alloc_buddy_huge_page(). > > 55269.530564] __alloc_pages_slowpath: jiffies 47329325 tries 609673 cpu_tries 1 node 0 FAIL > [55269.539893] r_tries 25 c_tries 609647 reclaim 47325161 compact 607 > > Yes, in __alloc_pages_slowpath for 47329325 jiffies before being interrupted. > should_reclaim_retry returned true 25 times and should_compact_retry returned > true 609647 times. > Almost all time (47325161 jiffies) spent in __alloc_pages_direct_reclaim, and > 607 jiffies spent in __alloc_pages_direct_compact. > > Looks like both > reclaim retries > MAX_RECLAIM_RETRIES > and > compaction retries > MAX_COMPACT_RETRIES > Yeah AFAICS that's only possible with the scenario I suspected. I guess we should put a limit on compact retries (maybe some multiple of MAX_COMPACT_RETRIES) even if it thinks that reclaim could help, while clearly it doesn't (i.e. because somebody else is stealing the page like in your test case).
On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote: >On 9/2/21 20:17, Mike Kravetz wrote: >> >> Here is some very high level information from a long stall that was >> interrupted. This was an order 9 allocation from alloc_buddy_huge_page(). >> >> 55269.530564] __alloc_pages_slowpath: jiffies 47329325 tries 609673 cpu_tries 1 node 0 FAIL >> [55269.539893] r_tries 25 c_tries 609647 reclaim 47325161 compact 607 >> >> Yes, in __alloc_pages_slowpath for 47329325 jiffies before being interrupted. >> should_reclaim_retry returned true 25 times and should_compact_retry returned >> true 609647 times. >> Almost all time (47325161 jiffies) spent in __alloc_pages_direct_reclaim, and >> 607 jiffies spent in __alloc_pages_direct_compact. >> >> Looks like both >> reclaim retries > MAX_RECLAIM_RETRIES >> and >> compaction retries > MAX_COMPACT_RETRIES >> >Yeah AFAICS that's only possible with the scenario I suspected. I guess >we should put a limit on compact retries (maybe some multiple of >MAX_COMPACT_RETRIES) even if it thinks that reclaim could help, while >clearly it doesn't (i.e. because somebody else is stealing the page like >in your test case). And/or clamp reclaim retries for costly orders reclaim retries = MAX_RECLAIM_RETRIES - order; to pull down the chance for stall as low as possible.
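For concreteness, with the current values of MAX_RECLAIM_RETRIES (16) and
PAGE_ALLOC_COSTLY_ORDER (3), Hillf's formula would leave the order-9
request discussed here with 7 reclaim retries instead of 16.  The helper
below is purely illustrative arithmetic, not kernel code:

/* Illustrative only -- not kernel code. */
#include <stdio.h>

#define MAX_RECLAIM_RETRIES	16	/* current kernel value */
#define PAGE_ALLOC_COSTLY_ORDER	3	/* current kernel value */

static unsigned int clamped_reclaim_retries(unsigned int order)
{
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return MAX_RECLAIM_RETRIES;
	/* "reclaim retries = MAX_RECLAIM_RETRIES - order", floored at 1 */
	return order < MAX_RECLAIM_RETRIES ? MAX_RECLAIM_RETRIES - order : 1;
}

int main(void)
{
	/* order 9 is the 2MB hugetlb allocation discussed in this thread */
	printf("order 9: %u retries\n", clamped_reclaim_retries(9));
	return 0;
}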
On 9/7/21 1:50 AM, Hillf Danton wrote: > On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote: >> On 9/2/21 20:17, Mike Kravetz wrote: >>> >>> Here is some very high level information from a long stall that was >>> interrupted. This was an order 9 allocation from alloc_buddy_huge_page(). >>> >>> 55269.530564] __alloc_pages_slowpath: jiffies 47329325 tries 609673 cpu_tries 1 node 0 FAIL >>> [55269.539893] r_tries 25 c_tries 609647 reclaim 47325161 compact 607 >>> >>> Yes, in __alloc_pages_slowpath for 47329325 jiffies before being interrupted. >>> should_reclaim_retry returned true 25 times and should_compact_retry returned >>> true 609647 times. >>> Almost all time (47325161 jiffies) spent in __alloc_pages_direct_reclaim, and >>> 607 jiffies spent in __alloc_pages_direct_compact. >>> >>> Looks like both >>> reclaim retries > MAX_RECLAIM_RETRIES >>> and >>> compaction retries > MAX_COMPACT_RETRIES >>> >> Yeah AFAICS that's only possible with the scenario I suspected. I guess >> we should put a limit on compact retries (maybe some multiple of >> MAX_COMPACT_RETRIES) even if it thinks that reclaim could help, while >> clearly it doesn't (i.e. because somebody else is stealing the page like >> in your test case). > > And/or clamp reclaim retries for costly orders > > reclaim retries = MAX_RECLAIM_RETRIES - order; > > to pull down the chance for stall as low as possible. Thanks, and sorry for not replying quickly. I only get back to this as time allows. We could clamp the number of compaction and reclaim retries in __alloc_pages_slowpath as suggested. However, I noticed that a single reclaim call could take a bunch of time. As a result, I instrumented shrink_node to see what might be happening. Here is some information from a long stall. Note that I only dump stats when jiffies > 100000. [ 8136.874706] shrink_node: 507654 total jiffies, 3557110 tries [ 8136.881130] 130596341 reclaimed, 32 nr_to_reclaim [ 8136.887643] compaction_suitable results: [ 8136.893276] idx COMPACT_SKIPPED, 3557109 [ 8672.399839] shrink_node: 522076 total jiffies, 3466228 tries [ 8672.406268] 124427720 reclaimed, 32 nr_to_reclaim [ 8672.412782] compaction_suitable results: [ 8672.418421] idx COMPACT_SKIPPED, 3466227 [ 8908.099592] __alloc_pages_slowpath: jiffies 2939938 tries 17068 cpu_tries 1 node 0 success [ 8908.109120] r_tries 11 c_tries 17056 reclaim 2939865 compact 9 In this case, clamping the number of retries from should_compact_retry and should_reclaim_retry could help. Mostly because we will not be calling back into the reclaim code? Notice the long amount of time spent in shrink_node. The 'tries' in shrink_node come about from that: if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed, sc)) goto again; compaction_suitable results is the values returned from calls to should_continue_reclaim -> compaction_suitable. Trying to think if there might be an intelligent way to quit early.
On Wed, 8 Sep 2021 14:00:19 -0700 Mike Kravetz wrote: >On 9/7/21 1:50 AM, Hillf Danton wrote: >> On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote: >>> On 9/2/21 20:17, Mike Kravetz wrote: >>>> >>>> Here is some very high level information from a long stall that was >>>> interrupted. This was an order 9 allocation from alloc_buddy_huge_page(). >>>> >>>> 55269.530564] __alloc_pages_slowpath: jiffies 47329325 tries 609673 cpu_tries 1 node 0 FAIL >>>> [55269.539893] r_tries 25 c_tries 609647 reclaim 47325161 compact 607 >>>> >>>> Yes, in __alloc_pages_slowpath for 47329325 jiffies before being interrupted. >>>> should_reclaim_retry returned true 25 times and should_compact_retry returned >>>> true 609647 times. >>>> Almost all time (47325161 jiffies) spent in __alloc_pages_direct_reclaim, and >>>> 607 jiffies spent in __alloc_pages_direct_compact. >>>> >>>> Looks like both >>>> reclaim retries > MAX_RECLAIM_RETRIES >>>> and >>>> compaction retries > MAX_COMPACT_RETRIES >>>> >>> Yeah AFAICS that's only possible with the scenario I suspected. I guess >>> we should put a limit on compact retries (maybe some multiple of >>> MAX_COMPACT_RETRIES) even if it thinks that reclaim could help, while >>> clearly it doesn't (i.e. because somebody else is stealing the page like >>> in your test case). >> >> And/or clamp reclaim retries for costly orders >> >> reclaim retries = MAX_RECLAIM_RETRIES - order; >> >> to pull down the chance for stall as low as possible. > >Thanks, and sorry for not replying quickly. I only get back to this as >time allows. > >We could clamp the number of compaction and reclaim retries in >__alloc_pages_slowpath as suggested. However, I noticed that a single >reclaim call could take a bunch of time. As a result, I instrumented >shrink_node to see what might be happening. Here is some information >from a long stall. Note that I only dump stats when jiffies > 100000. > >[ 8136.874706] shrink_node: 507654 total jiffies, 3557110 tries >[ 8136.881130] 130596341 reclaimed, 32 nr_to_reclaim >[ 8136.887643] compaction_suitable results: >[ 8136.893276] idx COMPACT_SKIPPED, 3557109 >[ 8672.399839] shrink_node: 522076 total jiffies, 3466228 tries >[ 8672.406268] 124427720 reclaimed, 32 nr_to_reclaim >[ 8672.412782] compaction_suitable results: >[ 8672.418421] idx COMPACT_SKIPPED, 3466227 >[ 8908.099592] __alloc_pages_slowpath: jiffies 2939938 tries 17068 cpu_tries 1 node 0 success >[ 8908.109120] r_tries 11 c_tries 17056 reclaim 2939865 compact 9 > >In this case, clamping the number of retries from should_compact_retry >and should_reclaim_retry could help. Mostly because we will not be >calling back into the reclaim code? Notice the long amount of time spent >in shrink_node. The 'tries' in shrink_node come about from that: > > if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed, > sc)) > goto again; > >compaction_suitable results is the values returned from calls to >should_continue_reclaim -> compaction_suitable. > >Trying to think if there might be an intelligent way to quit early. Given the downgrade of costly order to zero on the kswapd side, what you found suggests the need to bridge the gap between sc->nr_to_reclaim and compact_gap(sc->order) for direct reclaims. If nr_to_reclaim is the primary target for direct reclaim, one option is ask kswapd to do costly order reclaims, with stall handled by r_tries and c_tries in the slowpath. 
+++ x/mm/vmscan.c
@@ -3220,7 +3220,11 @@ again:
 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 				    sc))
-		goto again;
+		if (!current_is_kswapd() && sc->nr_reclaimed >=
+						sc->nr_to_reclaim)
+			/* job done */ ;
+		else
+			goto again;
 
 	/*
 	 * Kswapd gives up on balancing particular nodes after too
On Wed 08-09-21 14:00:19, Mike Kravetz wrote: > On 9/7/21 1:50 AM, Hillf Danton wrote: > > On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote: > >> On 9/2/21 20:17, Mike Kravetz wrote: > >>> > >>> Here is some very high level information from a long stall that was > >>> interrupted. This was an order 9 allocation from alloc_buddy_huge_page(). > >>> > >>> 55269.530564] __alloc_pages_slowpath: jiffies 47329325 tries 609673 cpu_tries 1 node 0 FAIL > >>> [55269.539893] r_tries 25 c_tries 609647 reclaim 47325161 compact 607 > >>> > >>> Yes, in __alloc_pages_slowpath for 47329325 jiffies before being interrupted. > >>> should_reclaim_retry returned true 25 times and should_compact_retry returned > >>> true 609647 times. > >>> Almost all time (47325161 jiffies) spent in __alloc_pages_direct_reclaim, and > >>> 607 jiffies spent in __alloc_pages_direct_compact. > >>> > >>> Looks like both > >>> reclaim retries > MAX_RECLAIM_RETRIES > >>> and > >>> compaction retries > MAX_COMPACT_RETRIES > >>> > >> Yeah AFAICS that's only possible with the scenario I suspected. I guess > >> we should put a limit on compact retries (maybe some multiple of > >> MAX_COMPACT_RETRIES) even if it thinks that reclaim could help, while > >> clearly it doesn't (i.e. because somebody else is stealing the page like > >> in your test case). > > > > And/or clamp reclaim retries for costly orders > > > > reclaim retries = MAX_RECLAIM_RETRIES - order; > > > > to pull down the chance for stall as low as possible. > > Thanks, and sorry for not replying quickly. I only get back to this as > time allows. > > We could clamp the number of compaction and reclaim retries in > __alloc_pages_slowpath as suggested. However, I noticed that a single > reclaim call could take a bunch of time. As a result, I instrumented > shrink_node to see what might be happening. Here is some information > from a long stall. Note that I only dump stats when jiffies > 100000. > > [ 8136.874706] shrink_node: 507654 total jiffies, 3557110 tries > [ 8136.881130] 130596341 reclaimed, 32 nr_to_reclaim > [ 8136.887643] compaction_suitable results: > [ 8136.893276] idx COMPACT_SKIPPED, 3557109 Can you get a more detailed break down of where the time is spent. Also How come the number of reclaimed pages is so excessive comparing to the reclaim target? There is something fishy going on here.
On 9/9/21 13:54, Michal Hocko wrote: > On Wed 08-09-21 14:00:19, Mike Kravetz wrote: >> On 9/7/21 1:50 AM, Hillf Danton wrote: >> > On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote: >> > >> > And/or clamp reclaim retries for costly orders >> > >> > reclaim retries = MAX_RECLAIM_RETRIES - order; >> > >> > to pull down the chance for stall as low as possible. >> >> Thanks, and sorry for not replying quickly. I only get back to this as >> time allows. >> >> We could clamp the number of compaction and reclaim retries in >> __alloc_pages_slowpath as suggested. However, I noticed that a single >> reclaim call could take a bunch of time. As a result, I instrumented >> shrink_node to see what might be happening. Here is some information >> from a long stall. Note that I only dump stats when jiffies > 100000. >> >> [ 8136.874706] shrink_node: 507654 total jiffies, 3557110 tries >> [ 8136.881130] 130596341 reclaimed, 32 nr_to_reclaim >> [ 8136.887643] compaction_suitable results: >> [ 8136.893276] idx COMPACT_SKIPPED, 3557109 > > Can you get a more detailed break down of where the time is spent. Also > How come the number of reclaimed pages is so excessive comparing to the > reclaim target? There is something fishy going on here. I would say it's simply should_continue_reclaim() behaving similarly to should_compact_retry(). We'll get compaction_suitable() returning COMPACT_SKIPPED because the reclaimed pages have been immediately stolen, and compaction indicates there's not enough base pages to begin with to form a high-order pages. Since the stolen pages will appear on inactive lru, it seems to be worth continuing reclaim to make enough free base pages for compaction to no longer be skipped, because "inactive_lru_pages > pages_for_compaction" is true. So, both should_continue_reclaim() and should_compact_retry() are unable to recognize that reclaimed pages are being stolen and limit the retries in that case. The scenario seems to be uncommon, otherwise we'd be getting more reports of that.
On 9/9/21 6:45 AM, Vlastimil Babka wrote: > On 9/9/21 13:54, Michal Hocko wrote: >> On Wed 08-09-21 14:00:19, Mike Kravetz wrote: >>> On 9/7/21 1:50 AM, Hillf Danton wrote: >>>> On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote: >>>> >>>> And/or clamp reclaim retries for costly orders >>>> >>>> reclaim retries = MAX_RECLAIM_RETRIES - order; >>>> >>>> to pull down the chance for stall as low as possible. >>> >>> Thanks, and sorry for not replying quickly. I only get back to this as >>> time allows. >>> >>> We could clamp the number of compaction and reclaim retries in >>> __alloc_pages_slowpath as suggested. However, I noticed that a single >>> reclaim call could take a bunch of time. As a result, I instrumented >>> shrink_node to see what might be happening. Here is some information >>> from a long stall. Note that I only dump stats when jiffies > 100000. >>> >>> [ 8136.874706] shrink_node: 507654 total jiffies, 3557110 tries >>> [ 8136.881130] 130596341 reclaimed, 32 nr_to_reclaim >>> [ 8136.887643] compaction_suitable results: >>> [ 8136.893276] idx COMPACT_SKIPPED, 3557109 >> >> Can you get a more detailed break down of where the time is spent. Also >> How come the number of reclaimed pages is so excessive comparing to the >> reclaim target? There is something fishy going on here. > > I would say it's simply should_continue_reclaim() behaving similarly to > should_compact_retry(). We'll get compaction_suitable() returning > COMPACT_SKIPPED because the reclaimed pages have been immediately stolen, > and compaction indicates there's not enough base pages to begin with to form > a high-order pages. Since the stolen pages will appear on inactive lru, it > seems to be worth continuing reclaim to make enough free base pages for > compaction to no longer be skipped, because "inactive_lru_pages > > pages_for_compaction" is true. > > So, both should_continue_reclaim() and should_compact_retry() are unable to > recognize that reclaimed pages are being stolen and limit the retries in > that case. The scenario seems to be uncommon, otherwise we'd be getting more > reports of that. Yes, I believe this is what is happening. I honestly do not know if my test/recreation scenario is realistic. I do know that our DB team has had issues with allocating a number of hugetlb pages (after much uptime) taking forever or a REALLY long time. These are all 2MB huge page allocations, so going through the normal page allocation code path. No idea what else is running on the system at the time of the allocation stalls. Unfortunately, this can not be reproduced at will in their environment. As a result, I have no data and just this brief description of the issue. When I stumbled on an easy way to recreate, I thought it would be worth investigating/fixing. It certainly does not seem to be a common scenario.
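Since the production systems cannot reproduce the stalls at will, one
low-cost check they could run is to sample a few /proc/vmstat counters
around a slow allocation.  The sketch below is only indicative: the
counter names exist in current kernels, but reading a large pgsteal_direct
delta alongside climbing compact_stall/compact_fail and a flat
nr_free_pages as evidence of "reclaimed then stolen" is a heuristic, not a
definitive diagnostic.

/*
 * Indicative only: sample a few /proc/vmstat counters before and after a
 * hugetlb allocation attempt.  A large jump in pgsteal_direct with
 * compact_stall/compact_fail climbing while nr_free_pages stays flat is
 * consistent with reclaimed pages being stolen before compaction can use
 * them.
 */
#include <stdio.h>
#include <string.h>

static const char *keys[] = {
	"nr_free_pages", "pgsteal_direct", "compact_stall", "compact_fail",
};

static void sample(unsigned long out[4])
{
	char name[64];
	unsigned long val;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return;
	while (fscanf(f, "%63s %lu", name, &val) == 2)
		for (int i = 0; i < 4; i++)
			if (!strcmp(name, keys[i]))
				out[i] = val;
	fclose(f);
}

int main(void)
{
	unsigned long before[4] = {0}, after[4] = {0};

	sample(before);
	/* ... trigger the 2MB hugetlb allocations here ... */
	sample(after);

	for (int i = 0; i < 4; i++)
		printf("%-16s delta %lu\n", keys[i], after[i] - before[i]);
	return 0;
}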
On Thu 09-09-21 15:45:45, Vlastimil Babka wrote: > On 9/9/21 13:54, Michal Hocko wrote: > > On Wed 08-09-21 14:00:19, Mike Kravetz wrote: > >> On 9/7/21 1:50 AM, Hillf Danton wrote: > >> > On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote: > >> > > >> > And/or clamp reclaim retries for costly orders > >> > > >> > reclaim retries = MAX_RECLAIM_RETRIES - order; > >> > > >> > to pull down the chance for stall as low as possible. > >> > >> Thanks, and sorry for not replying quickly. I only get back to this as > >> time allows. > >> > >> We could clamp the number of compaction and reclaim retries in > >> __alloc_pages_slowpath as suggested. However, I noticed that a single > >> reclaim call could take a bunch of time. As a result, I instrumented > >> shrink_node to see what might be happening. Here is some information > >> from a long stall. Note that I only dump stats when jiffies > 100000. > >> > >> [ 8136.874706] shrink_node: 507654 total jiffies, 3557110 tries > >> [ 8136.881130] 130596341 reclaimed, 32 nr_to_reclaim > >> [ 8136.887643] compaction_suitable results: > >> [ 8136.893276] idx COMPACT_SKIPPED, 3557109 > > > > Can you get a more detailed break down of where the time is spent. Also > > How come the number of reclaimed pages is so excessive comparing to the > > reclaim target? There is something fishy going on here. > > I would say it's simply should_continue_reclaim() behaving similarly to > should_compact_retry(). We'll get compaction_suitable() returning > COMPACT_SKIPPED because the reclaimed pages have been immediately stolen, > and compaction indicates there's not enough base pages to begin with to form > a high-order pages. Since the stolen pages will appear on inactive lru, it > seems to be worth continuing reclaim to make enough free base pages for > compaction to no longer be skipped, because "inactive_lru_pages > > pages_for_compaction" is true. > > So, both should_continue_reclaim() and should_compact_retry() are unable to > recognize that reclaimed pages are being stolen and limit the retries in > that case. The scenario seems to be uncommon, otherwise we'd be getting more > reports of that. Thanks this sounds like a viable explanation. For some reason I have thought that a single reclaim round can take that much time but reading again it seems like a cumulative thing. I do agree that we should consider the above scenario when bailing out. It is simply unreasonable to reclaim so much memory without any forward progress. A more robust way to address this problem (which is not really new) would be to privatize reclaimed pages in the direct reclaim context and reuse them in the compaction so that it doesn't have to just pro-actively reclaim to keep watermarks good enough. A normal low order requests would benefit from that as well because the reclaim couldn't be stolen from them as well. Another approach would be to detect the situation and treat reclaim success which immediatelly leads to compaction deferral due to watermarks as a failure.
On 9/10/21 1:20 AM, Michal Hocko wrote:
> On Thu 09-09-21 15:45:45, Vlastimil Babka wrote:
>>
>> I would say it's simply should_continue_reclaim() behaving similarly to
>> should_compact_retry(). We'll get compaction_suitable() returning
>> COMPACT_SKIPPED because the reclaimed pages have been immediately stolen,
>> and compaction indicates there's not enough base pages to begin with to form
>> a high-order pages. Since the stolen pages will appear on inactive lru, it
>> seems to be worth continuing reclaim to make enough free base pages for
>> compaction to no longer be skipped, because "inactive_lru_pages >
>> pages_for_compaction" is true.
>>
>> So, both should_continue_reclaim() and should_compact_retry() are unable to
>> recognize that reclaimed pages are being stolen and limit the retries in
>> that case. The scenario seems to be uncommon, otherwise we'd be getting more
>> reports of that.
>
> Thanks this sounds like a viable explanation. For some reason I have
> thought that a single reclaim round can take that much time but reading
> again it seems like a cumulative thing.
>
> I do agree that we should consider the above scenario when bailing out.
> It is simply unreasonable to reclaim so much memory without any forward
> progress.

A very simple patch to bail early eliminated the LONG stalls and
decreased the time of a bulk allocation request.

Recall, the use case is converting 1G huge pages to the corresponding
amount of 2MB huge pages.  I have been told that 50GB is not an uncommon
amount of pages to 'convert'.  So, the test case is: free 50GB of hugetlb
pages, then allocate 25600 2MB hugetlb pages.  I have not done enough
runs to get anything statistically valid, but here are some ballpark
numbers.

Unmodified baseline:
-------------------
Unexpected number of 2MB pages after demote
   Expected 25600, have 19944

real    0m42.092s
user    0m0.008s
sys     0m41.467s

Retries capped by patch below:
------------------------------
Unexpected number of 2MB pages after demote
   Expected 25600, have 19395

real    0m12.951s
user    0m0.010s
sys     0m12.840s

The time of the operation is certainly better, but do note that we
allocated 549 fewer pages.  So, bailing early might cause some failures
that continuing to retry would avoid.  It is all a tradeoff.  One of the
reasons for considering demote functionality is that the conversion
would be done in the hugetlb code without getting the page allocator,
reclaim, compaction, etc. involved.

> A more robust way to address this problem (which is not really new)
> would be to privatize reclaimed pages in the direct reclaim context and
> reuse them in the compaction so that it doesn't have to just
> pro-actively reclaim to keep watermarks good enough. A normal low order
> requests would benefit from that as well because the reclaim couldn't be
> stolen from them as well.

That does sound interesting.  Perhaps a longer term project.

> Another approach would be to detect the situation and treat reclaim
> success which immediatelly leads to compaction deferral due to
> watermarks as a failure.

I'll look into detecting and responding to this.
Simple limit retries patch:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eeb3a9c..16f3055 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4896,6 +4896,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	int no_progress_loops;
 	unsigned int cpuset_mems_cookie;
 	int reserve_flags;
+	unsigned long tries = 0, max_tries = 0;
 
 	/*
 	 * We also sanity check to catch abuse of atomic reserves being used by
@@ -4904,6 +4905,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
 				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
 		gfp_mask &= ~__GFP_ATOMIC;
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		max_tries = 1Ul << order;
 
 retry_cpuset:
 	compaction_retries = 0;
@@ -4996,6 +4999,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	}
 
 retry:
+	tries++;
 	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
@@ -5064,8 +5068,18 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	if (did_some_progress > 0 &&
 			should_compact_retry(ac, order, alloc_flags,
 				compact_result, &compact_priority,
-				&compaction_retries))
+				&compaction_retries)) {
+		/*
+		 * In one pathological case, pages can be stolen immediately
+		 * after reclaimed.  It looks like we are making progress, and
+		 * compaction_retries is not incremented.  This could cause
+		 * an indefinite number of retries.  Cap the number of retries
+		 * for costly orders.
+		 */
+		if (max_tries && tries > max_tries)
+			goto nopage;
 		goto retry;
+	}
 
 
 	/* Deal with possible cpuset update races before we start OOM killing */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeae2f6..519e16e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2894,10 +2894,14 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	struct lruvec *target_lruvec;
 	bool reclaimable = false;
 	unsigned long file;
+	unsigned long tries = 0, max_tries = 0;
 
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+		max_tries = 1UL << sc->order;
 
 again:
+	tries++;
 	memset(&sc->nr, 0, sizeof(sc->nr));
 
 	nr_reclaimed = sc->nr_reclaimed;
@@ -3066,9 +3070,20 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		wait_iff_congested(BLK_RW_ASYNC, HZ/10);
 
 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
-				    sc))
+				    sc)) {
+		/*
+		 * In one pathological case, pages can be stolen immediately
+		 * after being reclaimed.  This can cause an indefinite number
+		 * of retries.  Limit the number of retries for costly orders.
+		 */
+		if (!current_is_kswapd() &&
+		    max_tries && tries > max_tries &&
+		    sc->nr_reclaimed > sc->nr_to_reclaim)
+			goto done;
 		goto again;
+	}
 
+done:
 	/*
 	 * Kswapd gives up on balancing particular nodes after too
 	 * many failures to reclaim anything from them and goes to
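As a rough illustration of what the cap amounts to for the workload in this
thread (assuming 4K base pages and PAGE_ALLOC_COSTLY_ORDER == 3 as in current
mainline), both the slowpath loop and each shrink_node() invocation get bounded
at 512 passes for a 2MB request:

#include <stdio.h>

int main(void)
{
	unsigned int costly_order = 3;  /* PAGE_ALLOC_COSTLY_ORDER */
	unsigned int order = 9;         /* one 2MB hugetlb page = 512 x 4K */

	if (order > costly_order)
		printf("max_tries = %lu\n", 1UL << order);  /* prints 512 */
	return 0;
}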
On Fri, 10 Sep 2021 17:11:05 -0700 Mike Kravetz wrote:
> On 9/10/21 1:20 AM, Michal Hocko wrote:
> > On Thu 09-09-21 15:45:45, Vlastimil Babka wrote:
> > > [...]
> >
> > Thanks, this sounds like a viable explanation. For some reason I thought
> > that a single reclaim round could take that much time, but reading it
> > again it seems to be a cumulative thing.
> >
> > I do agree that we should consider the above scenario when bailing out.
> > It is simply unreasonable to reclaim so much memory without any forward
> > progress.
>
> A very simple patch to bail early eliminated the LONG stalls and
> decreased the time of a bulk allocation request.
>
> Recall, the use case is converting 1G huge pages to the corresponding
> amount of 2MB huge pages.  I have been told that 50GB is not an uncommon
> amount of pages to 'convert'.  So, the test case is: free 50GB of hugetlb
> pages, then allocate 25600 2MB hugetlb pages.  I have not done enough
> runs to get anything statistically valid, but here are some ballpark
> numbers.
>
> Unmodified baseline:
> --------------------
> Unexpected number of 2MB pages after demote
>   Expected 25600, have 19944
>
> real    0m42.092s
> user    0m0.008s
> sys     0m41.467s
>
> Retries capped by patch below:
> ------------------------------
> Unexpected number of 2MB pages after demote
>   Expected 25600, have 19395
>
> real    0m12.951s
> user    0m0.010s
> sys     0m12.840s
>
> The time of the operation is certainly better, but do note that we
> allocated 549 fewer pages.  So, bailing early might cause failures that
> continued retrying would have avoided.  It is all a tradeoff.

A hard one to strike, really, with all parties' considerations met.

> One of the reasons for considering demote functionality is that the
> conversion would be done in the hugetlb code without getting the page
> allocator, reclaim, compaction, etc. involved.

Particularly when this involvement is a must-have.  Given the role of
stolen pages in the stall, a first step worth trying is to throttle the
concurrent bulk allocators before taking a bigger hammer to the stall.
static int hugetlb_2M_page_allocators;
static DECLARE_WAIT_QUEUE_HEAD(hugetlb_2M_page_allocators_wq);
static DEFINE_MUTEX(hugetlb_2M_page_allocators_mutex);

static bool inc_hugetlb_2M_page_allocators(void)
{
	bool rc = false;

	mutex_lock(&hugetlb_2M_page_allocators_mutex);
	if (hugetlb_2M_page_allocators < 2) {
		rc = true;
		++hugetlb_2M_page_allocators;
	}
	mutex_unlock(&hugetlb_2M_page_allocators_mutex);

	return rc;
}

static void dec_hugetlb_2M_page_allocators(void)
{
	mutex_lock(&hugetlb_2M_page_allocators_mutex);
	--hugetlb_2M_page_allocators;
	wake_up(&hugetlb_2M_page_allocators_wq);
	mutex_unlock(&hugetlb_2M_page_allocators_mutex);
}

static struct page *throttle_hugetlb_2M_page_allocators(void)
{
	struct page *page;

	wait_event(hugetlb_2M_page_allocators_wq,
		   true == inc_hugetlb_2M_page_allocators());

	page = __alloc_pages_nodemask(gfp_mask, order2M, nid, nmask);

	dec_hugetlb_2M_page_allocators();

	return page;
}

[...]
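The same "at most two concurrent 2MB allocators" idea could also be expressed
with a counting semaphore, which is the usual kernel primitive for bounding
concurrency.  The fragment below is only an illustration of that alternative,
not code from the thread; the allocator call and the limit of two mirror the
sketch above, and the parameters are passed in rather than taken from
surrounding context:

#include <linux/init.h>
#include <linux/gfp.h>
#include <linux/semaphore.h>

static struct semaphore hugetlb_2M_alloc_sem;

static int __init hugetlb_2M_throttle_init(void)
{
	/* allow at most two concurrent 2MB bulk allocators */
	sema_init(&hugetlb_2M_alloc_sem, 2);
	return 0;
}
early_initcall(hugetlb_2M_throttle_init);

static struct page *throttled_alloc_2M(gfp_t gfp_mask, unsigned int order2M,
				       int nid, nodemask_t *nmask)
{
	struct page *page;

	down(&hugetlb_2M_alloc_sem);
	page = __alloc_pages_nodemask(gfp_mask, order2M, nid, nmask);
	up(&hugetlb_2M_alloc_sem);

	return page;
}

Whether a mutex-protected counter plus waitqueue or a semaphore is used, the
effect is the same: fewer allocators racing to steal each other's reclaimed
pages.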
On Fri 10-09-21 17:11:05, Mike Kravetz wrote:
[...]
> @@ -5064,8 +5068,18 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	if (did_some_progress > 0 &&
>  			should_compact_retry(ac, order, alloc_flags,
>  				compact_result, &compact_priority,
> -				&compaction_retries))
> +				&compaction_retries)) {
> +		/*
> +		 * In one pathological case, pages can be stolen immediately
> +		 * after reclaimed.  It looks like we are making progress, and
> +		 * compaction_retries is not incremented.  This could cause
> +		 * an indefinite number of retries.  Cap the number of retries
> +		 * for costly orders.
> +		 */
> +		if (max_tries && tries > max_tries)
> +			goto nopage;
>  		goto retry;
> +	}

I do not think this is a good approach.  We do not want to play with
retry numbers.  If we want to go with a minimal change for now, then the
compaction feedback mechanism should track the number of pages reclaimed
to satisfy watermarks, and if that grows beyond what is reasonable
(proportionally to the request size) then simply give up rather than
keep trying again and again.
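A toy illustration (not kernel code) of the feedback being suggested here:
accumulate how much one allocation has reclaimed and stop once it exceeds some
multiple of the request size.  The factor of 4 is an arbitrary placeholder, not
a value proposed in the thread:

#include <stdio.h>
#include <stdbool.h>

static bool worth_retrying(unsigned int order, unsigned long reclaimed_total)
{
	unsigned long request = 1UL << order;   /* base pages requested */

	/* reclaiming several times the request without success suggests theft */
	return reclaimed_total <= 4 * request;
}

int main(void)
{
	unsigned long reclaimed = 0;
	unsigned int order = 9;                 /* one 2MB hugetlb page */
	int attempts = 0;

	/* each attempt "reclaims" 32 pages which are then stolen */
	while (worth_retrying(order, reclaimed)) {
		reclaimed += 32;
		attempts++;
	}

	printf("gave up after %d attempts, %lu pages reclaimed\n",
	       attempts, reclaimed);
	return 0;
}

Unlike a raw retry cap, this bound scales with how much reclaim work has
actually been wasted, which is closer to the "proportional to the request
size" criterion described above.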
On 9/13/21 8:50 AM, Michal Hocko wrote:
> On Fri 10-09-21 17:11:05, Mike Kravetz wrote:
> [...]
>
> I do not think this is a good approach.  We do not want to play with
> retry numbers.  If we want to go with a minimal change for now, then the
> compaction feedback mechanism should track the number of pages reclaimed
> to satisfy watermarks, and if that grows beyond what is reasonable
> (proportionally to the request size) then simply give up rather than
> keep trying again and again.

I am experimenting with code similar to the patch below.  The idea is to
key off of 'COMPACT_SKIPPED' being returned.  This implies that not enough
pages are available for compaction.  One of the values in this calculation
is compact_gap(), which is scaled based on the allocation order.  The
simple patch is to give up if the number of reclaimed pages is greater
than compact_gap().  The thought is that this implies pages are being
stolen before compaction can use them.

Such a patch does help in my specific test.  However, the concern is
that giving up earlier may regress some workloads.  One could of course
come up with a scenario where one more retry would result in success.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eeb3a9c..f8e3b0e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4413,6 +4413,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 static inline bool
 should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
+		     unsigned long nr_reclaimed,
 		     enum compact_result compact_result,
 		     enum compact_priority *compact_priority,
 		     int *compaction_retries)
 {
@@ -4445,7 +4446,16 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	 * to work with, so we retry only if it looks like reclaim can help.
 	 */
 	if (compaction_needs_reclaim(compact_result)) {
-		ret = compaction_zonelist_suitable(ac, order, alloc_flags);
+		/*
+		 * If we reclaimed enough pages, but they are not available
+		 * now, this implies they were stolen.  Do not reclaim again
+		 * as this could go on indefinitely.
+		 */
+		if (nr_reclaimed > compact_gap(order))
+			ret = false;
+		else
+			ret = compaction_zonelist_suitable(ac, order,
+								alloc_flags);
 		goto out;
 	}
 
@@ -4504,6 +4514,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 static inline bool
 should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
 		     enum compact_result compact_result,
+		     unsigned long nr_reclaimed,
 		     enum compact_priority *compact_priority,
 		     int *compaction_retries)
 {
@@ -4890,6 +4901,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	struct page *page = NULL;
 	unsigned int alloc_flags;
 	unsigned long did_some_progress;
+	unsigned long nr_reclaimed;
 	enum compact_priority compact_priority;
 	enum compact_result compact_result;
 	int compaction_retries;
@@ -5033,6 +5045,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 							&did_some_progress);
 	if (page)
 		goto got_pg;
+	nr_reclaimed = did_some_progress;
 
 	/* Try direct compaction and then allocating */
 	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
@@ -5063,7 +5076,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	 */
 	if (did_some_progress > 0 &&
 			should_compact_retry(ac, order, alloc_flags,
-				compact_result, &compact_priority,
+				nr_reclaimed, compact_result, &compact_priority,
 				&compaction_retries))
 		goto retry;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeae2f6..d40eca5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2819,10 +2819,18 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 	}
 
 	/*
+	 * If we have reclaimed enough pages for compaction, and compaction
+	 * is not ready, this implies pages are being stolen as they are being
+	 * reclaimed.  Quit now instead of continual retrying.
+	 */
+	pages_for_compaction = compact_gap(sc->order);
+	if (!current_is_kswapd() && sc->nr_reclaimed > pages_for_compaction)
+		return false;
+
+	/*
 	 * If we have not reclaimed enough pages for compaction and the
 	 * inactive lists are large enough, continue reclaiming
 	 */
-	pages_for_compaction = compact_gap(sc->order);
 	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
 	if (get_nr_swap_pages() > 0)
 		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
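For the order-9 allocations in this test the new cut-off is quite small in
absolute terms.  A quick stand-alone calculation, assuming compact_gap(order)
is 2UL << order as in mainline at the time and 4K base pages (the helper is
copied here only for illustration):

#include <stdio.h>

static unsigned long compact_gap(unsigned int order)
{
	return 2UL << order;    /* mainline definition at the time of writing */
}

int main(void)
{
	unsigned int order = 9;     /* one 2MB hugetlb page */

	printf("give up after reclaiming more than %lu base pages (%lu MB)\n",
	       compact_gap(order), compact_gap(order) * 4 / 1024);
	return 0;
}

That works out to 1024 base pages, i.e. about 4MB of reclaim per 2MB request
before the allocation stops retrying.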
On 9/15/21 9:57 AM, Mike Kravetz wrote:
> On 9/13/21 8:50 AM, Michal Hocko wrote:
>> I do not think this is a good approach.  We do not want to play with
>> retry numbers.  If we want to go with a minimal change for now, then the
>> compaction feedback mechanism should track the number of pages reclaimed
>> to satisfy watermarks, and if that grows beyond what is reasonable
>> (proportionally to the request size) then simply give up rather than
>> keep trying again and again.
>
> I am experimenting with code similar to the patch below.  The idea is to
> key off of 'COMPACT_SKIPPED' being returned.  This implies that not enough
> pages are available for compaction.  One of the values in this calculation
> is compact_gap(), which is scaled based on the allocation order.  The
> simple patch is to give up if the number of reclaimed pages is greater
> than compact_gap().  The thought is that this implies pages are being
> stolen before compaction can use them.
>
> Such a patch does help in my specific test.  However, the concern is
> that giving up earlier may regress some workloads.  One could of course
> come up with a scenario where one more retry would result in success.

With the patch below, I instrumented the number of times shrink_node and
__alloc_pages_slowpath would 'quit earlier than before'.  In my test
(free 1GB pages, allocate 2MB pages), this early exit was exercised a
significant number of times:

shrink_node: 931124 quick exits
__alloc_pages_slowpath: bail early 32 times

For comparison, I ran the LTP mm tests, mostly because they include a few
OOM test scenarios.  Test results were unchanged, and the numbers for
exiting early were much lower than in my hugetlb test:

shrink_node: 57 quick exits
__alloc_pages_slowpath: bail early 0 times

This at least 'looks' like the changes target the page-stealing corner
case in my test and minimally impact other scenarios.

Any suggestions for other tests to run, or tweaks to the code?