Message ID: 20200228033819.3857058-1-ying.huang@intel.com (mailing list archive)
Series: mm: Discard lazily freed pages when migrating
On Fri, Feb 28, 2020 at 11:38:16AM +0800, Huang, Ying wrote: > MADV_FREE is a lazy free mechanism in Linux. According to the manpage > of madvise(2), the semantics of MADV_FREE is, > > The application no longer requires the pages in the range specified > by addr and len. The kernel can thus free these pages, but the > freeing could be delayed until memory pressure occurs. ... > > Originally, the pages freed lazily by MADV_FREE will only be freed > really by page reclaiming when there is memory pressure or when > unmapping the address range. In addition to that, there's another > opportunity to free these pages really, when we try to migrate them. > > The main value to do that is to avoid to create the new memory > pressure immediately if possible. Instead, even if the pages are > required again, they will be allocated gradually on demand. That is, > the memory will be allocated lazily when necessary. This follows the > common philosophy in the Linux kernel, allocate resources lazily on > demand. Do you have an example program which does this (and so benefits)? If so, can you quantify the benefit at all?
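For context, here is a minimal userspace sketch of the MADV_FREE pattern under discussion. It is illustrative only: the buffer size and access pattern are arbitrary assumptions, not taken from any workload in this thread, and MADV_FREE needs Linux 4.5+ plus a libc that exposes the flag.

#define _DEFAULT_SOURCE
#include <string.h>
#include <sys/mman.h>

#define BUF_SIZE (16UL * 4096)

int main(void)
{
        /* Anonymous private memory, the usual backing for malloc arenas. */
        char *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return 1;

        memset(buf, 0xaa, BUF_SIZE);            /* populate and dirty the pages */

        /*
         * Lazily "free" the buffer: the kernel may reclaim these pages under
         * memory pressure, but until then the physical pages and their
         * contents stay in place.
         */
        madvise(buf, BUF_SIZE, MADV_FREE);

        /*
         * Reuse: each written page is either the old page (cheap) or a
         * freshly allocated, zeroed one (the cost being debated here).
         * Writing cancels the lazy free for the touched pages.
         */
        memset(buf, 0x55, BUF_SIZE);

        munmap(buf, BUF_SIZE);
        return 0;
}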
Hi, Matthew, Matthew Wilcox <willy@infradead.org> writes: > On Fri, Feb 28, 2020 at 11:38:16AM +0800, Huang, Ying wrote: >> MADV_FREE is a lazy free mechanism in Linux. According to the manpage >> of madvise(2), the semantics of MADV_FREE is, >> >> The application no longer requires the pages in the range specified >> by addr and len. The kernel can thus free these pages, but the >> freeing could be delayed until memory pressure occurs. ... >> >> Originally, the pages freed lazily by MADV_FREE will only be freed >> really by page reclaiming when there is memory pressure or when >> unmapping the address range. In addition to that, there's another >> opportunity to free these pages really, when we try to migrate them. >> >> The main value to do that is to avoid to create the new memory >> pressure immediately if possible. Instead, even if the pages are >> required again, they will be allocated gradually on demand. That is, >> the memory will be allocated lazily when necessary. This follows the >> common philosophy in the Linux kernel, allocate resources lazily on >> demand. > > Do you have an example program which does this (and so benefits)? Sorry, what exactly do you mean by "this" here? Call madvise(,,MADV_FREE)? Or migrate pages? > If so, can you quantify the benefit at all? The question is what is the right workload? For example, I can build a scenario as below to show a benefit. - run program A in node 0 with many lazily freed pages - run program B in node 1, so that the free memory on node 1 is low - migrate program A from node 0 to node 1, so that program B is affected by the memory pressure created by migrating lazily freed pages. Best Regards, Huang, Ying
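The scenario described above can be driven with the migrate_pages(2) system call (the primitive behind tools like migratepages(8)). A hedged sketch, assuming nodes 0 and 1 exist and the caller has the required privileges over the target pid; error handling is minimal.

#include <numaif.h>             /* migrate_pages(2); link with -lnuma */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <pid>\n", argv[0]);
                return 1;
        }

        int pid = atoi(argv[1]);                /* "program A" in the scenario */
        unsigned long from = 1UL << 0;          /* node mask containing node 0 */
        unsigned long to   = 1UL << 1;          /* node mask containing node 1 */

        /*
         * Move all of the process's pages from node 0 to node 1.  Whether
         * its clean MADV_FREE pages are copied to node 1 or simply dropped
         * is exactly what this patchset changes.
         */
        long left = migrate_pages(pid, 8 * sizeof(from), &from, &to);
        if (left < 0) {
                perror("migrate_pages");
                return 1;
        }
        printf("pages that could not be moved: %ld\n", left);
        return 0;
}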
On 28.02.20 08:25, Huang, Ying wrote: > Hi, Matthew, > > Matthew Wilcox <willy@infradead.org> writes: > >> On Fri, Feb 28, 2020 at 11:38:16AM +0800, Huang, Ying wrote: >>> MADV_FREE is a lazy free mechanism in Linux. According to the manpage >>> of mavise(2), the semantics of MADV_FREE is, >>> >>> The application no longer requires the pages in the range specified >>> by addr and len. The kernel can thus free these pages, but the >>> freeing could be delayed until memory pressure occurs. ... >>> >>> Originally, the pages freed lazily by MADV_FREE will only be freed >>> really by page reclaiming when there is memory pressure or when >>> unmapping the address range. In addition to that, there's another >>> opportunity to free these pages really, when we try to migrate them. >>> >>> The main value to do that is to avoid to create the new memory >>> pressure immediately if possible. Instead, even if the pages are >>> required again, they will be allocated gradually on demand. That is, >>> the memory will be allocated lazily when necessary. This follows the >>> common philosophy in the Linux kernel, allocate resources lazily on >>> demand. >> >> Do you have an example program which does this (and so benefits)? > > Sorry, what do you mean exactly for "this" here? Call > madvise(,,MADV_FREE)? Or migrate pages? > >> If so, can you quantify the benefit at all? > > The question is what is the right workload? For example, I can build a > scenario as below to show benefit. We usually don't optimize for theoretical issues. Is there a real-life workload you are trying to optimize this code for? > > - run program A in node 0 with many lazily freed pages > > - run program B in node 1, so that the free memory on node 1 is low > > - migrate the program A from node 0 to node 1, so that the program B is > influenced by the memory pressure created by migrating lazily freed > pages. > E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will report currently free pages to the hypervisor, which will MADV_FREE the reported memory. As long as there is no memory pressure, there is no need to actually free the pages. Once the guest reuses such a page, it could happen that there is still the old page and pulling in in a fresh (zeroed) page can be avoided. AFAIKs, after your change, we would get more pages discarded from our guest, resulting in more fresh (zeroed) pages having to be pulled in when a guest touches a reported free page again. But OTOH, page migration is speed up (avoiding to migrate these pages). However, one important question, will you always discard memory when migrating pages, or only if there is memory pressure on the migration target?
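A hedged sketch of the hypervisor-side action described above. This is not QEMU's actual implementation; the single flat guest-RAM mapping and the discard_reported_range() helper are simplifying assumptions.

#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Simplifying assumption: guest RAM is one anonymous mapping starting at
 * ram_base, so guest physical address gpa lives at host virtual address
 * ram_base + gpa.  Real VMMs track multiple RAM blocks.
 */
static char *ram_base;

/* Invoked when the guest reports that [gpa, gpa + len) is currently free. */
static int discard_reported_range(uint64_t gpa, size_t len)
{
        /*
         * MADV_FREE: the host keeps the pages until it is under memory
         * pressure.  If the guest reuses the range before reclaim runs,
         * the old host page is still there and no fresh zeroed page has
         * to be pulled in -- the behaviour discussed in this thread.
         */
        return madvise(ram_base + gpa, len, MADV_FREE);
}

int main(void)
{
        const size_t ram_size = 64UL << 20;     /* pretend 64 MiB of guest RAM */

        ram_base = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (ram_base == MAP_FAILED)
                return 1;

        /* Pretend the guest reported 2 MiB free at guest physical 1 MiB. */
        return discard_reported_range(1UL << 20, 2UL << 20) ? 1 : 0;
}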
David Hildenbrand <david@redhat.com> writes: > On 28.02.20 08:25, Huang, Ying wrote: >> Hi, Matthew, >> >> Matthew Wilcox <willy@infradead.org> writes: >> >>> On Fri, Feb 28, 2020 at 11:38:16AM +0800, Huang, Ying wrote: >>>> MADV_FREE is a lazy free mechanism in Linux. According to the manpage >>>> of mavise(2), the semantics of MADV_FREE is, >>>> >>>> The application no longer requires the pages in the range specified >>>> by addr and len. The kernel can thus free these pages, but the >>>> freeing could be delayed until memory pressure occurs. ... >>>> >>>> Originally, the pages freed lazily by MADV_FREE will only be freed >>>> really by page reclaiming when there is memory pressure or when >>>> unmapping the address range. In addition to that, there's another >>>> opportunity to free these pages really, when we try to migrate them. >>>> >>>> The main value to do that is to avoid to create the new memory >>>> pressure immediately if possible. Instead, even if the pages are >>>> required again, they will be allocated gradually on demand. That is, >>>> the memory will be allocated lazily when necessary. This follows the >>>> common philosophy in the Linux kernel, allocate resources lazily on >>>> demand. >>> >>> Do you have an example program which does this (and so benefits)? >> >> Sorry, what do you mean exactly for "this" here? Call >> madvise(,,MADV_FREE)? Or migrate pages? >> >>> If so, can you quantify the benefit at all? >> >> The question is what is the right workload? For example, I can build a >> scenario as below to show benefit. > > We usually don't optimize for theoretical issues. Is there a real-life > workload you are trying to optimize this code for? We don't use a specific workload because we thought this is a general optimization. I will explain more later in this email. >> >> - run program A in node 0 with many lazily freed pages >> >> - run program B in node 1, so that the free memory on node 1 is low >> >> - migrate the program A from node 0 to node 1, so that the program B is >> influenced by the memory pressure created by migrating lazily freed >> pages. >> > > E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will > report currently free pages to the hypervisor, which will MADV_FREE the > reported memory. As long as there is no memory pressure, there is no > need to actually free the pages. Once the guest reuses such a page, it > could happen that there is still the old page and pulling in in a fresh > (zeroed) page can be avoided. > > AFAIKs, after your change, we would get more pages discarded from our > guest, resulting in more fresh (zeroed) pages having to be pulled in > when a guest touches a reported free page again. But OTOH, page > migration is speed up (avoiding to migrate these pages). Let's look at this problem in another perspective. To migrate the MADV_FREE pages of the QEMU process from the node A to the node B, we need to free the original pages in the node A, and (maybe) allocate the same number of pages in the node B. So the question becomes - we may need to allocate some pages in the node B - these pages may be accessed by the application or not - we should allocate all these pages in advance or allocate them lazily when they are accessed. We thought the common philosophy in Linux kernel is to allocate lazily. 
That is, because we will always free the original pages in node A, the question isn't whether we should free these MADV_FREE pages, but whether we should allocate the same number of pages in node B before we know whether they are really needed. We thought this is similar to whether we should allocate all physical pages when calling mmap(). > However, one important question, will you always discard memory when > migrating pages, or only if there is memory pressure on the migration > target? We will always discard memory when migrating pages. Our reasoning is as above. Best Regards, Huang, Ying
On Fri, Feb 28, 2020 at 04:55:40PM +0800, Huang, Ying wrote: > > E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will > > report currently free pages to the hypervisor, which will MADV_FREE the > > reported memory. As long as there is no memory pressure, there is no > > need to actually free the pages. Once the guest reuses such a page, it > > could happen that there is still the old page and pulling in in a fresh > > (zeroed) page can be avoided. > > > > AFAIKs, after your change, we would get more pages discarded from our > > guest, resulting in more fresh (zeroed) pages having to be pulled in > > when a guest touches a reported free page again. But OTOH, page > > migration is speed up (avoiding to migrate these pages). > > Let's look at this problem in another perspective. To migrate the > MADV_FREE pages of the QEMU process from the node A to the node B, we > need to free the original pages in the node A, and (maybe) allocate the > same number of pages in the node B. So the question becomes > > - we may need to allocate some pages in the node B > - these pages may be accessed by the application or not > - we should allocate all these pages in advance or allocate them lazily > when they are accessed. > > We thought the common philosophy in Linux kernel is to allocate lazily. > I also think there needs to be an example of a real application that benefits from this new behaviour. Consider the possible sources of page migration 1. NUMA balancing -- The application has to read/write the data for this to trigger. In the case of write, MADV_FREE is cancelled and it's mostly likely going to be a write unless it's an application bug. 2. sys_movepages -- the application has explictly stated the data is in use on a particular node yet any MADV_FREE page gets discarded 3. Compaction -- there may be no memory pressure at all but the MADV_FREE memory is discarded prematurely In the first case, the data is explicitly in use, most likely due to a write in which case it's inappropriate to discard. Discarding and reallocating a zero'd page is not expected. In second case, the data is likely in use or else why would the system call be used? In the third case the timing of when MADV_FREE pages disappear is arbitrary as it can happen without any actual memory pressure. This may or may not be problematic but it leads to unpredictable latencies for applications that use MADV_FREE for a quick malloc/free implementation. Before, as long as there is no pressure, the reuse of a MADV_FREE incurs just a small penalty but now with compaction it does not matter if the system avoids memory pressure because they may still incur a fault to allocate and zero a new page. There is a hypothetical fourth case which I only mention because of your email address. If persistent memory is ever used for tiered memory then MADV_FREE pages that migrate from dram to pmem gets discarded instead of migrated. When it's reused, it gets reallocated from dram regardless of whether that region is hot or not. This may lead to an odd scenario whereby applications occupy dram prematurely due to a single reference of a MADV_FREE page. It's all subtle enough that we really should have an example application in mind that benefits so we can weigh the benefits against the potential risks.
On Fri 28-02-20 16:55:40, Huang, Ying wrote: > David Hildenbrand <david@redhat.com> writes: [...] > > E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will > > report currently free pages to the hypervisor, which will MADV_FREE the > > reported memory. As long as there is no memory pressure, there is no > > need to actually free the pages. Once the guest reuses such a page, it > > could happen that there is still the old page and pulling in in a fresh > > (zeroed) page can be avoided. > > > > AFAIKs, after your change, we would get more pages discarded from our > > guest, resulting in more fresh (zeroed) pages having to be pulled in > > when a guest touches a reported free page again. But OTOH, page > > migration is speed up (avoiding to migrate these pages). > > Let's look at this problem in another perspective. To migrate the > MADV_FREE pages of the QEMU process from the node A to the node B, we > need to free the original pages in the node A, and (maybe) allocate the > same number of pages in the node B. So the question becomes > > - we may need to allocate some pages in the node B > - these pages may be accessed by the application or not > - we should allocate all these pages in advance or allocate them lazily > when they are accessed. > > We thought the common philosophy in Linux kernel is to allocate lazily. The common philosophy is to cache as much as possible. And MADV_FREE pages are a kind of cache as well. If the target node is short on memory then those will be reclaimed as a cache so a pro-active freeing sounds counter productive as you do not have any idea whether that cache is going to be used in future. In other words you are not going to free a clean page cache if you want to use that memory as a migration target right? So you should make a clear case about why MADV_FREE cache is less important than the clean page cache and ideally have a good justification backed by real workloads.
On Fri 28-02-20 10:50:50, Michal Hocko wrote: > On Fri 28-02-20 16:55:40, Huang, Ying wrote: > > David Hildenbrand <david@redhat.com> writes: > [...] > > > E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will > > > report currently free pages to the hypervisor, which will MADV_FREE the > > > reported memory. As long as there is no memory pressure, there is no > > > need to actually free the pages. Once the guest reuses such a page, it > > > could happen that there is still the old page and pulling in in a fresh > > > (zeroed) page can be avoided. > > > > > > AFAIKs, after your change, we would get more pages discarded from our > > > guest, resulting in more fresh (zeroed) pages having to be pulled in > > > when a guest touches a reported free page again. But OTOH, page > > > migration is speed up (avoiding to migrate these pages). > > > > Let's look at this problem in another perspective. To migrate the > > MADV_FREE pages of the QEMU process from the node A to the node B, we > > need to free the original pages in the node A, and (maybe) allocate the > > same number of pages in the node B. So the question becomes > > > > - we may need to allocate some pages in the node B > > - these pages may be accessed by the application or not > > - we should allocate all these pages in advance or allocate them lazily > > when they are accessed. > > > > We thought the common philosophy in Linux kernel is to allocate lazily. > > The common philosophy is to cache as much as possible. And MADV_FREE > pages are a kind of cache as well. If the target node is short on memory > then those will be reclaimed as a cache so a pro-active freeing sounds > counter productive as you do not have any idea whether that cache is > going to be used in future. In other words you are not going to free a > clean page cache if you want to use that memory as a migration target sorry I meant to say that you are not going to clean page cache when migrating it. > right? So you should make a clear case about why MADV_FREE cache is less > important than the clean page cache and ideally have a good > justification backed by real workloads.
On Fri, Feb 28, 2020 at 10:50:48AM +0100, Michal Hocko wrote: > On Fri 28-02-20 16:55:40, Huang, Ying wrote: > > David Hildenbrand <david@redhat.com> writes: > [...] > > > E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will > > > report currently free pages to the hypervisor, which will MADV_FREE the > > > reported memory. As long as there is no memory pressure, there is no > > > need to actually free the pages. Once the guest reuses such a page, it > > > could happen that there is still the old page and pulling in in a fresh > > > (zeroed) page can be avoided. > > > > > > AFAIKs, after your change, we would get more pages discarded from our > > > guest, resulting in more fresh (zeroed) pages having to be pulled in > > > when a guest touches a reported free page again. But OTOH, page > > > migration is speed up (avoiding to migrate these pages). > > > > Let's look at this problem in another perspective. To migrate the > > MADV_FREE pages of the QEMU process from the node A to the node B, we > > need to free the original pages in the node A, and (maybe) allocate the > > same number of pages in the node B. So the question becomes > > > > - we may need to allocate some pages in the node B > > - these pages may be accessed by the application or not > > - we should allocate all these pages in advance or allocate them lazily > > when they are accessed. > > > > We thought the common philosophy in Linux kernel is to allocate lazily. > > The common philosophy is to cache as much as possible. And MADV_FREE > pages are a kind of cache as well. If the target node is short on memory > then those will be reclaimed as a cache so a pro-active freeing sounds > counter productive as you do not have any idea whether that cache is > going to be used in future. In other words you are not going to free a > clean page cache if you want to use that memory as a migration target > right? So you should make a clear case about why MADV_FREE cache is less > important than the clean page cache and ideally have a good > justification backed by real workloads. Agreed. MADV_FREE says that the *data* in the pages is no longer needed, and so the pages are cheaper to reclaim than regular anon pages (swap). But people use MADV_FREE in the hope that they can reuse the pages at a later point - otherwise, they'd unmap them. We should retain them until the memory is actively needed for other allocations.
Hi, Mel, Mel Gorman <mgorman@suse.de> writes: > On Fri, Feb 28, 2020 at 04:55:40PM +0800, Huang, Ying wrote: >> > E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will >> > report currently free pages to the hypervisor, which will MADV_FREE the >> > reported memory. As long as there is no memory pressure, there is no >> > need to actually free the pages. Once the guest reuses such a page, it >> > could happen that there is still the old page and pulling in in a fresh >> > (zeroed) page can be avoided. >> > >> > AFAIKs, after your change, we would get more pages discarded from our >> > guest, resulting in more fresh (zeroed) pages having to be pulled in >> > when a guest touches a reported free page again. But OTOH, page >> > migration is speed up (avoiding to migrate these pages). >> >> Let's look at this problem in another perspective. To migrate the >> MADV_FREE pages of the QEMU process from the node A to the node B, we >> need to free the original pages in the node A, and (maybe) allocate the >> same number of pages in the node B. So the question becomes >> >> - we may need to allocate some pages in the node B >> - these pages may be accessed by the application or not >> - we should allocate all these pages in advance or allocate them lazily >> when they are accessed. >> >> We thought the common philosophy in Linux kernel is to allocate lazily. >> > > I also think there needs to be an example of a real application that > benefits from this new behaviour. Consider the possible sources of page > migration > > 1. NUMA balancing -- The application has to read/write the data for this > to trigger. In the case of write, MADV_FREE is cancelled and it's > mostly likely going to be a write unless it's an application bug. > 2. sys_movepages -- the application has explictly stated the data is in > use on a particular node yet any MADV_FREE page gets discarded > 3. Compaction -- there may be no memory pressure at all but the > MADV_FREE memory is discarded prematurely > > In the first case, the data is explicitly in use, most likely due to > a write in which case it's inappropriate to discard. Discarding and > reallocating a zero'd page is not expected. If my understanding were correct, NUMA balancing will not try to migrate clean MADV_FREE pages most of the times. Because the lazily freed pages may be zeroed at any time, it makes almost no sense to read them. So, the first access after being freed lazily should be writing. That will make the page dirty MADV_FREE, so will not be discarded. And the first writing access in a new node will not trigger migration usually because of the two-stage filter in should_numa_migrate_memory(). So current behavior in case 1 will not change most of the times after the patchset. > In second case, the data is likely in use or else why would the system > call be used? If the pages were in use, they shouldn't be clean MADV_FREE pages. So no behavior change after this patchset. There is at least one use case to move the clean MADV_FREE pages (that is, not in use), when we move an application from one node to another node via something like /usr/bin/migratepages. I think this is kind of NUMA balancing by hand. It appears reasonable to discard clean MADV_FREE pages at least in some situations. > In the third case the timing of when MADV_FREE pages disappear is > arbitrary as it can happen without any actual memory pressure. 
This > may or may not be problematic but it leads to unpredictable latencies > for applications that use MADV_FREE for a quick malloc/free > implementation. Before, as long as there is no pressure, the reuse of > a MADV_FREE incurs just a small penalty but now with compaction it > does not matter if the system avoids memory pressure because they may > still incur a fault to allocate and zero a new page. Yes. This is a real problem. Previously we thought that the migration is kind of - unmap and free the old pages - map the new pages If we can allocate new pages lazily in mmap(), why can't we allocate new pages lazily when migrating pages too, if possible (for clean MADV_FREE pages), because its second stage is kind of mmap() too. But you remind us that there are some differences, - mmap() is called by the application directly, so its effect is predictable. While some migration, like that in compaction, isn't called by the application, so it may have unpredictable behavior. - With mmap(), the application can choose to allocate new pages lazily or eagerly. But we lack the same mechanism in this patchset. So, maybe we can make the mechanism more flexible. That is, let the administrator or the application choose the right behavior for them, via some system level configuration knob or API flags. For example, if the memory allocation latency isn't critical for the workloads, the administrator can choose to discard clean MADV_FREE pages to make compaction quicker and easier? > There is a hypothetical fourth case which I only mention because of your > email address. If persistent memory is ever used for tiered memory then > MADV_FREE pages that migrate from dram to pmem gets discarded instead > of migrated. When it's reused, it gets reallocated from dram regardless > of whether that region is hot or not. Thanks for mentioning this use case because we are working on this too. Firstly, the memory writing bandwidth of the persistent memory is much less than its reading bandwidth, so it is very valuable to avoid writing persistent memory unnecessarily. Discarding instead of migrating here can help save the memory writing bandwidth of the persistent memory. Secondly, we will always allocate pages in DRAM before knowing they are hot or cold if possible, and only demote the known cold pages to the persistent memory. Only when the access pattern of these cold pages changes to hot will we promote them back to DRAM. In this way, the always hot pages can be kept in DRAM from beginning to end. After MADV_FREE, the contents may be zeroed, so their access pattern becomes unknown too (e.g. free an object, then allocate a different object). If we want to follow the above rule, we should put the new page with unknown temperature in DRAM to give it an opportunity. > This may lead to an odd scenario whereby applications occupy dram > prematurely due to a single reference of a MADV_FREE page. The better way to deal with this is to enhance mem_cgroup to limit DRAM usage in a memory tiering system? > It's all subtle enough that we really should have an example application > in mind that benefits so we can weigh the benefits against the potential > risks. If some applications cannot tolerate the latency incurred by the memory allocation and zeroing, then we cannot always discard instead of migrate. While in some situations, less memory pressure can help. So it's better to let the administrator and the application choose the right behavior in the specific situation? Best Regards, Huang, Ying
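To make the sys_movepages case discussed above concrete, here is a small sketch using move_pages(2); the single page and the target node number are arbitrary assumptions, and privileges and error handling are glossed over.

#define _DEFAULT_SOURCE
#include <numaif.h>             /* move_pages(2); link with -lnuma */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long page_size = sysconf(_SC_PAGESIZE);

        char *buf = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return 1;

        memset(buf, 1, page_size);              /* populate the page */
        madvise(buf, page_size, MADV_FREE);     /* then lazily free it */

        /*
         * Explicitly ask for this page to be placed on node 1.  Under the
         * patchset being discussed, a clean MADV_FREE page is discarded
         * here rather than migrated, even though the caller just expressed
         * interest in where it lives -- the second case above.
         */
        void *pages[1] = { buf };
        int nodes[1] = { 1 };
        int status[1];

        if (move_pages(0 /* self */, 1, pages, nodes, status, MPOL_MF_MOVE) < 0) {
                perror("move_pages");
                return 1;
        }
        printf("page status/node after move: %d\n", status[0]);
        return 0;
}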
Michal Hocko <mhocko@kernel.org> writes: > On Fri 28-02-20 16:55:40, Huang, Ying wrote: >> David Hildenbrand <david@redhat.com> writes: > [...] >> > E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will >> > report currently free pages to the hypervisor, which will MADV_FREE the >> > reported memory. As long as there is no memory pressure, there is no >> > need to actually free the pages. Once the guest reuses such a page, it >> > could happen that there is still the old page and pulling in in a fresh >> > (zeroed) page can be avoided. >> > >> > AFAIKs, after your change, we would get more pages discarded from our >> > guest, resulting in more fresh (zeroed) pages having to be pulled in >> > when a guest touches a reported free page again. But OTOH, page >> > migration is speed up (avoiding to migrate these pages). >> >> Let's look at this problem in another perspective. To migrate the >> MADV_FREE pages of the QEMU process from the node A to the node B, we >> need to free the original pages in the node A, and (maybe) allocate the >> same number of pages in the node B. So the question becomes >> >> - we may need to allocate some pages in the node B >> - these pages may be accessed by the application or not >> - we should allocate all these pages in advance or allocate them lazily >> when they are accessed. >> >> We thought the common philosophy in Linux kernel is to allocate lazily. > > The common philosophy is to cache as much as possible. Yes. This is another common philosophy. And MADV_FREE pages are different from caches such as the page cache because they have no valid contents. And this patchset doesn't disable the MADV_FREE mechanism. It just changes the migration behavior. So MADV_FREE pages will be kept until reclaim most of the time. > And MADV_FREE pages are a kind of cache as well. If the target node is > short on memory then those will be reclaimed as a cache so a > pro-active freeing sounds counter productive as you do not have any > idea whether that cache is going to be used in future. In other words > you are not going to free a clean page cache if you want to use that > memory as a migration target right? So you should make a clear case > about why MADV_FREE cache is less important than the clean page cache > and ideally have a good justification backed by real workloads. Clean page cache pages still have valid contents, while clean MADV_FREE pages have no valid contents. So the penalty of discarding the clean page cache is reading from disk, while the penalty of discarding clean MADV_FREE pages is just page allocation and zeroing. I understand that MADV_FREE is another kind of cache and has its value. But in the original implementation, during migration, we have already freed the original "cache", then we reallocate the cache elsewhere and copy. This appears more like all pages are populated in mmap() always. I know there's value in populating all pages in mmap(), but does that need to be done always by default? Best Regards, Huang, Ying
On 02.03.20 15:12, Huang, Ying wrote: > Michal Hocko <mhocko@kernel.org> writes: > >> On Fri 28-02-20 16:55:40, Huang, Ying wrote: >>> David Hildenbrand <david@redhat.com> writes: >> [...] >>>> E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will >>>> report currently free pages to the hypervisor, which will MADV_FREE the >>>> reported memory. As long as there is no memory pressure, there is no >>>> need to actually free the pages. Once the guest reuses such a page, it >>>> could happen that there is still the old page and pulling in in a fresh >>>> (zeroed) page can be avoided. >>>> >>>> AFAIKs, after your change, we would get more pages discarded from our >>>> guest, resulting in more fresh (zeroed) pages having to be pulled in >>>> when a guest touches a reported free page again. But OTOH, page >>>> migration is speed up (avoiding to migrate these pages). >>> >>> Let's look at this problem in another perspective. To migrate the >>> MADV_FREE pages of the QEMU process from the node A to the node B, we >>> need to free the original pages in the node A, and (maybe) allocate the >>> same number of pages in the node B. So the question becomes >>> >>> - we may need to allocate some pages in the node B >>> - these pages may be accessed by the application or not >>> - we should allocate all these pages in advance or allocate them lazily >>> when they are accessed. >>> >>> We thought the common philosophy in Linux kernel is to allocate lazily. >> >> The common philosophy is to cache as much as possible. > > Yes. This is another common philosophy. And MADV_FREE pages is > different from caches such as the page caches because it has no valid > contents. Side note: It might contain valid content until discarded/zeroed out. E.g., an application could use a marker bit (e.g., first bit) to detect if the page still contains valid data or not. If the data is still marked valid, the content could be reuse immediately. Not sure if there is any such application, though :)
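A minimal sketch of the marker idea mentioned above, assuming one page and a one-byte marker at its start. It is illustrative only: it ignores the race a real scheme would have to handle (the page can still be discarded between checking the marker and reusing the data, since reads do not cancel MADV_FREE).

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MARKER 0x5a     /* any non-zero byte works as the "still valid" marker */

int main(void)
{
        long page_size = sysconf(_SC_PAGESIZE);

        char *page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED)
                return 1;

        /* Fill the page with data, set the marker, then lazily free it. */
        memset(page, 0xee, page_size);
        page[0] = MARKER;
        madvise(page, page_size, MADV_FREE);

        /* ... time passes; the kernel may or may not have reclaimed the page ... */

        if (page[0] == MARKER) {
                /*
                 * Old contents are still present (at least right now).  A
                 * real implementation would write to the page here, which
                 * cancels the lazy free, before relying on the contents.
                 */
                puts("page survived, contents reusable");
        } else {
                /* The page was discarded and replaced by a fresh zeroed page. */
                puts("page was discarded, contents must be regenerated");
        }

        munmap(page, page_size);
        return 0;
}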
On Mon 02-03-20 22:12:53, Huang, Ying wrote: > Michal Hocko <mhocko@kernel.org> writes: [...] > > And MADV_FREE pages are a kind of cache as well. If the target node is > > short on memory then those will be reclaimed as a cache so a > > pro-active freeing sounds counter productive as you do not have any > > idea whether that cache is going to be used in future. In other words > > you are not going to free a clean page cache if you want to use that > > memory as a migration target right? So you should make a clear case > > about why MADV_FREE cache is less important than the clean page cache > > and ideally have a good justification backed by real workloads. > > Clean page cache still have valid contents, while clean MADV_FREE pages > has no valid contents. So penalty of discarding the clean page cache is > reading from disk, while the penalty of discarding clean MADV_FREE pages > is just page allocation and zeroing. And "just page allocation and zeroing" overhead is the primary motivation to keep the page in memory. It is a decision of the workload to use MADV_FREE because chances are that this will speed things up. All that with a contract that the memory goes away under memory pressure so with a good workload/memory sizing you do not really lose that optimization. Now you want to make decision on behalf of the consumer of the MADV_FREE memory. > I understand that MADV_FREE is another kind of cache and has its value. > But in the original implementation, during migration, we have already > freed the original "cache", then reallocate the cache elsewhere and > copy. This appears more like all pages are populated in mmap() always. > I know there's value to populate all pages in mmap(), but does that need > to be done always by default? It is not. You have to explicitly request MAP_POPULATE to initialize mmap.
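For reference, the distinction Michal points to is a single flag at mmap() time. A small sketch with an arbitrary 64 MiB region:

#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
        const size_t len = 64UL << 20;  /* arbitrary 64 MiB region */

        /* Default: physical pages are allocated lazily, at first touch. */
        void *lazy = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Opt-in: MAP_POPULATE pre-faults the whole range up front. */
        void *eager = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

        if (lazy == MAP_FAILED || eager == MAP_FAILED)
                return 1;

        munmap(lazy, len);
        munmap(eager, len);
        return 0;
}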
On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote: > Mel Gorman <mgorman@suse.de> writes: > > > On Fri, Feb 28, 2020 at 04:55:40PM +0800, Huang, Ying wrote: > >> > E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will > >> > report currently free pages to the hypervisor, which will MADV_FREE the > >> > reported memory. As long as there is no memory pressure, there is no > >> > need to actually free the pages. Once the guest reuses such a page, it > >> > could happen that there is still the old page and pulling in in a fresh > >> > (zeroed) page can be avoided. > >> > > >> > AFAIKs, after your change, we would get more pages discarded from our > >> > guest, resulting in more fresh (zeroed) pages having to be pulled in > >> > when a guest touches a reported free page again. But OTOH, page > >> > migration is speed up (avoiding to migrate these pages). > >> > >> Let's look at this problem in another perspective. To migrate the > >> MADV_FREE pages of the QEMU process from the node A to the node B, we > >> need to free the original pages in the node A, and (maybe) allocate the > >> same number of pages in the node B. So the question becomes > >> > >> - we may need to allocate some pages in the node B > >> - these pages may be accessed by the application or not > >> - we should allocate all these pages in advance or allocate them lazily > >> when they are accessed. > >> > >> We thought the common philosophy in Linux kernel is to allocate lazily. > >> > > > > I also think there needs to be an example of a real application that > > benefits from this new behaviour. Consider the possible sources of page > > migration > > > > 1. NUMA balancing -- The application has to read/write the data for this > > to trigger. In the case of write, MADV_FREE is cancelled and it's > > mostly likely going to be a write unless it's an application bug. > > 2. sys_movepages -- the application has explictly stated the data is in > > use on a particular node yet any MADV_FREE page gets discarded > > 3. Compaction -- there may be no memory pressure at all but the > > MADV_FREE memory is discarded prematurely > > > > In the first case, the data is explicitly in use, most likely due to > > a write in which case it's inappropriate to discard. Discarding and > > reallocating a zero'd page is not expected. > > If my understanding were correct, NUMA balancing will not try to migrate > clean MADV_FREE pages most of the times. Because the lazily freed pages > may be zeroed at any time, it makes almost no sense to read them. So, > the first access after being freed lazily should be writing. That will > make the page dirty MADV_FREE, so will not be discarded. And the first > writing access in a new node will not trigger migration usually because > of the two-stage filter in should_numa_migrate_memory(). So current > behavior in case 1 will not change most of the times after the patchset. > Yes, so in this case the series neither helps nor hurts NUMA balancing. > > In second case, the data is likely in use or else why would the system > > call be used? > > If the pages were in use, they shouldn't be clean MADV_FREE pages. So > no behavior change after this patchset. > You cannot guarantee that. The application could be caching them optimistically as long as they stay resident until memory pressure forces them out. Consider something like an in-memory object database. For very old objects, it might decide to mark them MADV_FREE using a header at the start to detect when a page still has valid data. 
The intent would be that under memory pressure, the hot data would be preserved as long as possible. I'm not aware of such an application, it simply is a valid use case. Similarly, a malloc implementation may be using MADV_FREE to mark freed buffers so they can be quickly reused. Now if they use sys_movepages, they take the alloc+zero hit and the strategy is less useful. > > In the third case the timing of when MADV_FREE pages disappear is > > arbitrary as it can happen without any actual memory pressure. This > > may or may not be problematic but it leads to unpredictable latencies > > for applications that use MADV_FREE for a quick malloc/free > > implementation. Before, as long as there is no pressure, the reuse of > > a MADV_FREE incurs just a small penalty but now with compaction it > > does not matter if the system avoids memory pressure because they may > > still incur a fault to allocate and zero a new page. > > Yes. This is a real problem. Previously we thought that the migration > is kind of > > - unmap and free the old pages > - map the new pages > > If we can allocate new pages lazily in mmap(), why cannot we allocate > new pages lazily in migrating pages too if possible (for clean > MADV_CLEAN pages) because its second stage is kind of mmap() too. But > you remind us that there are some differences, > > - mmap() is called by the application directly, so its effect is > predictable. While some migration like that in the compaction isn't > called by the application, so may have unpredictable behavior. > - With mmap(), the application can choose to allocate new pages lazily > or eagerly. But we lacks the same mechanism in this patchset. > > So, maybe we can make the mechanism more flexible. That is, let the > administrator or the application choose the right behavior for them, via > some system level configuration knob or API flags. For example, if the > memory allocation latency isn't critical for the workloads, the > administrator can choose to discard clean MADV_FREE pages to make > compaction quicker and easier? > That will be the type of knob that is almost impossible to tune for. More on this later. > > <SNIP> > > > > This may lead to an odd scenario whereby applications occupy dram > > prematurely due to a single reference of a MADV_FREE page. > > The better way to deal with this is to enhance mem_cgroup to limit DRAM > usage in a memory tiering system? > I did not respond to this properly because it's a separate discussion on implementation details of something that does not exist. I only mentioned it to highlight a potential hazard that could raise its head later. > > It's all subtle enough that we really should have an example application > > in mind that benefits so we can weigh the benefits against the potential > > risks. > > If some applications cannot tolerate the latency incurred by the memory > allocation and zeroing. Then we cannot discard instead of migrate > always. While in some situations, less memory pressure can help. So > it's better to let the administrator and the application choose the > right behavior in the specific situation? > Is there an application you have in mind that benefits from discarding MADV_FREE pages instead of migrating them? Allowing the administrator or application to tune this would be very problematic. An application would require an update to the system call to take advantage of it and then detect if the running kernel supports it. 
An administrator would have to detect that MADV_FREE pages are being prematurely discarded leading to a slowdown and that is hard to detect. It could be inferred from monitoring compaction stats and checking if compaction activity is correlated with higher minor faults in the target application. Proving the correlation would require using the perf software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses to MADV_FREE regions that were freed prematurely. That is not an obvious debugging step to take when an application detects latency spikes. Now, you could add a counter specifically for MADV_FREE pages freed for reasons other than memory pressure and hope the administrator knows about the counter and what it means. That type of knowledge could take a long time to spread so it's really very important that there is evidence of an application that suffers due to the current MADV_FREE and migration behaviour.
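The measurement Mel describes can be scripted around the cited software event. A rough sketch that counts minor faults in the calling process across a region of interest; what to put in the measured section is left as an assumption.

#define _DEFAULT_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_SOFTWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_SW_PAGE_FAULTS_MIN;    /* minor faults */
        attr.disabled = 1;

        /* Count for this process, on any CPU. */
        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) {
                perror("perf_event_open");
                return 1;
        }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /*
         * Measured section: e.g. re-touch a region that was MADV_FREEd and
         * possibly migrated.  A jump in this counter correlated with
         * compaction/migration activity is the signal described above.
         */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t minor_faults;
        if (read(fd, &minor_faults, sizeof(minor_faults)) != sizeof(minor_faults))
                return 1;
        printf("minor faults in measured section: %llu\n",
               (unsigned long long)minor_faults);
        close(fd);
        return 0;
}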
David Hildenbrand <david@redhat.com> writes: > On 02.03.20 15:12, Huang, Ying wrote: >> Michal Hocko <mhocko@kernel.org> writes: >> >>> On Fri 28-02-20 16:55:40, Huang, Ying wrote: >>>> David Hildenbrand <david@redhat.com> writes: >>> [...] >>>>> E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will >>>>> report currently free pages to the hypervisor, which will MADV_FREE the >>>>> reported memory. As long as there is no memory pressure, there is no >>>>> need to actually free the pages. Once the guest reuses such a page, it >>>>> could happen that there is still the old page and pulling in in a fresh >>>>> (zeroed) page can be avoided. >>>>> >>>>> AFAIKs, after your change, we would get more pages discarded from our >>>>> guest, resulting in more fresh (zeroed) pages having to be pulled in >>>>> when a guest touches a reported free page again. But OTOH, page >>>>> migration is speed up (avoiding to migrate these pages). >>>> >>>> Let's look at this problem in another perspective. To migrate the >>>> MADV_FREE pages of the QEMU process from the node A to the node B, we >>>> need to free the original pages in the node A, and (maybe) allocate the >>>> same number of pages in the node B. So the question becomes >>>> >>>> - we may need to allocate some pages in the node B >>>> - these pages may be accessed by the application or not >>>> - we should allocate all these pages in advance or allocate them lazily >>>> when they are accessed. >>>> >>>> We thought the common philosophy in Linux kernel is to allocate lazily. >>> >>> The common philosophy is to cache as much as possible. >> >> Yes. This is another common philosophy. And MADV_FREE pages is >> different from caches such as the page caches because it has no valid >> contents. > > Side note: It might contain valid content until discarded/zeroed out. > E.g., an application could use a marker bit (e.g., first bit) to detect > if the page still contains valid data or not. If the data is still > marked valid, the content could be reuse immediately. Not sure if there > is any such application, though :) I don't think this is the typical use case. But I admit that this is possible. Best Regards, Huang, Ying
Michal Hocko <mhocko@kernel.org> writes: > On Mon 02-03-20 22:12:53, Huang, Ying wrote: >> Michal Hocko <mhocko@kernel.org> writes: > [...] >> > And MADV_FREE pages are a kind of cache as well. If the target node is >> > short on memory then those will be reclaimed as a cache so a >> > pro-active freeing sounds counter productive as you do not have any >> > idea whether that cache is going to be used in future. In other words >> > you are not going to free a clean page cache if you want to use that >> > memory as a migration target right? So you should make a clear case >> > about why MADV_FREE cache is less important than the clean page cache >> > and ideally have a good justification backed by real workloads. >> >> Clean page cache still have valid contents, while clean MADV_FREE pages >> has no valid contents. So penalty of discarding the clean page cache is >> reading from disk, while the penalty of discarding clean MADV_FREE pages >> is just page allocation and zeroing. > > And "just page allocation and zeroing" overhead is the primary > motivation to keep the page in memory. It is a decision of the workload > to use MADV_FREE because chances are that this will speed things up. All > that with a contract that the memory goes away under memory pressure so > with a good workload/memory sizing you do not really lose that > optimization. Now you want to make decision on behalf of the consumer of > the MADV_FREE memory. I understand that MADV_FREE helps in some situations. And if the application want to keep the "contract" after migration, they should have a way to do that. >> I understand that MADV_FREE is another kind of cache and has its value. >> But in the original implementation, during migration, we have already >> freed the original "cache", then reallocate the cache elsewhere and >> copy. This appears more like all pages are populated in mmap() always. >> I know there's value to populate all pages in mmap(), but does that need >> to be done always by default? > > It is not. You have to explicitly request MAP_POPULATE to initialize > mmap. Yes. mmap() can control whether to populate the underlying physical pages. But for migrating MADV_FREE pages, there's no control, all pages will be populated again always by default. Maybe we should avoid to do that in some situations too. Best Regards, Huang, Ying
Mel Gorman <mgorman@suse.de> writes: > On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote: >> If some applications cannot tolerate the latency incurred by the memory >> allocation and zeroing. Then we cannot discard instead of migrate >> always. While in some situations, less memory pressure can help. So >> it's better to let the administrator and the application choose the >> right behavior in the specific situation? >> > > Is there an application you have in mind that benefits from discarding > MADV_FREE pages instead of migrating them? > > Allowing the administrator or application to tune this would be very > problematic. An application would require an update to the system call > to take advantage of it and then detect if the running kernel supports > it. An administrator would have to detect that MADV_FREE pages are being > prematurely discarded leading to a slowdown and that is hard to detect. > It could be inferred from monitoring compaction stats and checking > if compaction activity is correlated with higher minor faults in the > target application. Proving the correlation would require using the perf > software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses > to MADV_FREE regions that were freed prematurely. That is not an obvious > debugging step to take when an application detects latency spikes. > > Now, you could add a counter specifically for MADV_FREE pages freed for > reasons other than memory pressure and hope the administrator knows about > the counter and what it means. That type of knowledge could take a long > time to spread so it's really very important that there is evidence of > an application that suffers due to the current MADV_FREE and migration > behaviour. OK. I understand that this patchset isn't a universal win, so we need some way to justify it. I will try to find some application for that. Another thought, as proposed by David Hildenbrand, it may be a universal win to discard clean MADV_FREE pages when migrating if there is already memory pressure on the target node. For example, if the free memory on the target node is below the high watermark? Best Regards, Huang, Ying
On Tue 03-03-20 09:51:56, Huang, Ying wrote: > Mel Gorman <mgorman@suse.de> writes: > > On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote: > >> If some applications cannot tolerate the latency incurred by the memory > >> allocation and zeroing. Then we cannot discard instead of migrate > >> always. While in some situations, less memory pressure can help. So > >> it's better to let the administrator and the application choose the > >> right behavior in the specific situation? > >> > > > > Is there an application you have in mind that benefits from discarding > > MADV_FREE pages instead of migrating them? > > > > Allowing the administrator or application to tune this would be very > > problematic. An application would require an update to the system call > > to take advantage of it and then detect if the running kernel supports > > it. An administrator would have to detect that MADV_FREE pages are being > > prematurely discarded leading to a slowdown and that is hard to detect. > > It could be inferred from monitoring compaction stats and checking > > if compaction activity is correlated with higher minor faults in the > > target application. Proving the correlation would require using the perf > > software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses > > to MADV_FREE regions that were freed prematurely. That is not an obvious > > debugging step to take when an application detects latency spikes. > > > > Now, you could add a counter specifically for MADV_FREE pages freed for > > reasons other than memory pressure and hope the administrator knows about > > the counter and what it means. That type of knowledge could take a long > > time to spread so it's really very important that there is evidence of > > an application that suffers due to the current MADV_FREE and migration > > behaviour. > > OK. I understand that this patchset isn't a universal win, so we need > some way to justify it. I will try to find some application for that. > > Another thought, as proposed by David Hildenbrand, it's may be a > universal win to discard clean MADV_FREE pages when migrating if there are > already memory pressure on the target node. For example, if the free > memory on the target node is lower than high watermark? This is already happening because if the target node is short on memory it will start to reclaim and if MADV_FREE pages are at the tail of inactive file LRU list then they will be dropped. Please note how that follows proper aging and doesn't introduce any special casing. Really MADV_FREE is an inactive cache for anonymous memory and we treat it like inactive page cache. This is not carved in stone of course but it really requires very good justification to change.
On Tue 03-03-20 09:30:28, Huang, Ying wrote: [...] > Yes. mmap() can control whether to populate the underlying physical > pages. Right, because many use cases benefit from it. They simply know that the mapping will be used completely and it is worth saving overhead for #PF. See, there is a clear justification for that policy. > But for migrating MADV_FREE pages, there's no control, all pages > will be populated again always by default. Maybe we should avoid to do > that in some situations too. Now let's have a look here. It is the userspace that decided to mark MADV_FREE pages. It is under its full control which pages are to be freed lazily. If the userspace wants to move those pages then it is likely aware they have been MADV_FREE, right? If the userspace wanted to save migration overhead then it could either choose to not migrate those pages or simply unmap them right away. So in the end we are talking about saving munmap/MADV_DONTNEED or potentially more move_pages calls to skip over MADV_FREE holes. Which is all nice but is there any userspace that really does care? Because this is a fundamental question here and it doesn't make much sense to discuss this left to right unless this is clear.
Michal Hocko <mhocko@kernel.org> writes: > On Tue 03-03-20 09:51:56, Huang, Ying wrote: >> Mel Gorman <mgorman@suse.de> writes: >> > On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote: >> >> If some applications cannot tolerate the latency incurred by the memory >> >> allocation and zeroing. Then we cannot discard instead of migrate >> >> always. While in some situations, less memory pressure can help. So >> >> it's better to let the administrator and the application choose the >> >> right behavior in the specific situation? >> >> >> > >> > Is there an application you have in mind that benefits from discarding >> > MADV_FREE pages instead of migrating them? >> > >> > Allowing the administrator or application to tune this would be very >> > problematic. An application would require an update to the system call >> > to take advantage of it and then detect if the running kernel supports >> > it. An administrator would have to detect that MADV_FREE pages are being >> > prematurely discarded leading to a slowdown and that is hard to detect. >> > It could be inferred from monitoring compaction stats and checking >> > if compaction activity is correlated with higher minor faults in the >> > target application. Proving the correlation would require using the perf >> > software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses >> > to MADV_FREE regions that were freed prematurely. That is not an obvious >> > debugging step to take when an application detects latency spikes. >> > >> > Now, you could add a counter specifically for MADV_FREE pages freed for >> > reasons other than memory pressure and hope the administrator knows about >> > the counter and what it means. That type of knowledge could take a long >> > time to spread so it's really very important that there is evidence of >> > an application that suffers due to the current MADV_FREE and migration >> > behaviour. >> >> OK. I understand that this patchset isn't a universal win, so we need >> some way to justify it. I will try to find some application for that. >> >> Another thought, as proposed by David Hildenbrand, it's may be a >> universal win to discard clean MADV_FREE pages when migrating if there are >> already memory pressure on the target node. For example, if the free >> memory on the target node is lower than high watermark? > > This is already happening because if the target node is short on memory > it will start to reclaim and if MADV_FREE pages are at the tail of > inactive file LRU list then they will be dropped. Please note how that > follows proper aging and doesn't introduce any special casing. Really > MADV_FREE is an inactive cache for anonymous memory and we treat it like > inactive page cache. This is not carved in stone of course but it really > requires very good justification to change. If my understanding were correct, the newly migrated clean MADV_FREE pages will be put at the head of inactive file LRU list instead of the tail. So it's possible that some useful file cache pages will be reclaimed. Best Regards, Huang, Ying
On Tue 03-03-20 16:47:54, Huang, Ying wrote: > Michal Hocko <mhocko@kernel.org> writes: > > > On Tue 03-03-20 09:51:56, Huang, Ying wrote: > >> Mel Gorman <mgorman@suse.de> writes: > >> > On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote: > >> >> If some applications cannot tolerate the latency incurred by the memory > >> >> allocation and zeroing. Then we cannot discard instead of migrate > >> >> always. While in some situations, less memory pressure can help. So > >> >> it's better to let the administrator and the application choose the > >> >> right behavior in the specific situation? > >> >> > >> > > >> > Is there an application you have in mind that benefits from discarding > >> > MADV_FREE pages instead of migrating them? > >> > > >> > Allowing the administrator or application to tune this would be very > >> > problematic. An application would require an update to the system call > >> > to take advantage of it and then detect if the running kernel supports > >> > it. An administrator would have to detect that MADV_FREE pages are being > >> > prematurely discarded leading to a slowdown and that is hard to detect. > >> > It could be inferred from monitoring compaction stats and checking > >> > if compaction activity is correlated with higher minor faults in the > >> > target application. Proving the correlation would require using the perf > >> > software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses > >> > to MADV_FREE regions that were freed prematurely. That is not an obvious > >> > debugging step to take when an application detects latency spikes. > >> > > >> > Now, you could add a counter specifically for MADV_FREE pages freed for > >> > reasons other than memory pressure and hope the administrator knows about > >> > the counter and what it means. That type of knowledge could take a long > >> > time to spread so it's really very important that there is evidence of > >> > an application that suffers due to the current MADV_FREE and migration > >> > behaviour. > >> > >> OK. I understand that this patchset isn't a universal win, so we need > >> some way to justify it. I will try to find some application for that. > >> > >> Another thought, as proposed by David Hildenbrand, it's may be a > >> universal win to discard clean MADV_FREE pages when migrating if there are > >> already memory pressure on the target node. For example, if the free > >> memory on the target node is lower than high watermark? > > > > This is already happening because if the target node is short on memory > > it will start to reclaim and if MADV_FREE pages are at the tail of > > inactive file LRU list then they will be dropped. Please note how that > > follows proper aging and doesn't introduce any special casing. Really > > MADV_FREE is an inactive cache for anonymous memory and we treat it like > > inactive page cache. This is not carved in stone of course but it really > > requires very good justification to change. > > If my understanding were correct, the newly migrated clean MADV_FREE > pages will be put at the head of inactive file LRU list instead of the > tail. So it's possible that some useful file cache pages will be > reclaimed. This is the case also when you migrate other pages, right? We simply cannot preserve the aging.
Michal Hocko <mhocko@kernel.org> writes: > On Tue 03-03-20 09:30:28, Huang, Ying wrote: > [...] >> Yes. mmap() can control whether to populate the underlying physical >> pages. > > right because many usecases benefit from it. They simply know that the > mapping will be used completely and it is worth saving overhead for #PF. > See. there is a clear justification for that policy. > >> But for migrating MADV_FREE pages, there's no control, all pages >> will be populated again always by default. Maybe we should avoid to do >> that in some situations too. > > Now let's have a look here. It is the userspace that decided to mark > MADV_FREE pages. It is under its full control which pages are to be > freed lazily. If the userspace wants to move those pages then it is > likely aware they have been MADV_FREE, right? If the userspace wanted to > save migration overhead then it could either chose to not migrate those > pages or simply unmap them right away. So in the end we are talking > about saving munmap/MAMDV_DONTNEED or potentially more move_pages calls > to skip over MADV_FREE holes. Which is all nice but is there any > userspace that really does care? Because this is a fundamental question > here and it doesn't make much sense to discuss this left to right unless > this is clear. Although I don't agree with you, I don't want to continue, because I feel the discussion may be too general to go anywhere. I admit that I went to the general side first, sorry about that. Best Regards, Huang, Ying
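As a concrete illustration of the populate control mentioned above: the mmap() flag in question is MAP_POPULATE, which pre-faults the mapping up front instead of taking one page fault per page on first touch. A minimal sketch, not taken from the thread; the mapping size is arbitrary.

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 16 * 1024 * 1024;

        /* Populated lazily: each page is allocated on first touch (#PF). */
        char *lazy = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* MAP_POPULATE pre-faults the whole range at mmap() time. */
        char *eager = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

        if (lazy == MAP_FAILED || eager == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        lazy[0] = 1;   /* this touch faults in one page now */
        eager[0] = 1;  /* this touch normally hits an already populated page */

        munmap(lazy, len);
        munmap(eager, len);
        return 0;
    }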
Michal Hocko <mhocko@kernel.org> writes: > On Tue 03-03-20 16:47:54, Huang, Ying wrote: >> Michal Hocko <mhocko@kernel.org> writes: >> >> > On Tue 03-03-20 09:51:56, Huang, Ying wrote: >> >> Mel Gorman <mgorman@suse.de> writes: >> >> > On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote: >> >> >> If some applications cannot tolerate the latency incurred by the memory >> >> >> allocation and zeroing. Then we cannot discard instead of migrate >> >> >> always. While in some situations, less memory pressure can help. So >> >> >> it's better to let the administrator and the application choose the >> >> >> right behavior in the specific situation? >> >> >> >> >> > >> >> > Is there an application you have in mind that benefits from discarding >> >> > MADV_FREE pages instead of migrating them? >> >> > >> >> > Allowing the administrator or application to tune this would be very >> >> > problematic. An application would require an update to the system call >> >> > to take advantage of it and then detect if the running kernel supports >> >> > it. An administrator would have to detect that MADV_FREE pages are being >> >> > prematurely discarded leading to a slowdown and that is hard to detect. >> >> > It could be inferred from monitoring compaction stats and checking >> >> > if compaction activity is correlated with higher minor faults in the >> >> > target application. Proving the correlation would require using the perf >> >> > software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses >> >> > to MADV_FREE regions that were freed prematurely. That is not an obvious >> >> > debugging step to take when an application detects latency spikes. >> >> > >> >> > Now, you could add a counter specifically for MADV_FREE pages freed for >> >> > reasons other than memory pressure and hope the administrator knows about >> >> > the counter and what it means. That type of knowledge could take a long >> >> > time to spread so it's really very important that there is evidence of >> >> > an application that suffers due to the current MADV_FREE and migration >> >> > behaviour. >> >> >> >> OK. I understand that this patchset isn't a universal win, so we need >> >> some way to justify it. I will try to find some application for that. >> >> >> >> Another thought, as proposed by David Hildenbrand, it's may be a >> >> universal win to discard clean MADV_FREE pages when migrating if there are >> >> already memory pressure on the target node. For example, if the free >> >> memory on the target node is lower than high watermark? >> > >> > This is already happening because if the target node is short on memory >> > it will start to reclaim and if MADV_FREE pages are at the tail of >> > inactive file LRU list then they will be dropped. Please note how that >> > follows proper aging and doesn't introduce any special casing. Really >> > MADV_FREE is an inactive cache for anonymous memory and we treat it like >> > inactive page cache. This is not carved in stone of course but it really >> > requires very good justification to change. >> >> If my understanding were correct, the newly migrated clean MADV_FREE >> pages will be put at the head of inactive file LRU list instead of the >> tail. So it's possible that some useful file cache pages will be >> reclaimed. > > This is the case also when you migrate other pages, right? We simply > cannot preserve the aging. So you consider the priority of the clean MADV_FREE pages is same as that of page cache pages? 
Because the penalty difference is so large, wouldn't it be a good idea to always put clean MADV_FREE pages at the tail of the inactive file LRU list? Best Regards, Huang, Ying
On Tue, Mar 03, 2020 at 09:51:56AM +0800, Huang, Ying wrote: > Mel Gorman <mgorman@suse.de> writes: > > On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote: > >> If some applications cannot tolerate the latency incurred by the memory > >> allocation and zeroing. Then we cannot discard instead of migrate > >> always. While in some situations, less memory pressure can help. So > >> it's better to let the administrator and the application choose the > >> right behavior in the specific situation? > >> > > > > Is there an application you have in mind that benefits from discarding > > MADV_FREE pages instead of migrating them? > > > > Allowing the administrator or application to tune this would be very > > problematic. An application would require an update to the system call > > to take advantage of it and then detect if the running kernel supports > > it. An administrator would have to detect that MADV_FREE pages are being > > prematurely discarded leading to a slowdown and that is hard to detect. > > It could be inferred from monitoring compaction stats and checking > > if compaction activity is correlated with higher minor faults in the > > target application. Proving the correlation would require using the perf > > software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses > > to MADV_FREE regions that were freed prematurely. That is not an obvious > > debugging step to take when an application detects latency spikes. > > > > Now, you could add a counter specifically for MADV_FREE pages freed for > > reasons other than memory pressure and hope the administrator knows about > > the counter and what it means. That type of knowledge could take a long > > time to spread so it's really very important that there is evidence of > > an application that suffers due to the current MADV_FREE and migration > > behaviour. > > OK. I understand that this patchset isn't a universal win, so we need > some way to justify it. I will try to find some application for that. > > Another thought, as proposed by David Hildenbrand, it's may be a > universal win to discard clean MADV_FREE pages when migrating if there are > already memory pressure on the target node. For example, if the free > memory on the target node is lower than high watermark? > That is an extremely specific corner case that is not likely to occur. NUMA balancing is not going to migrate a MADV_FREE page under these circumstances, as a write cancels MADV_FREE and a read attempt will probably fail to allocate a destination page in alloc_misplaced_dst_page, so the data gets lost instead of remaining remote. sys_movepages is a possibility but the chances of an application deliberately trying to migrate to a loaded node are low. Compaction never migrates cross-node so the state of a remote node under pressure does not matter. Once again, there needs to be a reasonable use case to be able to meaningfully balance between the benefits and risks of changing the MADV_FREE semantics.
Mel Gorman <mgorman@suse.de> writes: > On Tue, Mar 03, 2020 at 09:51:56AM +0800, Huang, Ying wrote: >> Mel Gorman <mgorman@suse.de> writes: >> > On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote: >> >> If some applications cannot tolerate the latency incurred by the memory >> >> allocation and zeroing. Then we cannot discard instead of migrate >> >> always. While in some situations, less memory pressure can help. So >> >> it's better to let the administrator and the application choose the >> >> right behavior in the specific situation? >> >> >> > >> > Is there an application you have in mind that benefits from discarding >> > MADV_FREE pages instead of migrating them? >> > >> > Allowing the administrator or application to tune this would be very >> > problematic. An application would require an update to the system call >> > to take advantage of it and then detect if the running kernel supports >> > it. An administrator would have to detect that MADV_FREE pages are being >> > prematurely discarded leading to a slowdown and that is hard to detect. >> > It could be inferred from monitoring compaction stats and checking >> > if compaction activity is correlated with higher minor faults in the >> > target application. Proving the correlation would require using the perf >> > software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses >> > to MADV_FREE regions that were freed prematurely. That is not an obvious >> > debugging step to take when an application detects latency spikes. >> > >> > Now, you could add a counter specifically for MADV_FREE pages freed for >> > reasons other than memory pressure and hope the administrator knows about >> > the counter and what it means. That type of knowledge could take a long >> > time to spread so it's really very important that there is evidence of >> > an application that suffers due to the current MADV_FREE and migration >> > behaviour. >> >> OK. I understand that this patchset isn't a universal win, so we need >> some way to justify it. I will try to find some application for that. >> >> Another thought, as proposed by David Hildenbrand, it's may be a >> universal win to discard clean MADV_FREE pages when migrating if there are >> already memory pressure on the target node. For example, if the free >> memory on the target node is lower than high watermark? >> > > That is an extremely specific corner case that is not likely to occur. > NUMA balancing is not going to migrate a MADV_FREE page under these > circumstances as a write cancels MADV_FREE is read attempt will probably > fail to allocate a destination page in alloc_misplaced_dst_page so the > data gets lost instead of remaining remote. sys_movepages is a possibility > but the circumstances of an application delibertly trying to migrate to > a loaded node is low. Compaction never migrates cross-node so the state > of a remote node under pressure do not matter. > > Once again, there needs to be a reasonable use case to be able to > meaningfully balance between the benefits and risks of changing the > MADV_FREE semantics. OK. Will try to find some workloads for this. Best Regards, Huang, Ying
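For context on the sys_movepages path Mel mentions: an application that deliberately migrates its own pages to a specific node would normally go through move_pages(2), for example via libnuma's numaif.h. A rough sketch, assuming libnuma is installed and that node 1 is the intended (possibly loaded) target; none of this is taken from the patchset.

    /* Build with: gcc ... -lnuma */
    #include <numaif.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NPAGES 4

    int main(void)
    {
        long psz = sysconf(_SC_PAGESIZE);
        char *buf = mmap(NULL, NPAGES * psz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        void *pages[NPAGES];
        int nodes[NPAGES], status[NPAGES];

        if (buf == MAP_FAILED)
            return 1;

        for (int i = 0; i < NPAGES; i++) {
            buf[i * psz] = 1;          /* fault the page in */
            pages[i] = buf + i * psz;
            nodes[i] = 1;              /* assumed migration target node */
        }

        /* pid 0 means the calling process; MPOL_MF_MOVE moves only pages
         * exclusive to this process. */
        if (move_pages(0, NPAGES, pages, nodes, status, MPOL_MF_MOVE) < 0)
            perror("move_pages");

        for (int i = 0; i < NPAGES; i++)
            printf("page %d -> status %d\n", i, status[i]);

        munmap(buf, NPAGES * psz);
        return 0;
    }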
On Tue 03-03-20 19:49:53, Huang, Ying wrote: > Michal Hocko <mhocko@kernel.org> writes: > > > On Tue 03-03-20 16:47:54, Huang, Ying wrote: > >> Michal Hocko <mhocko@kernel.org> writes: > >> > >> > On Tue 03-03-20 09:51:56, Huang, Ying wrote: > >> >> Mel Gorman <mgorman@suse.de> writes: > >> >> > On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote: > >> >> >> If some applications cannot tolerate the latency incurred by the memory > >> >> >> allocation and zeroing. Then we cannot discard instead of migrate > >> >> >> always. While in some situations, less memory pressure can help. So > >> >> >> it's better to let the administrator and the application choose the > >> >> >> right behavior in the specific situation? > >> >> >> > >> >> > > >> >> > Is there an application you have in mind that benefits from discarding > >> >> > MADV_FREE pages instead of migrating them? > >> >> > > >> >> > Allowing the administrator or application to tune this would be very > >> >> > problematic. An application would require an update to the system call > >> >> > to take advantage of it and then detect if the running kernel supports > >> >> > it. An administrator would have to detect that MADV_FREE pages are being > >> >> > prematurely discarded leading to a slowdown and that is hard to detect. > >> >> > It could be inferred from monitoring compaction stats and checking > >> >> > if compaction activity is correlated with higher minor faults in the > >> >> > target application. Proving the correlation would require using the perf > >> >> > software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses > >> >> > to MADV_FREE regions that were freed prematurely. That is not an obvious > >> >> > debugging step to take when an application detects latency spikes. > >> >> > > >> >> > Now, you could add a counter specifically for MADV_FREE pages freed for > >> >> > reasons other than memory pressure and hope the administrator knows about > >> >> > the counter and what it means. That type of knowledge could take a long > >> >> > time to spread so it's really very important that there is evidence of > >> >> > an application that suffers due to the current MADV_FREE and migration > >> >> > behaviour. > >> >> > >> >> OK. I understand that this patchset isn't a universal win, so we need > >> >> some way to justify it. I will try to find some application for that. > >> >> > >> >> Another thought, as proposed by David Hildenbrand, it's may be a > >> >> universal win to discard clean MADV_FREE pages when migrating if there are > >> >> already memory pressure on the target node. For example, if the free > >> >> memory on the target node is lower than high watermark? > >> > > >> > This is already happening because if the target node is short on memory > >> > it will start to reclaim and if MADV_FREE pages are at the tail of > >> > inactive file LRU list then they will be dropped. Please note how that > >> > follows proper aging and doesn't introduce any special casing. Really > >> > MADV_FREE is an inactive cache for anonymous memory and we treat it like > >> > inactive page cache. This is not carved in stone of course but it really > >> > requires very good justification to change. > >> > >> If my understanding were correct, the newly migrated clean MADV_FREE > >> pages will be put at the head of inactive file LRU list instead of the > >> tail. So it's possible that some useful file cache pages will be > >> reclaimed. > > > > This is the case also when you migrate other pages, right? 
We simply > > cannot preserve the aging. > > So you consider the priority of the clean MADV_FREE pages is same as > that of page cache pages? This is how MADV_FREE has been implemented, yes. See f7ad2a6cb9f7 ("mm: move MADV_FREE pages into LRU_INACTIVE_FILE list") for the justification. > Because the penalty difference is so large, I > think it may be a good idea to always put clean MADV_FREE pages at the > tail of the inactive file LRU list? You are again making assumptions without giving any actual real examples. The cost of reconstructing MADV_FREE pages can differ a lot. This really depends on the specific usecase. Moving pages to the tail of the LRU would make them the primary candidate for reclaim with a strange LIFO semantic. Adding them to the head might not be the universal win but it will at least provide a reasonable FIFO semantic. I also find it much easier to reason about MADV_FREE as an inactive cache.
On Wed, Mar 04, 2020 at 10:58:02AM +0100, Michal Hocko wrote: > > >> If my understanding were correct, the newly migrated clean MADV_FREE > > >> pages will be put at the head of inactive file LRU list instead of the > > >> tail. So it's possible that some useful file cache pages will be > > >> reclaimed. > > > > > > This is the case also when you migrate other pages, right? We simply > > > cannot preserve the aging. > > > > So you consider the priority of the clean MADV_FREE pages is same as > > that of page cache pages? > > This is how MADV_FREE has been implemented, yes. See f7ad2a6cb9f7 ("mm: > move MADV_FREE pages into LRU_INACTIVE_FILE list") for the > justification. > > > Because the penalty difference is so large, I > > think it may be a good idea to always put clean MADV_FREE pages at the > > tail of the inactive file LRU list? > > You are again making assumptions without giving any actual real > examples. Reconstructing MADV_FREE pages cost can differ a lot. This > really depends on the specific usecase. Moving pages to the tail of LRU > would make them the primary candidate for the reclaim with a strange > LIFO semantic. Adding them to the head might be not the universal win > but it will at least provide a reasonable FIFO semantic. I also find > it much more easier to reason about MADV_FREE as an inactive cache. I tend to agree; that would make MADV_FREE behave more like a PageReclaim page that gets tagged for immediate reclaim when writeback completes. Immediate reclaim is in response to heavy memory pressure where there is trouble finding clean file pages to reclaim and dirty/writeback pages are getting artificially preserved over hot-but-clean file pages. That is a clear inversion of the order pages should be reclaimed and is justified. While there *might* be a basis for reclaiming MADV_FREE sooner rather than later, there would have to be some evidence of a page inversion problem where a known hot page was getting reclaimed before MADV_FREE pages. For example, it could easily be considered a bug to free MADV_FREE pages over a page that was last touched at boot time. About the only real concern I could find about MADV_FREE is that it keeps RSS artificially high relative to MADV_DONTNEED in the absence of memory pressure. In some cases userspace provided a way of switching to MADV_DONTNEED at startup time to determine if there is a memory leak or just MADV_FREE keeping pages resident. They probably would have benefitted from a counter noting the number of MADV_FREE pages in the system, as opposed to the vmstat event, or some other way of distinguishing real RSS from MADV_FREE. However, I can't find a bug report indicating that MADV_FREE pages were pushing hot pages out to disk (be it file-backed or anonymous).
Michal Hocko <mhocko@kernel.org> writes: > On Tue 03-03-20 19:49:53, Huang, Ying wrote: >> Michal Hocko <mhocko@kernel.org> writes: >> >> > On Tue 03-03-20 16:47:54, Huang, Ying wrote: >> >> Michal Hocko <mhocko@kernel.org> writes: >> >> >> >> > On Tue 03-03-20 09:51:56, Huang, Ying wrote: >> >> >> Mel Gorman <mgorman@suse.de> writes: >> >> >> > On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote: >> >> >> >> If some applications cannot tolerate the latency incurred by the memory >> >> >> >> allocation and zeroing. Then we cannot discard instead of migrate >> >> >> >> always. While in some situations, less memory pressure can help. So >> >> >> >> it's better to let the administrator and the application choose the >> >> >> >> right behavior in the specific situation? >> >> >> >> >> >> >> > >> >> >> > Is there an application you have in mind that benefits from discarding >> >> >> > MADV_FREE pages instead of migrating them? >> >> >> > >> >> >> > Allowing the administrator or application to tune this would be very >> >> >> > problematic. An application would require an update to the system call >> >> >> > to take advantage of it and then detect if the running kernel supports >> >> >> > it. An administrator would have to detect that MADV_FREE pages are being >> >> >> > prematurely discarded leading to a slowdown and that is hard to detect. >> >> >> > It could be inferred from monitoring compaction stats and checking >> >> >> > if compaction activity is correlated with higher minor faults in the >> >> >> > target application. Proving the correlation would require using the perf >> >> >> > software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses >> >> >> > to MADV_FREE regions that were freed prematurely. That is not an obvious >> >> >> > debugging step to take when an application detects latency spikes. >> >> >> > >> >> >> > Now, you could add a counter specifically for MADV_FREE pages freed for >> >> >> > reasons other than memory pressure and hope the administrator knows about >> >> >> > the counter and what it means. That type of knowledge could take a long >> >> >> > time to spread so it's really very important that there is evidence of >> >> >> > an application that suffers due to the current MADV_FREE and migration >> >> >> > behaviour. >> >> >> >> >> >> OK. I understand that this patchset isn't a universal win, so we need >> >> >> some way to justify it. I will try to find some application for that. >> >> >> >> >> >> Another thought, as proposed by David Hildenbrand, it's may be a >> >> >> universal win to discard clean MADV_FREE pages when migrating if there are >> >> >> already memory pressure on the target node. For example, if the free >> >> >> memory on the target node is lower than high watermark? >> >> > >> >> > This is already happening because if the target node is short on memory >> >> > it will start to reclaim and if MADV_FREE pages are at the tail of >> >> > inactive file LRU list then they will be dropped. Please note how that >> >> > follows proper aging and doesn't introduce any special casing. Really >> >> > MADV_FREE is an inactive cache for anonymous memory and we treat it like >> >> > inactive page cache. This is not carved in stone of course but it really >> >> > requires very good justification to change. >> >> >> >> If my understanding were correct, the newly migrated clean MADV_FREE >> >> pages will be put at the head of inactive file LRU list instead of the >> >> tail. 
So it's possible that some useful file cache pages will be >> >> reclaimed. >> > >> > This is the case also when you migrate other pages, right? We simply >> > cannot preserve the aging. >> >> So you consider the priority of the clean MADV_FREE pages is same as >> that of page cache pages? > > This is how MADV_FREE has been implemented, yes. See f7ad2a6cb9f7 ("mm: > move MADV_FREE pages into LRU_INACTIVE_FILE list") for the > justification. Thanks for the information. It's really helpful! >> Because the penalty difference is so large, I >> think it may be a good idea to always put clean MADV_FREE pages at the >> tail of the inactive file LRU list? > > You are again making assumptions without giving any actual real > examples. Reconstructing MADV_FREE pages cost can differ a lot. In which situation can the cost to reconstruct MADV_FREE pages be higher than the cost to allocate a file cache page and read it from disk? Heavy contention on mmap_sem? > This really depends on the specific usecase. Moving pages to the tail > of LRU would make them the primary candidate for the reclaim with a > strange LIFO semantic. Adding them to the head might be not the > universal win but it will at least provide a reasonable FIFO > semantic. I also find it much more easier to reason about MADV_FREE as > an inactive cache. Yes. FIFO is more reasonable than LIFO. Best Regards, Huang, Ying
On Wed 04-03-20 19:15:20, Huang, Ying wrote: > Michal Hocko <mhocko@kernel.org> writes: > > > On Tue 03-03-20 19:49:53, Huang, Ying wrote: > [...] > >> Because the penalty difference is so large, I > >> think it may be a good idea to always put clean MADV_FREE pages at the > >> tail of the inactive file LRU list? > > > > You are again making assumptions without giving any actual real > > examples. Reconstructing MADV_FREE pages cost can differ a lot. > > In which situation the cost to reconstruct MADV_FREE pages can be higher > than the cost to allocate file cache page and read from disk? Heavy > contention on mmap_sem? Allocating a page might be really costly, but even without that, reconstructing the content of the page can be really expensive, actually much more expensive than an IO to get a page from the storage. Consider decompression or some other transformations to get the content. Userspace has means to detect that the content is still up-to-date, as has already been mentioned, so all those steps can be avoided.
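One way userspace can make the "is the content still up-to-date" check Michal refers to, for a cache of expensively reconstructed (e.g. decompressed) data that has been MADV_FREE'd, is to keep a nonzero marker byte at a reserved offset in each page and test it on reuse. This is only a sketch: page_still_valid(), cache_release_page(), CACHE_PAGE_SIZE and the marker convention are illustrative and not part of the patchset, and a real cache might prefer checksums kept outside the lazily freed region. It assumes page-aligned pages in an anonymous private mapping.

    #include <sys/mman.h>

    #define CACHE_PAGE_SIZE 4096
    #define MARKER_OFF      (CACHE_PAGE_SIZE - 1)   /* byte reserved for the marker */

    /* Called when the cache page is no longer needed: tag it as "content
     * present", then let the kernel reclaim it lazily. */
    void cache_release_page(unsigned char *page)
    {
        page[MARKER_OFF] = 1;
        madvise(page, CACHE_PAGE_SIZE, MADV_FREE);
    }

    /* On reuse: returns 1 if the old contents survived, 0 if the kernel
     * discarded the page and the content must be reconstructed. */
    int page_still_valid(unsigned char *page)
    {
        /* Dirty the page first; the write cancels the lazy free, so the
         * contents can no longer be discarded under us.  If the page was
         * already discarded, this write faults in a fresh zeroed page and
         * the marker below reads back 0. */
        *(volatile unsigned char *)page = *(volatile unsigned char *)page;

        return page[MARKER_OFF] != 0;
    }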
Mel Gorman <mgorman@suse.de> writes: > On Wed, Mar 04, 2020 at 10:58:02AM +0100, Michal Hocko wrote: >> > >> If my understanding were correct, the newly migrated clean MADV_FREE >> > >> pages will be put at the head of inactive file LRU list instead of the >> > >> tail. So it's possible that some useful file cache pages will be >> > >> reclaimed. >> > > >> > > This is the case also when you migrate other pages, right? We simply >> > > cannot preserve the aging. >> > >> > So you consider the priority of the clean MADV_FREE pages is same as >> > that of page cache pages? >> >> This is how MADV_FREE has been implemented, yes. See f7ad2a6cb9f7 ("mm: >> move MADV_FREE pages into LRU_INACTIVE_FILE list") for the >> justification. >> >> > Because the penalty difference is so large, I >> > think it may be a good idea to always put clean MADV_FREE pages at the >> > tail of the inactive file LRU list? >> >> You are again making assumptions without giving any actual real >> examples. Reconstructing MADV_FREE pages cost can differ a lot. This >> really depends on the specific usecase. Moving pages to the tail of LRU >> would make them the primary candidate for the reclaim with a strange >> LIFO semantic. Adding them to the head might be not the universal win >> but it will at least provide a reasonable FIFO semantic. I also find >> it much more easier to reason about MADV_FREE as an inactive cache. > > I tend to agree, that would make MADV_FREE behave more like a > PageReclaim page that gets tagged for immediate reclaim when writeback > completes. Immediate reclaim is in response to heavy memory pressure where > there is trouble finding clean file pages to reclaim and dirty/writeback > pages are getting artifically preserved over hot-but-clean file > pages. That is a clear inversion of the order pages should be reclaimed > and is justified. While there *might* be a basis for reclaiming MADV_FREE > sooner rather than later, there would have to be some evidence of a Page > inversion problem where a known hot page was getting reclaimed before > MADV_FREE pages. For example, it could easily be considered a bug to free > MADV_FREE pages over a page that was last touched at boot time. Yes. This sounds reasonable. Although the current solution isn't perfect, it can avoid the really bad situation as above. Best Regards, Huang, Ying
Michal Hocko <mhocko@kernel.org> writes: > On Wed 04-03-20 19:15:20, Huang, Ying wrote: >> Michal Hocko <mhocko@kernel.org> writes: >> >> > On Tue 03-03-20 19:49:53, Huang, Ying wrote: > [...] >> >> Because the penalty difference is so large, I >> >> think it may be a good idea to always put clean MADV_FREE pages at the >> >> tail of the inactive file LRU list? >> > >> > You are again making assumptions without giving any actual real >> > examples. Reconstructing MADV_FREE pages cost can differ a lot. >> >> In which situation the cost to reconstruct MADV_FREE pages can be higher >> than the cost to allocate file cache page and read from disk? Heavy >> contention on mmap_sem? > > Allocating a page might be really costly but even without that. A file cache page needs to be allocated too. > Reconstructnig the content of the page can be really high and actually > much larger than an IO to get a page from the storage. Consider > decompression or some other transformations to get the content. > Userspace has means to detect that the content is still up-to-date as > already has been mentioned so all those steps can be avoided. Although I don't believe anyone has really done this, it's possible, at least in theory. Best Regards, Huang, Ying
On Wed, Mar 04, 2020 at 07:15:20PM +0800, Huang, Ying wrote: > In which situation the cost to reconstruct MADV_FREE pages can be higher > than the cost to allocate file cache page and read from disk? Heavy > contention on mmap_sem?

MADV_FREE should be anonymous only:

    if (behavior == MADV_FREE)
        return madvise_free_single_vma(vma, start, end);

    .....

    static int madvise_free_single_vma(struct vm_area_struct *vma,
                unsigned long start_addr, unsigned long end_addr)
    {
        struct mm_struct *mm = vma->vm_mm;
        struct mmu_notifier_range range;
        struct mmu_gather tlb;

        /* MADV_FREE works for only anon vma at the moment */
        if (!vma_is_anonymous(vma))
            return -EINVAL;

So the question is not applicable. For anonymous memory, the cost of updating a PTE is lower than allocating a page, zeroing it and updating the PTE. It has been repeatedly stated now for almost a week that a semantic change to MADV_FREE should be based on a problem encountered by a real application that can benefit from the new semantics. I think the only concrete outcome has been that userspace potentially benefits if the total number of MADV_FREE pages is reported globally. Even that is marginal as smaps has the information to tell the difference between high RSS due to a memory leak and high RSS usage due to MADV_FREE. The /proc/vmstat counters for MADV_FREE are of marginal benefit given that they do not tell us much about the current number of MADV_FREE pages in the system.
Mel Gorman <mgorman@suse.de> writes: > On Wed, Mar 04, 2020 at 07:15:20PM +0800, Huang, Ying wrote: >> In which situation the cost to reconstruct MADV_FREE pages can be higher >> than the cost to allocate file cache page and read from disk? Heavy >> contention on mmap_sem? >> > > MADV_FREE should be anonymous only > > if (behavior == MADV_FREE) > return madvise_free_single_vma(vma, start, end); > > ..... > > static int madvise_free_single_vma(struct vm_area_struct *vma, > unsigned long start_addr, unsigned long end_addr) > { > struct mm_struct *mm = vma->vm_mm; > struct mmu_notifier_range range; > struct mmu_gather tlb; > > /* MADV_FREE works for only anon vma at the moment */ > if (!vma_is_anonymous(vma)) > return -EINVAL > > So the question is not applicable. For anonymous memory, the cost of > updating a PTE is lower than allocating a page, zeroing it and updating > the PTE. Sorry for confusing. The original question is NOT about comparing the reconstruction cost between MADV_FREE anon pages and MADV_FREE file pages, BUT about comparing the reconstruction cost between MADV_FREE anon pages and ordinary clean file cache pages. You can find more details in conversation between Michal and me. > It has been repeatedly stated now for almost a week that a semantic > change to MADV_FREE should be based on a problem encountered by a real > application that can benefit from the new semantics. I think the only > concrete outcome has been that userspace potentially benefits if the total > number of MADV_FREE pages is reported globally. Even that is marginal as > smaps has the information to tell the difference between high RSS due to > a memory leak and high RSS usage due to MADV_FREE. The /proc/vmstats for > MADV_FREE are of marginal benefit given that they do not tell us much > about the current number of MADV_FREE pages in the system. Got it! Thanks a lot for your patience and sharing. That's very helpful. Best Regards, Huang, Ying
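The smaps information Mel refers to is, as far as I know, the per-VMA "LazyFree:" field in /proc/<pid>/smaps, which reports how much of a mapping is currently marked by madvise(MADV_FREE). A small sketch that sums it for the current process; error handling is minimal and the field is assumed to have the usual "LazyFree: <n> kB" form.

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/self/smaps", "r");
        char line[256];
        unsigned long kb, lazyfree_kb = 0;

        if (!f) {
            perror("fopen");
            return 1;
        }

        while (fgets(line, sizeof(line), f)) {
            /* Per-VMA field used for MADV_FREE accounting. */
            if (sscanf(line, "LazyFree: %lu kB", &kb) == 1)
                lazyfree_kb += kb;
        }
        fclose(f);

        printf("LazyFree (MADV_FREE) memory: %lu kB\n", lazyfree_kb);
        return 0;
    }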
Mel Gorman <mgorman@suse.de> writes: > I think the only concrete outcome has been that userspace potentially > benefits if the total number of MADV_FREE pages is reported > globally. Even that is marginal as smaps has the information to tell > the difference between high RSS due to a memory leak and high RSS > usage due to MADV_FREE. The /proc/vmstats for MADV_FREE are of > marginal benefit given that they do not tell us much about the current > number of MADV_FREE pages in the system. We can implement a counter for MADV_FREE that increases when we ClearPageSwapBacked() and decreases when we SetPageSwapBacked() for an anonymous page. But this cannot count lazily freed pages correctly, because when a clean MADV_FREE page becomes dirty there's no page fault, so we will not be notified. And you have never run into MADV_FREE issues other than the memory leak debugging case... Best Regards, Huang, Ying
From: Huang Ying <ying.huang@intel.com>

MADV_FREE is a lazy free mechanism in Linux. According to the manpage of madvise(2), the semantics of MADV_FREE are:

    The application no longer requires the pages in the range specified by addr and len. The kernel can thus free these pages, but the freeing could be delayed until memory pressure occurs. ...

Originally, the pages freed lazily by MADV_FREE are only really freed by page reclaim when there is memory pressure, or when the address range is unmapped. In addition to that, there is another opportunity to really free these pages: when we try to migrate them.

The main value of doing that is to avoid creating new memory pressure immediately if possible. Instead, even if the pages are required again, they will be allocated gradually on demand. That is, the memory will be allocated lazily when necessary. This follows the common philosophy in the Linux kernel of allocating resources lazily, on demand.

This patchset implements this, in addition to some cleanup of the migration and MADV_FREE code.

Best Regards, Huang, Ying
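For reference, the basic userspace pattern the cover letter describes looks roughly like the sketch below. It is anonymous private memory only; the buffer size is arbitrary and nothing here is taken from the patchset itself.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 64 * 4096;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        memset(buf, 0xaa, len);          /* populate and use the pages */

        /* The pages may now be freed lazily, whenever the kernel decides:
         * today under memory pressure, and with this patchset also when
         * the pages are migrated. */
        if (madvise(buf, len, MADV_FREE))
            perror("madvise(MADV_FREE)");

        /* Reusing the memory just works: a write either finds the old page
         * still present or faults in a fresh zeroed page. */
        buf[0] = 1;

        munmap(buf, len);
        return 0;
    }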