
[RFC,0/3] mm: Discard lazily freed pages when migrating

Message ID: 20200228033819.3857058-1-ying.huang@intel.com

Message

Huang, Ying Feb. 28, 2020, 3:38 a.m. UTC
From: Huang Ying <ying.huang@intel.com>

MADV_FREE is a lazy free mechanism in Linux.  According to the man page
of madvise(2), the semantics of MADV_FREE are:

  The application no longer requires the pages in the range specified
  by addr and len.  The kernel can thus free these pages, but the
  freeing could be delayed until memory pressure occurs. ...

Currently, the pages freed lazily by MADV_FREE are only actually freed
by page reclaim when there is memory pressure, or when the address range
is unmapped.  This patchset adds another opportunity to actually free
these pages: when we try to migrate them.

The main value of doing so is to avoid creating new memory pressure
immediately, if possible.  Instead, even if the pages are required
again, they will be allocated gradually, on demand.  That is, the memory
will be allocated lazily when necessary.  This follows a common
philosophy in the Linux kernel: allocate resources lazily, on demand.

This patchset implements this, in addition to some cleanups to the
migration and MADV_FREE code.
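
For reference, below is a minimal userspace sketch (not part of this
patchset; error handling omitted, buffer size and contents arbitrary) of
the MADV_FREE usage pattern being discussed:

  #include <string.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t len = 64 << 20;          /* 64 MB anonymous buffer */
          char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          memset(buf, 0xab, len);         /* populate and use the buffer */

          /* The contents are no longer needed; the kernel may free the
           * pages later, e.g. under memory pressure. */
          madvise(buf, len, MADV_FREE);

          /* Reuse is still legal: a write dirties the pages again and
           * cancels the lazy free; a read may see old data or zeroes. */
          memset(buf, 0xcd, len);

          munmap(buf, len);
          return 0;
  }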

Best Regards,
Huang, Ying

Comments

Matthew Wilcox Feb. 28, 2020, 3:42 a.m. UTC | #1
On Fri, Feb 28, 2020 at 11:38:16AM +0800, Huang, Ying wrote:
> MADV_FREE is a lazy free mechanism in Linux.  According to the manpage
> of mavise(2), the semantics of MADV_FREE is,
> 
>   The application no longer requires the pages in the range specified
>   by addr and len.  The kernel can thus free these pages, but the
>   freeing could be delayed until memory pressure occurs. ...
> 
> Originally, the pages freed lazily by MADV_FREE will only be freed
> really by page reclaiming when there is memory pressure or when
> unmapping the address range.  In addition to that, there's another
> opportunity to free these pages really, when we try to migrate them.
> 
> The main value to do that is to avoid to create the new memory
> pressure immediately if possible.  Instead, even if the pages are
> required again, they will be allocated gradually on demand.  That is,
> the memory will be allocated lazily when necessary.  This follows the
> common philosophy in the Linux kernel, allocate resources lazily on
> demand.

Do you have an example program which does this (and so benefits)?
If so, can you quantify the benefit at all?
Huang, Ying Feb. 28, 2020, 7:25 a.m. UTC | #2
Hi, Matthew,

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Feb 28, 2020 at 11:38:16AM +0800, Huang, Ying wrote:
>> MADV_FREE is a lazy free mechanism in Linux.  According to the manpage
>> of mavise(2), the semantics of MADV_FREE is,
>> 
>>   The application no longer requires the pages in the range specified
>>   by addr and len.  The kernel can thus free these pages, but the
>>   freeing could be delayed until memory pressure occurs. ...
>> 
>> Originally, the pages freed lazily by MADV_FREE will only be freed
>> really by page reclaiming when there is memory pressure or when
>> unmapping the address range.  In addition to that, there's another
>> opportunity to free these pages really, when we try to migrate them.
>> 
>> The main value to do that is to avoid to create the new memory
>> pressure immediately if possible.  Instead, even if the pages are
>> required again, they will be allocated gradually on demand.  That is,
>> the memory will be allocated lazily when necessary.  This follows the
>> common philosophy in the Linux kernel, allocate resources lazily on
>> demand.
>
> Do you have an example program which does this (and so benefits)?

Sorry, what exactly do you mean by "this" here?  Calling
madvise(,,MADV_FREE)?  Or migrating pages?

> If so, can you quantify the benefit at all?

The question is what the right workload is.  For example, I can
construct a scenario like the one below to show a benefit.

- run program A on node 0 with many lazily freed pages

- run program B on node 1, so that the free memory on node 1 is low

- migrate program A from node 0 to node 1, so that program B is
  affected by the memory pressure created by migrating the lazily freed
  pages (see the sketch below).
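
For concreteness, the migration step above could look roughly like the
following sketch, using libnuma's wrapper around the migrate_pages(2)
syscall (link with -lnuma; the pid is assumed to be program A's, and
error handling is mostly omitted):

  #include <numa.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
          int pid = atoi(argv[1]);        /* pid of "program A" */
          struct bitmask *from = numa_parse_nodestring("0");
          struct bitmask *to = numa_parse_nodestring("1");

          /* Move all of the process's pages currently on node 0 to
           * node 1; clean MADV_FREE pages are migrated as well (or,
           * with this patchset, discarded instead). */
          if (numa_migrate_pages(pid, from, to) < 0)
                  perror("numa_migrate_pages");

          numa_free_nodemask(from);
          numa_free_nodemask(to);
          return 0;
  }

This is roughly what the migratepages(8) tool does under the hood.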

Best Regards,
Huang, Ying
David Hildenbrand Feb. 28, 2020, 8:22 a.m. UTC | #3
On 28.02.20 08:25, Huang, Ying wrote:
> Hi, Matthew,
> 
> Matthew Wilcox <willy@infradead.org> writes:
> 
>> On Fri, Feb 28, 2020 at 11:38:16AM +0800, Huang, Ying wrote:
>>> MADV_FREE is a lazy free mechanism in Linux.  According to the manpage
>>> of mavise(2), the semantics of MADV_FREE is,
>>>
>>>   The application no longer requires the pages in the range specified
>>>   by addr and len.  The kernel can thus free these pages, but the
>>>   freeing could be delayed until memory pressure occurs. ...
>>>
>>> Originally, the pages freed lazily by MADV_FREE will only be freed
>>> really by page reclaiming when there is memory pressure or when
>>> unmapping the address range.  In addition to that, there's another
>>> opportunity to free these pages really, when we try to migrate them.
>>>
>>> The main value to do that is to avoid to create the new memory
>>> pressure immediately if possible.  Instead, even if the pages are
>>> required again, they will be allocated gradually on demand.  That is,
>>> the memory will be allocated lazily when necessary.  This follows the
>>> common philosophy in the Linux kernel, allocate resources lazily on
>>> demand.
>>
>> Do you have an example program which does this (and so benefits)?
> 
> Sorry, what do you mean exactly for "this" here?  Call
> madvise(,,MADV_FREE)?  Or migrate pages?
> 
>> If so, can you quantify the benefit at all?
> 
> The question is what is the right workload?  For example, I can build a
> scenario as below to show benefit.

We usually don't optimize for theoretical issues. Is there a real-life
workload you are trying to optimize this code for?

> 
> - run program A in node 0 with many lazily freed pages
> 
> - run program B in node 1, so that the free memory on node 1 is low
> 
> - migrate the program A from node 0 to node 1, so that the program B is
>   influenced by the memory pressure created by migrating lazily freed
>   pages.
> 

E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will
report currently free pages to the hypervisor, which will MADV_FREE the
reported memory. As long as there is no memory pressure, there is no
need to actually free the pages. Once the guest reuses such a page, it
could happen that the old page is still there, and pulling in a fresh
(zeroed) page can be avoided.

AFAIK, after your change, we would get more pages discarded from our
guest, resulting in more fresh (zeroed) pages having to be pulled in
when the guest touches a reported free page again. But OTOH, page
migration is sped up (it avoids migrating these pages).

However, one important question: will you always discard memory when
migrating pages, or only if there is memory pressure on the migration
target?
Huang, Ying Feb. 28, 2020, 8:55 a.m. UTC | #4
David Hildenbrand <david@redhat.com> writes:

> On 28.02.20 08:25, Huang, Ying wrote:
>> Hi, Matthew,
>> 
>> Matthew Wilcox <willy@infradead.org> writes:
>> 
>>> On Fri, Feb 28, 2020 at 11:38:16AM +0800, Huang, Ying wrote:
>>>> MADV_FREE is a lazy free mechanism in Linux.  According to the manpage
>>>> of mavise(2), the semantics of MADV_FREE is,
>>>>
>>>>   The application no longer requires the pages in the range specified
>>>>   by addr and len.  The kernel can thus free these pages, but the
>>>>   freeing could be delayed until memory pressure occurs. ...
>>>>
>>>> Originally, the pages freed lazily by MADV_FREE will only be freed
>>>> really by page reclaiming when there is memory pressure or when
>>>> unmapping the address range.  In addition to that, there's another
>>>> opportunity to free these pages really, when we try to migrate them.
>>>>
>>>> The main value to do that is to avoid to create the new memory
>>>> pressure immediately if possible.  Instead, even if the pages are
>>>> required again, they will be allocated gradually on demand.  That is,
>>>> the memory will be allocated lazily when necessary.  This follows the
>>>> common philosophy in the Linux kernel, allocate resources lazily on
>>>> demand.
>>>
>>> Do you have an example program which does this (and so benefits)?
>> 
>> Sorry, what do you mean exactly for "this" here?  Call
>> madvise(,,MADV_FREE)?  Or migrate pages?
>> 
>>> If so, can you quantify the benefit at all?
>> 
>> The question is what is the right workload?  For example, I can build a
>> scenario as below to show benefit.
>
> We usually don't optimize for theoretical issues. Is there a real-life
> workload you are trying to optimize this code for?

We didn't use a specific workload because we think this is a general
optimization.  I will explain more later in this email.

>> 
>> - run program A in node 0 with many lazily freed pages
>> 
>> - run program B in node 1, so that the free memory on node 1 is low
>> 
>> - migrate the program A from node 0 to node 1, so that the program B is
>>   influenced by the memory pressure created by migrating lazily freed
>>   pages.
>> 
>
> E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will
> report currently free pages to the hypervisor, which will MADV_FREE the
> reported memory. As long as there is no memory pressure, there is no
> need to actually free the pages. Once the guest reuses such a page, it
> could happen that there is still the old page and pulling in in a fresh
> (zeroed) page can be avoided.
>
> AFAIKs, after your change, we would get more pages discarded from our
> guest, resulting in more fresh (zeroed) pages having to be pulled in
> when a guest touches a reported free page again. But OTOH, page
> migration is speed up (avoiding to migrate these pages).

Let's look at this problem from another perspective.  To migrate the
MADV_FREE pages of the QEMU process from node A to node B, we need to
free the original pages on node A, and (maybe) allocate the same number
of pages on node B.  So the question becomes:

- we may need to allocate some pages on node B
- these pages may or may not be accessed by the application
- should we allocate all these pages in advance, or allocate them lazily
  when they are accessed?

We thought the common philosophy in the Linux kernel is to allocate lazily.

That is, because we will always free the original pages on node A, the
question isn't whether we should free these MADV_FREE pages, but whether
we should allocate the same number of pages on node B before we know
whether they are really needed.  We thought this is similar to the
question of whether we should allocate all physical pages at mmap() time.

> However, one important question, will you always discard memory when
> migrating pages, or only if there is memory pressure on the migration
> target?

We will always discard memory when migrating pages.  Our reasoning is as
above.

Best Regards,
Huang, Ying
Mel Gorman Feb. 28, 2020, 9:49 a.m. UTC | #5
On Fri, Feb 28, 2020 at 04:55:40PM +0800, Huang, Ying wrote:
> > E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will
> > report currently free pages to the hypervisor, which will MADV_FREE the
> > reported memory. As long as there is no memory pressure, there is no
> > need to actually free the pages. Once the guest reuses such a page, it
> > could happen that there is still the old page and pulling in in a fresh
> > (zeroed) page can be avoided.
> >
> > AFAIKs, after your change, we would get more pages discarded from our
> > guest, resulting in more fresh (zeroed) pages having to be pulled in
> > when a guest touches a reported free page again. But OTOH, page
> > migration is speed up (avoiding to migrate these pages).
> 
> Let's look at this problem in another perspective.  To migrate the
> MADV_FREE pages of the QEMU process from the node A to the node B, we
> need to free the original pages in the node A, and (maybe) allocate the
> same number of pages in the node B.  So the question becomes
> 
> - we may need to allocate some pages in the node B
> - these pages may be accessed by the application or not
> - we should allocate all these pages in advance or allocate them lazily
>   when they are accessed.
> 
> We thought the common philosophy in Linux kernel is to allocate lazily.
> 

I also think there needs to be an example of a real application that
benefits from this new behaviour. Consider the possible sources of page
migration

1. NUMA balancing -- The application has to read/write the data for this
   to trigger. In the case of a write, MADV_FREE is cancelled, and it's
   most likely going to be a write unless it's an application bug.
2. sys_movepages -- the application has explicitly stated the data is in
   use on a particular node, yet any MADV_FREE page gets discarded
3. Compaction -- there may be no memory pressure at all but the
   MADV_FREE memory is discarded prematurely

In the first case, the data is explicitly in use, most likely due to
a write, in which case it's inappropriate to discard. Discarding and
reallocating a zeroed page is not expected. In the second case, the data
is likely in use, or else why would the system call be used? In the third
case the timing of when MADV_FREE pages disappear is arbitrary, as it can
happen without any actual memory pressure.  This may or may not be
problematic, but it leads to unpredictable latencies for applications
that use MADV_FREE for a quick malloc/free implementation.  Before, as
long as there is no pressure, the reuse of an MADV_FREE page incurs just
a small penalty, but now, with compaction, it does not matter whether the
system avoids memory pressure, because applications may still incur a
fault to allocate and zero a new page.

There is a hypothetical fourth case which I only mention because of your
email address. If persistent memory is ever used for tiered memory, then
MADV_FREE pages that migrate from DRAM to pmem get discarded instead
of migrated. When such a page is reused, it gets reallocated from DRAM
regardless of whether that region is hot or not.  This may lead to an odd
scenario whereby applications occupy DRAM prematurely due to a single
reference of an MADV_FREE page.

It's all subtle enough that we really should have an example application
in mind that benefits so we can weigh the benefits against the potential
risks.
Michal Hocko Feb. 28, 2020, 9:50 a.m. UTC | #6
On Fri 28-02-20 16:55:40, Huang, Ying wrote:
> David Hildenbrand <david@redhat.com> writes:
[...]
> > E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will
> > report currently free pages to the hypervisor, which will MADV_FREE the
> > reported memory. As long as there is no memory pressure, there is no
> > need to actually free the pages. Once the guest reuses such a page, it
> > could happen that there is still the old page and pulling in in a fresh
> > (zeroed) page can be avoided.
> >
> > AFAIKs, after your change, we would get more pages discarded from our
> > guest, resulting in more fresh (zeroed) pages having to be pulled in
> > when a guest touches a reported free page again. But OTOH, page
> > migration is speed up (avoiding to migrate these pages).
> 
> Let's look at this problem in another perspective.  To migrate the
> MADV_FREE pages of the QEMU process from the node A to the node B, we
> need to free the original pages in the node A, and (maybe) allocate the
> same number of pages in the node B.  So the question becomes
> 
> - we may need to allocate some pages in the node B
> - these pages may be accessed by the application or not
> - we should allocate all these pages in advance or allocate them lazily
>   when they are accessed.
> 
> We thought the common philosophy in Linux kernel is to allocate lazily.

The common philosophy is to cache as much as possible. And MADV_FREE
pages are a kind of cache as well. If the target node is short on memory
then those will be reclaimed as a cache, so proactive freeing sounds
counterproductive, as you do not have any idea whether that cache is
going to be used in the future. In other words, you are not going to free
clean page cache if you want to use that memory as a migration target,
right? So you should make a clear case about why the MADV_FREE cache is
less important than the clean page cache, and ideally have a good
justification backed by real workloads.
Michal Hocko Feb. 28, 2020, 10:15 a.m. UTC | #7
On Fri 28-02-20 10:50:50, Michal Hocko wrote:
> On Fri 28-02-20 16:55:40, Huang, Ying wrote:
> > David Hildenbrand <david@redhat.com> writes:
> [...]
> > > E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will
> > > report currently free pages to the hypervisor, which will MADV_FREE the
> > > reported memory. As long as there is no memory pressure, there is no
> > > need to actually free the pages. Once the guest reuses such a page, it
> > > could happen that there is still the old page and pulling in in a fresh
> > > (zeroed) page can be avoided.
> > >
> > > AFAIKs, after your change, we would get more pages discarded from our
> > > guest, resulting in more fresh (zeroed) pages having to be pulled in
> > > when a guest touches a reported free page again. But OTOH, page
> > > migration is speed up (avoiding to migrate these pages).
> > 
> > Let's look at this problem in another perspective.  To migrate the
> > MADV_FREE pages of the QEMU process from the node A to the node B, we
> > need to free the original pages in the node A, and (maybe) allocate the
> > same number of pages in the node B.  So the question becomes
> > 
> > - we may need to allocate some pages in the node B
> > - these pages may be accessed by the application or not
> > - we should allocate all these pages in advance or allocate them lazily
> >   when they are accessed.
> > 
> > We thought the common philosophy in Linux kernel is to allocate lazily.
> 
> The common philosophy is to cache as much as possible. And MADV_FREE
> pages are a kind of cache as well. If the target node is short on memory
> then those will be reclaimed as a cache so a pro-active freeing sounds
> counter productive as you do not have any idea whether that cache is
> going to be used in future. In other words you are not going to free a
> clean page cache if you want to use that memory as a migration target

Sorry, I meant to say that you are not going to drop clean page cache
when migrating it.

> right? So you should make a clear case about why MADV_FREE cache is less
> important than the clean page cache and ideally have a good
> justification backed by real workloads.
Johannes Weiner Feb. 28, 2020, 1:45 p.m. UTC | #8
On Fri, Feb 28, 2020 at 10:50:48AM +0100, Michal Hocko wrote:
> On Fri 28-02-20 16:55:40, Huang, Ying wrote:
> > David Hildenbrand <david@redhat.com> writes:
> [...]
> > > E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will
> > > report currently free pages to the hypervisor, which will MADV_FREE the
> > > reported memory. As long as there is no memory pressure, there is no
> > > need to actually free the pages. Once the guest reuses such a page, it
> > > could happen that there is still the old page and pulling in in a fresh
> > > (zeroed) page can be avoided.
> > >
> > > AFAIKs, after your change, we would get more pages discarded from our
> > > guest, resulting in more fresh (zeroed) pages having to be pulled in
> > > when a guest touches a reported free page again. But OTOH, page
> > > migration is speed up (avoiding to migrate these pages).
> > 
> > Let's look at this problem in another perspective.  To migrate the
> > MADV_FREE pages of the QEMU process from the node A to the node B, we
> > need to free the original pages in the node A, and (maybe) allocate the
> > same number of pages in the node B.  So the question becomes
> > 
> > - we may need to allocate some pages in the node B
> > - these pages may be accessed by the application or not
> > - we should allocate all these pages in advance or allocate them lazily
> >   when they are accessed.
> > 
> > We thought the common philosophy in Linux kernel is to allocate lazily.
> 
> The common philosophy is to cache as much as possible. And MADV_FREE
> pages are a kind of cache as well. If the target node is short on memory
> then those will be reclaimed as a cache so a pro-active freeing sounds
> counter productive as you do not have any idea whether that cache is
> going to be used in future. In other words you are not going to free a
> clean page cache if you want to use that memory as a migration target
> right? So you should make a clear case about why MADV_FREE cache is less
> important than the clean page cache and ideally have a good
> justification backed by real workloads.

Agreed.

MADV_FREE says that the *data* in the pages is no longer needed, and
so the pages are cheaper to reclaim than regular anon pages (swap).

But people use MADV_FREE in the hope that they can reuse the pages at
a later point - otherwise, they'd unmap them. We should retain them
until the memory is actively needed for other allocations.
Huang, Ying March 2, 2020, 11:23 a.m. UTC | #9
Hi, Mel,

Mel Gorman <mgorman@suse.de> writes:

> On Fri, Feb 28, 2020 at 04:55:40PM +0800, Huang, Ying wrote:
>> > E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will
>> > report currently free pages to the hypervisor, which will MADV_FREE the
>> > reported memory. As long as there is no memory pressure, there is no
>> > need to actually free the pages. Once the guest reuses such a page, it
>> > could happen that there is still the old page and pulling in in a fresh
>> > (zeroed) page can be avoided.
>> >
>> > AFAIKs, after your change, we would get more pages discarded from our
>> > guest, resulting in more fresh (zeroed) pages having to be pulled in
>> > when a guest touches a reported free page again. But OTOH, page
>> > migration is speed up (avoiding to migrate these pages).
>> 
>> Let's look at this problem in another perspective.  To migrate the
>> MADV_FREE pages of the QEMU process from the node A to the node B, we
>> need to free the original pages in the node A, and (maybe) allocate the
>> same number of pages in the node B.  So the question becomes
>> 
>> - we may need to allocate some pages in the node B
>> - these pages may be accessed by the application or not
>> - we should allocate all these pages in advance or allocate them lazily
>>   when they are accessed.
>> 
>> We thought the common philosophy in Linux kernel is to allocate lazily.
>> 
>
> I also think there needs to be an example of a real application that
> benefits from this new behaviour. Consider the possible sources of page
> migration
>
> 1. NUMA balancing -- The application has to read/write the data for this
>    to trigger. In the case of write, MADV_FREE is cancelled and it's
>    mostly likely going to be a write unless it's an application bug.
> 2. sys_movepages -- the application has explictly stated the data is in
>    use on a particular node yet any MADV_FREE page gets discarded
> 3. Compaction -- there may be no memory pressure at all but the
>    MADV_FREE memory is discarded prematurely
>
> In the first case, the data is explicitly in use, most likely due to
> a write in which case it's inappropriate to discard. Discarding and
> reallocating a zero'd page is not expected.

If my understanding is correct, NUMA balancing will not try to migrate
clean MADV_FREE pages most of the time.  Because the lazily freed pages
may be zeroed at any time, it makes almost no sense to read them.  So,
the first access after being freed lazily should be a write.  That will
make the page a dirty MADV_FREE page, so it will not be discarded.  And
the first write access on a new node will usually not trigger migration,
because of the two-stage filter in should_numa_migrate_memory().  So the
current behavior in case 1 will not change most of the time after the
patchset.

> In second case, the data is likely in use or else why would the system
> call be used?

If the pages were in use, they shouldn't be clean MADV_FREE pages.  So
no behavior change after this patchset.

There is at least one use case that moves clean MADV_FREE pages (that
is, pages not in use): when we move an application from one node to
another node via something like /usr/bin/migratepages.  I think this is
a kind of NUMA balancing by hand.  It appears reasonable to discard clean
MADV_FREE pages at least in some situations.

> In the third case the timing of when MADV_FREE pages disappear is
> arbitrary as it can happen without any actual memory pressure.  This
> may or may not be problematic but it leads to unpredictable latencies
> for applications that use MADV_FREE for a quick malloc/free
> implementation.  Before, as long as there is no pressure, the reuse of
> a MADV_FREE incurs just a small penalty but now with compaction it
> does not matter if the system avoids memory pressure because they may
> still incur a fault to allocate and zero a new page.

Yes.  This is a real problem.  Previously we thought of migration as
roughly

- unmap and free the old pages
- map the new pages

If we can allocate new pages lazily in mmap(), why can't we also allocate
new pages lazily when migrating pages, where possible (for clean
MADV_FREE pages), since its second stage is a kind of mmap() too?  But
you remind us that there are some differences:

- mmap() is called by the application directly, so its effect is
  predictable.  Some migrations, like those done by compaction, aren't
  triggered by the application, so they may have unpredictable effects.
- With mmap(), the application can choose to allocate new pages lazily
  or eagerly.  But we lack the same mechanism in this patchset.

So, maybe we can make the mechanism more flexible.  That is, let the
administrator or the application choose the right behavior for them, via
some system-level configuration knob or API flag.  For example, if memory
allocation latency isn't critical for the workloads, the administrator
could choose to discard clean MADV_FREE pages to make compaction quicker
and easier?

> There is a hypothetical fourth case which I only mention because of your
> email address. If persistent memory is ever used for tiered memory then
> MADV_FREE pages that migrate from dram to pmem gets discarded instead
> of migrated. When it's reused, it gets reallocated from dram regardless
> of whether that region is hot or not.

Thanks for mentioning this use case because we are working on this too.

Firstly, the write bandwidth of persistent memory is much lower than its
read bandwidth, so it is very valuable to avoid writing to persistent
memory unnecessarily.  Discarding instead of migrating here can help save
the write bandwidth of the persistent memory.

Secondly, where possible we will always allocate pages in DRAM before
knowing whether they are hot or cold, and only demote the known-cold
pages to persistent memory.  Only when the access pattern of these cold
pages becomes hot will we promote them back to DRAM.  In this way, the
always-hot pages can be kept in DRAM from beginning to end.  After
MADV_FREE, the contents may be zeroed, so the page's access pattern
becomes unknown as well (e.g. free an object, then allocate a different
object).  If we want to follow the above rule, we should put the new
page, whose temperature is unknown, in DRAM to give it an opportunity.

> This may lead to an odd scenario whereby applications occupy dram
> prematurely due to a single reference of a MADV_FREE page.

Perhaps a better way to deal with this is to enhance mem_cgroup to limit
DRAM usage in a memory tiering system?

> It's all subtle enough that we really should have an example application
> in mind that benefits so we can weigh the benefits against the potential
> risks.

If some applications cannot tolerate the latency incurred by the memory
allocation and zeroing, then we cannot always discard instead of
migrating.  But in some situations, less memory pressure can help.  So
isn't it better to let the administrator and the application choose the
right behavior for the specific situation?

Best Regards,
Huang, Ying
Huang, Ying March 2, 2020, 2:12 p.m. UTC | #10
Michal Hocko <mhocko@kernel.org> writes:

> On Fri 28-02-20 16:55:40, Huang, Ying wrote:
>> David Hildenbrand <david@redhat.com> writes:
> [...]
>> > E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will
>> > report currently free pages to the hypervisor, which will MADV_FREE the
>> > reported memory. As long as there is no memory pressure, there is no
>> > need to actually free the pages. Once the guest reuses such a page, it
>> > could happen that there is still the old page and pulling in in a fresh
>> > (zeroed) page can be avoided.
>> >
>> > AFAIKs, after your change, we would get more pages discarded from our
>> > guest, resulting in more fresh (zeroed) pages having to be pulled in
>> > when a guest touches a reported free page again. But OTOH, page
>> > migration is speed up (avoiding to migrate these pages).
>> 
>> Let's look at this problem in another perspective.  To migrate the
>> MADV_FREE pages of the QEMU process from the node A to the node B, we
>> need to free the original pages in the node A, and (maybe) allocate the
>> same number of pages in the node B.  So the question becomes
>> 
>> - we may need to allocate some pages in the node B
>> - these pages may be accessed by the application or not
>> - we should allocate all these pages in advance or allocate them lazily
>>   when they are accessed.
>> 
>> We thought the common philosophy in Linux kernel is to allocate lazily.
>
> The common philosophy is to cache as much as possible.

Yes.  This is another common philosophy.  But MADV_FREE pages are
different from caches such as the page cache because they have no valid
contents.  And this patchset doesn't disable the MADV_FREE mechanism.  It
just changes the migration behavior.  So MADV_FREE pages will still be
kept until reclaim most of the time.

> And MADV_FREE pages are a kind of cache as well. If the target node is
> short on memory then those will be reclaimed as a cache so a
> pro-active freeing sounds counter productive as you do not have any
> idea whether that cache is going to be used in future. In other words
> you are not going to free a clean page cache if you want to use that
> memory as a migration target right? So you should make a clear case
> about why MADV_FREE cache is less important than the clean page cache
> and ideally have a good justification backed by real workloads.

Clean page cache pages still have valid contents, while clean MADV_FREE
pages do not.  So the penalty of discarding clean page cache is reading
from disk, while the penalty of discarding clean MADV_FREE pages is just
page allocation and zeroing.

I understand that MADV_FREE is another kind of cache and has its value.
But in the original implementation, during migration we have already
freed the original "cache", then reallocated it elsewhere and copied.
This looks more like always populating all pages at mmap() time.
I know there's value in populating all pages in mmap(), but does that
need to be done always, by default?

Best Regards,
Huang, Ying
David Hildenbrand March 2, 2020, 2:23 p.m. UTC | #11
On 02.03.20 15:12, Huang, Ying wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
>> On Fri 28-02-20 16:55:40, Huang, Ying wrote:
>>> David Hildenbrand <david@redhat.com> writes:
>> [...]
>>>> E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will
>>>> report currently free pages to the hypervisor, which will MADV_FREE the
>>>> reported memory. As long as there is no memory pressure, there is no
>>>> need to actually free the pages. Once the guest reuses such a page, it
>>>> could happen that there is still the old page and pulling in in a fresh
>>>> (zeroed) page can be avoided.
>>>>
>>>> AFAIKs, after your change, we would get more pages discarded from our
>>>> guest, resulting in more fresh (zeroed) pages having to be pulled in
>>>> when a guest touches a reported free page again. But OTOH, page
>>>> migration is speed up (avoiding to migrate these pages).
>>>
>>> Let's look at this problem in another perspective.  To migrate the
>>> MADV_FREE pages of the QEMU process from the node A to the node B, we
>>> need to free the original pages in the node A, and (maybe) allocate the
>>> same number of pages in the node B.  So the question becomes
>>>
>>> - we may need to allocate some pages in the node B
>>> - these pages may be accessed by the application or not
>>> - we should allocate all these pages in advance or allocate them lazily
>>>   when they are accessed.
>>>
>>> We thought the common philosophy in Linux kernel is to allocate lazily.
>>
>> The common philosophy is to cache as much as possible.
> 
> Yes.  This is another common philosophy.  And MADV_FREE pages is
> different from caches such as the page caches because it has no valid
> contents.

Side note: It might contain valid content until discarded/zeroed out.
E.g., an application could use a marker bit (e.g., first bit) to detect
if the page still contains valid data or not. If the data is still
marked valid, the content could be reused immediately. Not sure if there
is any such application, though :)
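
A rough sketch of that idea (purely hypothetical application code; MARKER
and the helpers are made up for illustration), noting that the check is
only best-effort, since the kernel could still discard the page between
the check and the reuse:

  #include <stddef.h>
  #include <sys/mman.h>

  #define MARKER 0x5a

  static void retire_page(char *page, size_t pagesz)
  {
          page[0] = MARKER;               /* mark the contents as valid */
          madvise(page, pagesz, MADV_FREE);
  }

  /* Returns 1 if the old contents appear to still be there. */
  static int reuse_page(char *page)
  {
          if (page[0] == MARKER) {
                  page[0] = MARKER;       /* the write dirties the page and
                                           * cancels the lazy free */
                  return 1;
          }
          return 0;                       /* discarded; reinitialize */
  }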
Michal Hocko March 2, 2020, 2:25 p.m. UTC | #12
On Mon 02-03-20 22:12:53, Huang, Ying wrote:
> Michal Hocko <mhocko@kernel.org> writes:
[...]
> > And MADV_FREE pages are a kind of cache as well. If the target node is
> > short on memory then those will be reclaimed as a cache so a
> > pro-active freeing sounds counter productive as you do not have any
> > idea whether that cache is going to be used in future. In other words
> > you are not going to free a clean page cache if you want to use that
> > memory as a migration target right? So you should make a clear case
> > about why MADV_FREE cache is less important than the clean page cache
> > and ideally have a good justification backed by real workloads.
> 
> Clean page cache still have valid contents, while clean MADV_FREE pages
> has no valid contents.  So penalty of discarding the clean page cache is
> reading from disk, while the penalty of discarding clean MADV_FREE pages
> is just page allocation and zeroing.

And "just page allocation and zeroing" overhead is the primary
motivation to keep the page in memory. It is a decision of the workload
to use MADV_FREE because chances are that this will speed things up. All
that with the contract that the memory goes away under memory pressure,
so with good workload/memory sizing you do not really lose that
optimization. Now you want to make a decision on behalf of the consumer
of the MADV_FREE memory.

> I understand that MADV_FREE is another kind of cache and has its value.
> But in the original implementation, during migration, we have already
> freed the original "cache", then reallocate the cache elsewhere and
> copy.  This appears more like all pages are populated in mmap() always.
> I know there's value to populate all pages in mmap(), but does that need
> to be done always by default?

It is not. You have to explicitly request MAP_POPULATE to have a mapping
populated up front.
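
For example (sketch only): by default the physical pages behind an
anonymous mapping are allocated lazily on first touch, and MAP_POPULATE
is the explicit request for eager allocation:

  #include <stddef.h>
  #include <sys/mman.h>

  void *map_region(size_t len, int populate)
  {
          int flags = MAP_PRIVATE | MAP_ANONYMOUS;

          if (populate)
                  flags |= MAP_POPULATE;  /* pre-fault the pages now */

          /* Without MAP_POPULATE, pages are allocated on first access. */
          return mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
  }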
Mel Gorman March 2, 2020, 3:16 p.m. UTC | #13
On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote:
> Mel Gorman <mgorman@suse.de> writes:
> 
> > On Fri, Feb 28, 2020 at 04:55:40PM +0800, Huang, Ying wrote:
> >> > E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will
> >> > report currently free pages to the hypervisor, which will MADV_FREE the
> >> > reported memory. As long as there is no memory pressure, there is no
> >> > need to actually free the pages. Once the guest reuses such a page, it
> >> > could happen that there is still the old page and pulling in in a fresh
> >> > (zeroed) page can be avoided.
> >> >
> >> > AFAIKs, after your change, we would get more pages discarded from our
> >> > guest, resulting in more fresh (zeroed) pages having to be pulled in
> >> > when a guest touches a reported free page again. But OTOH, page
> >> > migration is speed up (avoiding to migrate these pages).
> >> 
> >> Let's look at this problem in another perspective.  To migrate the
> >> MADV_FREE pages of the QEMU process from the node A to the node B, we
> >> need to free the original pages in the node A, and (maybe) allocate the
> >> same number of pages in the node B.  So the question becomes
> >> 
> >> - we may need to allocate some pages in the node B
> >> - these pages may be accessed by the application or not
> >> - we should allocate all these pages in advance or allocate them lazily
> >>   when they are accessed.
> >> 
> >> We thought the common philosophy in Linux kernel is to allocate lazily.
> >> 
> >
> > I also think there needs to be an example of a real application that
> > benefits from this new behaviour. Consider the possible sources of page
> > migration
> >
> > 1. NUMA balancing -- The application has to read/write the data for this
> >    to trigger. In the case of write, MADV_FREE is cancelled and it's
> >    mostly likely going to be a write unless it's an application bug.
> > 2. sys_movepages -- the application has explictly stated the data is in
> >    use on a particular node yet any MADV_FREE page gets discarded
> > 3. Compaction -- there may be no memory pressure at all but the
> >    MADV_FREE memory is discarded prematurely
> >
> > In the first case, the data is explicitly in use, most likely due to
> > a write in which case it's inappropriate to discard. Discarding and
> > reallocating a zero'd page is not expected.
> 
> If my understanding were correct, NUMA balancing will not try to migrate
> clean MADV_FREE pages most of the times.  Because the lazily freed pages
> may be zeroed at any time, it makes almost no sense to read them.  So,
> the first access after being freed lazily should be writing.  That will
> make the page dirty MADV_FREE, so will not be discarded.  And the first
> writing access in a new node will not trigger migration usually because
> of the two-stage filter in should_numa_migrate_memory().  So current
> behavior in case 1 will not change most of the times after the patchset.
> 

Yes, so in this case the series neither helps nor hurts NUMA balancing.

> > In second case, the data is likely in use or else why would the system
> > call be used?
> 
> If the pages were in use, they shouldn't be clean MADV_FREE pages.  So
> no behavior change after this patchset.
> 

You cannot guarantee that. The application could be caching them
optimistically as long as they stay resident until memory pressure
forces them out. Consider something like an in-memory object database.
For very old objects, it might decide to mark them MADV_FREE using a
header at the start to detect when a page still has valid data. The
intent would be that under memory pressure, the hot data would be
preserved as long as possible. I'm not aware of such an application; it
simply is a valid use case.

Similarly, a malloc implementation may be using MADV_FREE to mark freed
buffers so they can be quickly reused. Now if such an implementation uses
sys_movepages, it takes the alloc+zero hit and the strategy is less useful.

> > In the third case the timing of when MADV_FREE pages disappear is
> > arbitrary as it can happen without any actual memory pressure.  This
> > may or may not be problematic but it leads to unpredictable latencies
> > for applications that use MADV_FREE for a quick malloc/free
> > implementation.  Before, as long as there is no pressure, the reuse of
> > a MADV_FREE incurs just a small penalty but now with compaction it
> > does not matter if the system avoids memory pressure because they may
> > still incur a fault to allocate and zero a new page.
> 
> Yes.  This is a real problem.  Previously we thought that the migration
> is kind of
> 
> - unmap and free the old pages
> - map the new pages
> 
> If we can allocate new pages lazily in mmap(), why cannot we allocate
> new pages lazily in migrating pages too if possible (for clean
> MADV_CLEAN pages) because its second stage is kind of mmap() too.  But
> you remind us that there are some differences,
> 
> - mmap() is called by the application directly, so its effect is
>   predictable.  While some migration like that in the compaction isn't
>   called by the application, so may have unpredictable behavior.
> - With mmap(), the application can choose to allocate new pages lazily
>   or eagerly.  But we lacks the same mechanism in this patchset.
> 
> So, maybe we can make the mechanism more flexible.  That is, let the
> administrator or the application choose the right behavior for them, via
> some system level configuration knob or API flags.  For example, if the
> memory allocation latency isn't critical for the workloads, the
> administrator can choose to discard clean MADV_FREE pages to make
> compaction quicker and easier?
> 

That will be the type of knob that is almost impossible to tune for.
More on this later.

> > <SNIP>
> >
> > This may lead to an odd scenario whereby applications occupy dram
> > prematurely due to a single reference of a MADV_FREE page.
> 
> The better way to deal with this is to enhance mem_cgroup to limit DRAM
> usage in a memory tiering system?
> 

I did not respond to this properly because it's a separate discussion on
implementation details of something that does not exist. I only
mentioned it to highlight a potential hazard that could raise its head
later.

> > It's all subtle enough that we really should have an example application
> > in mind that benefits so we can weigh the benefits against the potential
> > risks.
> 
> If some applications cannot tolerate the latency incurred by the memory
> allocation and zeroing.  Then we cannot discard instead of migrate
> always.  While in some situations, less memory pressure can help.  So
> it's better to let the administrator and the application choose the
> right behavior in the specific situation?
> 

Is there an application you have in mind that benefits from discarding
MADV_FREE pages instead of migrating them?

Allowing the administrator or application to tune this would be very
problematic. An application would require an update to the system call
to take advantage of it and then detect if the running kernel supports
it. An administrator would have to detect that MADV_FREE pages are being
prematurely discarded, leading to a slowdown, and that is hard to detect.
It could be inferred from monitoring compaction stats and checking
if compaction activity is correlated with higher minor faults in the
target application. Proving the correlation would require using the perf
software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses
to MADV_FREE regions that were freed prematurely. That is not an obvious
debugging step to take when an application detects latency spikes.
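
For reference, a sketch of opening such a counter (there is no glibc
wrapper for perf_event_open(), so the raw syscall is used; error handling
omitted):

  #include <linux/perf_event.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Count minor page faults of the calling process on any CPU. */
  int open_minflt_counter(void)
  {
          struct perf_event_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.type = PERF_TYPE_SOFTWARE;
          attr.config = PERF_COUNT_SW_PAGE_FAULTS_MIN;

          return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
  }

The returned fd can then be read() before and after reusing an MADV_FREE
region to see whether the reuse faults correlate with compaction activity.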

Now, you could add a counter specifically for MADV_FREE pages freed for
reasons other than memory pressure and hope the administrator knows about
the counter and what it means. That type of knowledge could take a long
time to spread so it's really very important that there is evidence of
an application that suffers due to the current MADV_FREE and migration
behaviour.
Huang, Ying March 3, 2020, 12:25 a.m. UTC | #14
David Hildenbrand <david@redhat.com> writes:

> On 02.03.20 15:12, Huang, Ying wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>>> On Fri 28-02-20 16:55:40, Huang, Ying wrote:
>>>> David Hildenbrand <david@redhat.com> writes:
>>> [...]
>>>>> E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will
>>>>> report currently free pages to the hypervisor, which will MADV_FREE the
>>>>> reported memory. As long as there is no memory pressure, there is no
>>>>> need to actually free the pages. Once the guest reuses such a page, it
>>>>> could happen that there is still the old page and pulling in in a fresh
>>>>> (zeroed) page can be avoided.
>>>>>
>>>>> AFAIKs, after your change, we would get more pages discarded from our
>>>>> guest, resulting in more fresh (zeroed) pages having to be pulled in
>>>>> when a guest touches a reported free page again. But OTOH, page
>>>>> migration is speed up (avoiding to migrate these pages).
>>>>
>>>> Let's look at this problem in another perspective.  To migrate the
>>>> MADV_FREE pages of the QEMU process from the node A to the node B, we
>>>> need to free the original pages in the node A, and (maybe) allocate the
>>>> same number of pages in the node B.  So the question becomes
>>>>
>>>> - we may need to allocate some pages in the node B
>>>> - these pages may be accessed by the application or not
>>>> - we should allocate all these pages in advance or allocate them lazily
>>>>   when they are accessed.
>>>>
>>>> We thought the common philosophy in Linux kernel is to allocate lazily.
>>>
>>> The common philosophy is to cache as much as possible.
>> 
>> Yes.  This is another common philosophy.  And MADV_FREE pages is
>> different from caches such as the page caches because it has no valid
>> contents.
>
> Side note: It might contain valid content until discarded/zeroed out.
> E.g., an application could use a marker bit (e.g., first bit) to detect
> if the page still contains valid data or not. If the data is still
> marked valid, the content could be reuse immediately. Not sure if there
> is any such application, though :)

I don't think this is the typical use case.  But I admit that this is
possible.

Best Regards,
Huang, Ying
Huang, Ying March 3, 2020, 1:30 a.m. UTC | #15
Michal Hocko <mhocko@kernel.org> writes:

> On Mon 02-03-20 22:12:53, Huang, Ying wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
> [...]
>> > And MADV_FREE pages are a kind of cache as well. If the target node is
>> > short on memory then those will be reclaimed as a cache so a
>> > pro-active freeing sounds counter productive as you do not have any
>> > idea whether that cache is going to be used in future. In other words
>> > you are not going to free a clean page cache if you want to use that
>> > memory as a migration target right? So you should make a clear case
>> > about why MADV_FREE cache is less important than the clean page cache
>> > and ideally have a good justification backed by real workloads.
>> 
>> Clean page cache still have valid contents, while clean MADV_FREE pages
>> has no valid contents.  So penalty of discarding the clean page cache is
>> reading from disk, while the penalty of discarding clean MADV_FREE pages
>> is just page allocation and zeroing.
>
> And "just page allocation and zeroing" overhead is the primary
> motivation to keep the page in memory. It is a decision of the workload
> to use MADV_FREE because chances are that this will speed things up. All
> that with a contract that the memory goes away under memory pressure so
> with a good workload/memory sizing you do not really lose that
> optimization. Now you want to make decision on behalf of the consumer of
> the MADV_FREE memory.

I understand that MADV_FREE helps in some situations.  And if the
application wants to keep the "contract" after migration, it should
have a way to do that.

>> I understand that MADV_FREE is another kind of cache and has its value.
>> But in the original implementation, during migration, we have already
>> freed the original "cache", then reallocate the cache elsewhere and
>> copy.  This appears more like all pages are populated in mmap() always.
>> I know there's value to populate all pages in mmap(), but does that need
>> to be done always by default?
>
> It is not. You have to explicitly request MAP_POPULATE to initialize
> mmap.

Yes.  mmap() can control whether to populate the underlying physical
pages.  But when migrating MADV_FREE pages, there's no such control: all
pages are always populated again by default.  Maybe we should avoid doing
that in some situations too.

Best Regards,
Huang, Ying
Huang, Ying March 3, 2020, 1:51 a.m. UTC | #16
Mel Gorman <mgorman@suse.de> writes:
> On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote:
>> If some applications cannot tolerate the latency incurred by the memory
>> allocation and zeroing.  Then we cannot discard instead of migrate
>> always.  While in some situations, less memory pressure can help.  So
>> it's better to let the administrator and the application choose the
>> right behavior in the specific situation?
>> 
>
> Is there an application you have in mind that benefits from discarding
> MADV_FREE pages instead of migrating them?
>
> Allowing the administrator or application to tune this would be very
> problematic. An application would require an update to the system call
> to take advantage of it and then detect if the running kernel supports
> it. An administrator would have to detect that MADV_FREE pages are being
> prematurely discarded leading to a slowdown and that is hard to detect.
> It could be inferred from monitoring compaction stats and checking
> if compaction activity is correlated with higher minor faults in the
> target application. Proving the correlation would require using the perf
> software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses
> to MADV_FREE regions that were freed prematurely. That is not an obvious
> debugging step to take when an application detects latency spikes.
>
> Now, you could add a counter specifically for MADV_FREE pages freed for
> reasons other than memory pressure and hope the administrator knows about
> the counter and what it means. That type of knowledge could take a long
> time to spread so it's really very important that there is evidence of
> an application that suffers due to the current MADV_FREE and migration
> behaviour.

OK.  I understand that this patchset isn't a universal win, so we need
some way to justify it.  I will try to find some application for that.

Another thought: as proposed by David Hildenbrand, it may be a universal
win to discard clean MADV_FREE pages when migrating if there is already
memory pressure on the target node.  For example, if the free memory on
the target node is lower than the high watermark?

Best Regards,
Huang, Ying
Michal Hocko March 3, 2020, 8:09 a.m. UTC | #17
On Tue 03-03-20 09:51:56, Huang, Ying wrote:
> Mel Gorman <mgorman@suse.de> writes:
> > On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote:
> >> If some applications cannot tolerate the latency incurred by the memory
> >> allocation and zeroing.  Then we cannot discard instead of migrate
> >> always.  While in some situations, less memory pressure can help.  So
> >> it's better to let the administrator and the application choose the
> >> right behavior in the specific situation?
> >> 
> >
> > Is there an application you have in mind that benefits from discarding
> > MADV_FREE pages instead of migrating them?
> >
> > Allowing the administrator or application to tune this would be very
> > problematic. An application would require an update to the system call
> > to take advantage of it and then detect if the running kernel supports
> > it. An administrator would have to detect that MADV_FREE pages are being
> > prematurely discarded leading to a slowdown and that is hard to detect.
> > It could be inferred from monitoring compaction stats and checking
> > if compaction activity is correlated with higher minor faults in the
> > target application. Proving the correlation would require using the perf
> > software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses
> > to MADV_FREE regions that were freed prematurely. That is not an obvious
> > debugging step to take when an application detects latency spikes.
> >
> > Now, you could add a counter specifically for MADV_FREE pages freed for
> > reasons other than memory pressure and hope the administrator knows about
> > the counter and what it means. That type of knowledge could take a long
> > time to spread so it's really very important that there is evidence of
> > an application that suffers due to the current MADV_FREE and migration
> > behaviour.
> 
> OK.  I understand that this patchset isn't a universal win, so we need
> some way to justify it.  I will try to find some application for that.
> 
> Another thought, as proposed by David Hildenbrand, it's may be a
> universal win to discard clean MADV_FREE pages when migrating if there are
> already memory pressure on the target node.  For example, if the free
> memory on the target node is lower than high watermark?

This is already happening: if the target node is short on memory it
will start to reclaim, and if MADV_FREE pages are at the tail of the
inactive file LRU list then they will be dropped. Please note how that
follows proper aging and doesn't introduce any special casing. Really,
MADV_FREE is an inactive cache for anonymous memory and we treat it like
inactive page cache. This is not carved in stone, of course, but it
really requires very good justification to change.
Michal Hocko March 3, 2020, 8:19 a.m. UTC | #18
On Tue 03-03-20 09:30:28, Huang, Ying wrote:
[...]
> Yes.  mmap() can control whether to populate the underlying physical
> pages.

Right, because many use cases benefit from it. They simply know that the
mapping will be used completely and it is worth saving the #PF overhead.
See, there is a clear justification for that policy.

> But for migrating MADV_FREE pages, there's no control, all pages
> will be populated again always by default.  Maybe we should avoid to do
> that in some situations too.

Now let's have a look here. It is the userspace that decided to mark
MADV_FREE pages. It is under its full control which pages are to be
freed lazily. If the userspace wants to move those pages then it is
likely aware they have been marked MADV_FREE, right? If the userspace
wanted to save migration overhead then it could either choose not to
migrate those pages or simply unmap them right away. So in the end we are
talking about saving munmap/MADV_DONTNEED calls or potentially more
move_pages calls to skip over MADV_FREE holes. Which is all nice, but is
there any userspace that really does care? Because this is a fundamental
question here, and it doesn't make much sense to discuss this left to
right unless this is clear.
Huang, Ying March 3, 2020, 8:47 a.m. UTC | #19
Michal Hocko <mhocko@kernel.org> writes:

> On Tue 03-03-20 09:51:56, Huang, Ying wrote:
>> Mel Gorman <mgorman@suse.de> writes:
>> > On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote:
>> >> If some applications cannot tolerate the latency incurred by the memory
>> >> allocation and zeroing.  Then we cannot discard instead of migrate
>> >> always.  While in some situations, less memory pressure can help.  So
>> >> it's better to let the administrator and the application choose the
>> >> right behavior in the specific situation?
>> >> 
>> >
>> > Is there an application you have in mind that benefits from discarding
>> > MADV_FREE pages instead of migrating them?
>> >
>> > Allowing the administrator or application to tune this would be very
>> > problematic. An application would require an update to the system call
>> > to take advantage of it and then detect if the running kernel supports
>> > it. An administrator would have to detect that MADV_FREE pages are being
>> > prematurely discarded leading to a slowdown and that is hard to detect.
>> > It could be inferred from monitoring compaction stats and checking
>> > if compaction activity is correlated with higher minor faults in the
>> > target application. Proving the correlation would require using the perf
>> > software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses
>> > to MADV_FREE regions that were freed prematurely. That is not an obvious
>> > debugging step to take when an application detects latency spikes.
>> >
>> > Now, you could add a counter specifically for MADV_FREE pages freed for
>> > reasons other than memory pressure and hope the administrator knows about
>> > the counter and what it means. That type of knowledge could take a long
>> > time to spread so it's really very important that there is evidence of
>> > an application that suffers due to the current MADV_FREE and migration
>> > behaviour.
>> 
>> OK.  I understand that this patchset isn't a universal win, so we need
>> some way to justify it.  I will try to find some application for that.
>> 
>> Another thought, as proposed by David Hildenbrand, it's may be a
>> universal win to discard clean MADV_FREE pages when migrating if there are
>> already memory pressure on the target node.  For example, if the free
>> memory on the target node is lower than high watermark?
>
> This is already happening because if the target node is short on memory
> it will start to reclaim and if MADV_FREE pages are at the tail of
> inactive file LRU list then they will be dropped. Please note how that
> follows proper aging and doesn't introduce any special casing. Really
> MADV_FREE is an inactive cache for anonymous memory and we treat it like
> inactive page cache. This is not carved in stone of course but it really
> requires very good justification to change.

If my understanding were correct, the newly migrated clean MADV_FREE
pages will be put at the head of inactive file LRU list instead of the
tail.  So it's possible that some useful file cache pages will be
reclaimed.

Best Regards,
Huang, Ying
Michal Hocko March 3, 2020, 8:58 a.m. UTC | #20
On Tue 03-03-20 16:47:54, Huang, Ying wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Tue 03-03-20 09:51:56, Huang, Ying wrote:
> >> Mel Gorman <mgorman@suse.de> writes:
> >> > On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote:
> >> >> If some applications cannot tolerate the latency incurred by the memory
> >> >> allocation and zeroing.  Then we cannot discard instead of migrate
> >> >> always.  While in some situations, less memory pressure can help.  So
> >> >> it's better to let the administrator and the application choose the
> >> >> right behavior in the specific situation?
> >> >> 
> >> >
> >> > Is there an application you have in mind that benefits from discarding
> >> > MADV_FREE pages instead of migrating them?
> >> >
> >> > Allowing the administrator or application to tune this would be very
> >> > problematic. An application would require an update to the system call
> >> > to take advantage of it and then detect if the running kernel supports
> >> > it. An administrator would have to detect that MADV_FREE pages are being
> >> > prematurely discarded leading to a slowdown and that is hard to detect.
> >> > It could be inferred from monitoring compaction stats and checking
> >> > if compaction activity is correlated with higher minor faults in the
> >> > target application. Proving the correlation would require using the perf
> >> > software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses
> >> > to MADV_FREE regions that were freed prematurely. That is not an obvious
> >> > debugging step to take when an application detects latency spikes.
> >> >
> >> > Now, you could add a counter specifically for MADV_FREE pages freed for
> >> > reasons other than memory pressure and hope the administrator knows about
> >> > the counter and what it means. That type of knowledge could take a long
> >> > time to spread so it's really very important that there is evidence of
> >> > an application that suffers due to the current MADV_FREE and migration
> >> > behaviour.
> >> 
> >> OK.  I understand that this patchset isn't a universal win, so we need
> >> some way to justify it.  I will try to find some application for that.
> >> 
> >> Another thought, as proposed by David Hildenbrand, it's may be a
> >> universal win to discard clean MADV_FREE pages when migrating if there are
> >> already memory pressure on the target node.  For example, if the free
> >> memory on the target node is lower than high watermark?
> >
> > This is already happening because if the target node is short on memory
> > it will start to reclaim and if MADV_FREE pages are at the tail of
> > inactive file LRU list then they will be dropped. Please note how that
> > follows proper aging and doesn't introduce any special casing. Really
> > MADV_FREE is an inactive cache for anonymous memory and we treat it like
> > inactive page cache. This is not carved in stone of course but it really
> > requires very good justification to change.
> 
> If my understanding were correct, the newly migrated clean MADV_FREE
> pages will be put at the head of inactive file LRU list instead of the
> tail.  So it's possible that some useful file cache pages will be
> reclaimed.

This is the case also when you migrate other pages, right? We simply
cannot preserve the aging.
Huang, Ying March 3, 2020, 11:36 a.m. UTC | #21
Michal Hocko <mhocko@kernel.org> writes:

> On Tue 03-03-20 09:30:28, Huang, Ying wrote:
> [...]
>> Yes.  mmap() can control whether to populate the underlying physical
>> pages.
>
> right because many usecases benefit from it. They simply know that the
> mapping will be used completely and it is worth saving overhead for #PF.
> See. there is a clear justification for that policy.
>
>> But for migrating MADV_FREE pages, there's no control, all pages
>> will be populated again always by default.  Maybe we should avoid to do
>> that in some situations too.
>
> Now let's have a look here. It is the userspace that decided to mark
> MADV_FREE pages. It is under its full control which pages are to be
> freed lazily. If the userspace wants to move those pages then it is
> likely aware they have been MADV_FREE, right? If the userspace wanted to
> save migration overhead then it could either choose to not migrate those
> pages or simply unmap them right away. So in the end we are talking
> about saving munmap/MADV_DONTNEED or potentially more move_pages calls
> to skip over MADV_FREE holes. Which is all nice but is there any
> userspace that really does care? Because this is a fundamental question
> here and it doesn't make much sense to discuss this left to right unless
> this is clear.

Although I don't agree with you, I don't want to continue, because I
feel that the discussion may be too general to go anywhere.  I admit
that I went to the general side first, sorry about that.

Best Regards,
Huang, Ying
Huang, Ying March 3, 2020, 11:49 a.m. UTC | #22
Michal Hocko <mhocko@kernel.org> writes:

> On Tue 03-03-20 16:47:54, Huang, Ying wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > On Tue 03-03-20 09:51:56, Huang, Ying wrote:
>> >> Mel Gorman <mgorman@suse.de> writes:
>> >> > On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote:
>> >> >> If some applications cannot tolerate the latency incurred by the memory
>> >> >> allocation and zeroing.  Then we cannot discard instead of migrate
>> >> >> always.  While in some situations, less memory pressure can help.  So
>> >> >> it's better to let the administrator and the application choose the
>> >> >> right behavior in the specific situation?
>> >> >> 
>> >> >
>> >> > Is there an application you have in mind that benefits from discarding
>> >> > MADV_FREE pages instead of migrating them?
>> >> >
>> >> > Allowing the administrator or application to tune this would be very
>> >> > problematic. An application would require an update to the system call
>> >> > to take advantage of it and then detect if the running kernel supports
>> >> > it. An administrator would have to detect that MADV_FREE pages are being
>> >> > prematurely discarded leading to a slowdown and that is hard to detect.
>> >> > It could be inferred from monitoring compaction stats and checking
>> >> > if compaction activity is correlated with higher minor faults in the
>> >> > target application. Proving the correlation would require using the perf
>> >> > software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses
>> >> > to MADV_FREE regions that were freed prematurely. That is not an obvious
>> >> > debugging step to take when an application detects latency spikes.
>> >> >
>> >> > Now, you could add a counter specifically for MADV_FREE pages freed for
>> >> > reasons other than memory pressure and hope the administrator knows about
>> >> > the counter and what it means. That type of knowledge could take a long
>> >> > time to spread so it's really very important that there is evidence of
>> >> > an application that suffers due to the current MADV_FREE and migration
>> >> > behaviour.
>> >> 
>> >> OK.  I understand that this patchset isn't a universal win, so we need
>> >> some way to justify it.  I will try to find some application for that.
>> >> 
>> >> Another thought, as proposed by David Hildenbrand, it's may be a
>> >> universal win to discard clean MADV_FREE pages when migrating if there are
>> >> already memory pressure on the target node.  For example, if the free
>> >> memory on the target node is lower than high watermark?
>> >
>> > This is already happening because if the target node is short on memory
>> > it will start to reclaim and if MADV_FREE pages are at the tail of
>> > inactive file LRU list then they will be dropped. Please note how that
>> > follows proper aging and doesn't introduce any special casing. Really
>> > MADV_FREE is an inactive cache for anonymous memory and we treat it like
>> > inactive page cache. This is not carved in stone of course but it really
>> > requires very good justification to change.
>> 
>> If my understanding were correct, the newly migrated clean MADV_FREE
>> pages will be put at the head of inactive file LRU list instead of the
>> tail.  So it's possible that some useful file cache pages will be
>> reclaimed.
>
> This is the case also when you migrate other pages, right? We simply
> cannot preserve the aging.

So you consider the priority of the clean MADV_FREE pages is same as
that of page cache pages?  Because the penalty difference is so large, I
think it may be a good idea to always put clean MADV_FREE pages at the
tail of the inactive file LRU list?

Best Regards,
Huang, Ying
Mel Gorman March 3, 2020, 1:02 p.m. UTC | #23
On Tue, Mar 03, 2020 at 09:51:56AM +0800, Huang, Ying wrote:
> Mel Gorman <mgorman@suse.de> writes:
> > On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote:
> >> If some applications cannot tolerate the latency incurred by the memory
> >> allocation and zeroing.  Then we cannot discard instead of migrate
> >> always.  While in some situations, less memory pressure can help.  So
> >> it's better to let the administrator and the application choose the
> >> right behavior in the specific situation?
> >> 
> >
> > Is there an application you have in mind that benefits from discarding
> > MADV_FREE pages instead of migrating them?
> >
> > Allowing the administrator or application to tune this would be very
> > problematic. An application would require an update to the system call
> > to take advantage of it and then detect if the running kernel supports
> > it. An administrator would have to detect that MADV_FREE pages are being
> > prematurely discarded leading to a slowdown and that is hard to detect.
> > It could be inferred from monitoring compaction stats and checking
> > if compaction activity is correlated with higher minor faults in the
> > target application. Proving the correlation would require using the perf
> > software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses
> > to MADV_FREE regions that were freed prematurely. That is not an obvious
> > debugging step to take when an application detects latency spikes.
> >
> > Now, you could add a counter specifically for MADV_FREE pages freed for
> > reasons other than memory pressure and hope the administrator knows about
> > the counter and what it means. That type of knowledge could take a long
> > time to spread so it's really very important that there is evidence of
> > an application that suffers due to the current MADV_FREE and migration
> > behaviour.
> 
> OK.  I understand that this patchset isn't a universal win, so we need
> some way to justify it.  I will try to find some application for that.
> 
> Another thought, as proposed by David Hildenbrand, it's may be a
> universal win to discard clean MADV_FREE pages when migrating if there are
> already memory pressure on the target node.  For example, if the free
> memory on the target node is lower than high watermark?
> 

That is an extremely specific corner case that is not likely to occur.
NUMA balancing is not going to migrate a MADV_FREE page under these
circumstances: a write cancels MADV_FREE, and a read attempt will
probably fail to allocate a destination page in alloc_misplaced_dst_page,
so the data gets lost instead of remaining remote. sys_movepages is a
possibility, but the likelihood of an application deliberately trying to
migrate to a loaded node is low. Compaction never migrates cross-node, so
the state of a remote node under pressure does not matter.

Once again, there needs to be a reasonable use case to be able to
meaningfully balance between the benefits and risks of changing the
MADV_FREE semantics.
Huang, Ying March 4, 2020, 12:33 a.m. UTC | #24
Mel Gorman <mgorman@suse.de> writes:

> On Tue, Mar 03, 2020 at 09:51:56AM +0800, Huang, Ying wrote:
>> Mel Gorman <mgorman@suse.de> writes:
>> > On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote:
>> >> If some applications cannot tolerate the latency incurred by the memory
>> >> allocation and zeroing.  Then we cannot discard instead of migrate
>> >> always.  While in some situations, less memory pressure can help.  So
>> >> it's better to let the administrator and the application choose the
>> >> right behavior in the specific situation?
>> >> 
>> >
>> > Is there an application you have in mind that benefits from discarding
>> > MADV_FREE pages instead of migrating them?
>> >
>> > Allowing the administrator or application to tune this would be very
>> > problematic. An application would require an update to the system call
>> > to take advantage of it and then detect if the running kernel supports
>> > it. An administrator would have to detect that MADV_FREE pages are being
>> > prematurely discarded leading to a slowdown and that is hard to detect.
>> > It could be inferred from monitoring compaction stats and checking
>> > if compaction activity is correlated with higher minor faults in the
>> > target application. Proving the correlation would require using the perf
>> > software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses
>> > to MADV_FREE regions that were freed prematurely. That is not an obvious
>> > debugging step to take when an application detects latency spikes.
>> >
>> > Now, you could add a counter specifically for MADV_FREE pages freed for
>> > reasons other than memory pressure and hope the administrator knows about
>> > the counter and what it means. That type of knowledge could take a long
>> > time to spread so it's really very important that there is evidence of
>> > an application that suffers due to the current MADV_FREE and migration
>> > behaviour.
>> 
>> OK.  I understand that this patchset isn't a universal win, so we need
>> some way to justify it.  I will try to find some application for that.
>> 
>> Another thought, as proposed by David Hildenbrand, it's may be a
>> universal win to discard clean MADV_FREE pages when migrating if there are
>> already memory pressure on the target node.  For example, if the free
>> memory on the target node is lower than high watermark?
>> 
>
> That is an extremely specific corner case that is not likely to occur.
> NUMA balancing is not going to migrate a MADV_FREE page under these
> circumstances: a write cancels MADV_FREE, and a read attempt will
> probably fail to allocate a destination page in alloc_misplaced_dst_page,
> so the data gets lost instead of remaining remote. sys_movepages is a
> possibility, but the likelihood of an application deliberately trying to
> migrate to a loaded node is low. Compaction never migrates cross-node, so
> the state of a remote node under pressure does not matter.
>
> Once again, there needs to be a reasonable use case to be able to
> meaningfully balance between the benefits and risks of changing the
> MADV_FREE semantics.

OK.  Will try to find some workloads for this.

Best Regards,
Huang, Ying
Michal Hocko March 4, 2020, 9:58 a.m. UTC | #25
On Tue 03-03-20 19:49:53, Huang, Ying wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Tue 03-03-20 16:47:54, Huang, Ying wrote:
> >> Michal Hocko <mhocko@kernel.org> writes:
> >> 
> >> > On Tue 03-03-20 09:51:56, Huang, Ying wrote:
> >> >> Mel Gorman <mgorman@suse.de> writes:
> >> >> > On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote:
> >> >> >> If some applications cannot tolerate the latency incurred by the memory
> >> >> >> allocation and zeroing.  Then we cannot discard instead of migrate
> >> >> >> always.  While in some situations, less memory pressure can help.  So
> >> >> >> it's better to let the administrator and the application choose the
> >> >> >> right behavior in the specific situation?
> >> >> >> 
> >> >> >
> >> >> > Is there an application you have in mind that benefits from discarding
> >> >> > MADV_FREE pages instead of migrating them?
> >> >> >
> >> >> > Allowing the administrator or application to tune this would be very
> >> >> > problematic. An application would require an update to the system call
> >> >> > to take advantage of it and then detect if the running kernel supports
> >> >> > it. An administrator would have to detect that MADV_FREE pages are being
> >> >> > prematurely discarded leading to a slowdown and that is hard to detect.
> >> >> > It could be inferred from monitoring compaction stats and checking
> >> >> > if compaction activity is correlated with higher minor faults in the
> >> >> > target application. Proving the correlation would require using the perf
> >> >> > software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses
> >> >> > to MADV_FREE regions that were freed prematurely. That is not an obvious
> >> >> > debugging step to take when an application detects latency spikes.
> >> >> >
> >> >> > Now, you could add a counter specifically for MADV_FREE pages freed for
> >> >> > reasons other than memory pressure and hope the administrator knows about
> >> >> > the counter and what it means. That type of knowledge could take a long
> >> >> > time to spread so it's really very important that there is evidence of
> >> >> > an application that suffers due to the current MADV_FREE and migration
> >> >> > behaviour.
> >> >> 
> >> >> OK.  I understand that this patchset isn't a universal win, so we need
> >> >> some way to justify it.  I will try to find some application for that.
> >> >> 
> >> >> Another thought, as proposed by David Hildenbrand, it's may be a
> >> >> universal win to discard clean MADV_FREE pages when migrating if there are
> >> >> already memory pressure on the target node.  For example, if the free
> >> >> memory on the target node is lower than high watermark?
> >> >
> >> > This is already happening because if the target node is short on memory
> >> > it will start to reclaim and if MADV_FREE pages are at the tail of
> >> > inactive file LRU list then they will be dropped. Please note how that
> >> > follows proper aging and doesn't introduce any special casing. Really
> >> > MADV_FREE is an inactive cache for anonymous memory and we treat it like
> >> > inactive page cache. This is not carved in stone of course but it really
> >> > requires very good justification to change.
> >> 
> >> If my understanding were correct, the newly migrated clean MADV_FREE
> >> pages will be put at the head of inactive file LRU list instead of the
> >> tail.  So it's possible that some useful file cache pages will be
> >> reclaimed.
> >
> > This is the case also when you migrate other pages, right? We simply
> > cannot preserve the aging.
> 
> So you consider the priority of the clean MADV_FREE pages is same as
> that of page cache pages?

This is how MADV_FREE has been implemented, yes. See f7ad2a6cb9f7 ("mm:
move MADV_FREE pages into LRU_INACTIVE_FILE list") for the
justification.

> Because the penalty difference is so large, I
> think it may be a good idea to always put clean MADV_FREE pages at the
> tail of the inactive file LRU list?

You are again making assumptions without giving any actual real
examples. The cost of reconstructing MADV_FREE pages can differ a lot;
it really depends on the specific usecase. Moving pages to the tail of
the LRU would make them the primary candidate for reclaim with a strange
LIFO semantic. Adding them to the head might not be the universal win
but it will at least provide a reasonable FIFO semantic. I also find
it much easier to reason about MADV_FREE as an inactive cache.
Mel Gorman March 4, 2020, 10:56 a.m. UTC | #26
On Wed, Mar 04, 2020 at 10:58:02AM +0100, Michal Hocko wrote:
> > >> If my understanding were correct, the newly migrated clean MADV_FREE
> > >> pages will be put at the head of inactive file LRU list instead of the
> > >> tail.  So it's possible that some useful file cache pages will be
> > >> reclaimed.
> > >
> > > This is the case also when you migrate other pages, right? We simply
> > > cannot preserve the aging.
> > 
> > So you consider the priority of the clean MADV_FREE pages is same as
> > that of page cache pages?
> 
> This is how MADV_FREE has been implemented, yes. See f7ad2a6cb9f7 ("mm:
> move MADV_FREE pages into LRU_INACTIVE_FILE list") for the
> justification.
> 
> > Because the penalty difference is so large, I
> > think it may be a good idea to always put clean MADV_FREE pages at the
> > tail of the inactive file LRU list?
> 
> You are again making assumptions without giving any actual real
> examples. Reconstructing MADV_FREE pages cost can differ a lot. This
> really depends on the specific usecase. Moving pages to the tail of LRU
> would make them the primary candidate for the reclaim with a strange
> LIFO semantic. Adding them to the head might be not the universal win
> but it will at least provide a reasonable FIFO semantic. I also find
> it much more easier to reason about MADV_FREE as an inactive cache.

I tend to agree, that would make MADV_FREE behave more like a
PageReclaim page that gets tagged for immediate reclaim when writeback
completes. Immediate reclaim is in response to heavy memory pressure where
there is trouble finding clean file pages to reclaim and dirty/writeback
pages are getting artificially preserved over hot-but-clean file
pages. That is a clear inversion of the order pages should be reclaimed
and is justified.  While there *might* be a basis for reclaiming MADV_FREE
sooner rather than later, there would have to be some evidence of a page
inversion problem where a known hot page was getting reclaimed before
MADV_FREE pages. For example, it could easily be considered a bug to free
MADV_FREE pages over a page that was last touched at boot time.

About the only real concern I could find about MADV_FREE is that it
keeps RSS artificially high relative to MADV_DONTNEED in the absence of
memory pressure. In some cases userspace provided a way of switching to
MADV_DONTNEED at startup time to determine if there is a memory leak or
just MADV_FREE keeping pages resident. They probably would have benefitted
from a counter noting the number of MADV_FREE pages in the system as
opposed to the vmstat event or some other way of distinguishing real RSS
from MADV_FREE.

However, I can't find a bug report indicating that MADV_FREE pages were
pushing hot pages out to disk (be it file-backed or anonymous).
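
For reference, a rough sketch of what can be read today, assuming a
kernel that exposes LazyFree in /proc/<pid>/smaps_rollup and the
pglazyfree / pglazyfreed events in /proc/vmstat (error handling and
parsing kept deliberately simple):

#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Per-process MADV_FREE residency, in kB, or -1 if not reported. */
static long read_lazyfree_kb(pid_t pid)
{
        char path[64], line[256];
        long kb = -1;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/smaps_rollup", (int)pid);
        f = fopen(path, "r");
        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "LazyFree: %ld kB", &kb) == 1)
                        break;
        fclose(f);
        return kb;
}

/* System-wide events: pages marked lazy free vs. actually reclaimed. */
static void print_lazyfree_events(void)
{
        char line[256];
        FILE *f = fopen("/proc/vmstat", "r");

        if (!f)
                return;
        while (fgets(line, sizeof(line), f))
                if (!strncmp(line, "pglazyfree", 10))
                        fputs(line, stdout);
        fclose(f);
}

The vmstat lines are cumulative events, not a population count; getting
the current system-wide total still means walking every process's smaps,
which is the gap described above.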
Huang, Ying March 4, 2020, 11:15 a.m. UTC | #27
Michal Hocko <mhocko@kernel.org> writes:

> On Tue 03-03-20 19:49:53, Huang, Ying wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > On Tue 03-03-20 16:47:54, Huang, Ying wrote:
>> >> Michal Hocko <mhocko@kernel.org> writes:
>> >> 
>> >> > On Tue 03-03-20 09:51:56, Huang, Ying wrote:
>> >> >> Mel Gorman <mgorman@suse.de> writes:
>> >> >> > On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote:
>> >> >> >> If some applications cannot tolerate the latency incurred by the memory
>> >> >> >> allocation and zeroing.  Then we cannot discard instead of migrate
>> >> >> >> always.  While in some situations, less memory pressure can help.  So
>> >> >> >> it's better to let the administrator and the application choose the
>> >> >> >> right behavior in the specific situation?
>> >> >> >> 
>> >> >> >
>> >> >> > Is there an application you have in mind that benefits from discarding
>> >> >> > MADV_FREE pages instead of migrating them?
>> >> >> >
>> >> >> > Allowing the administrator or application to tune this would be very
>> >> >> > problematic. An application would require an update to the system call
>> >> >> > to take advantage of it and then detect if the running kernel supports
>> >> >> > it. An administrator would have to detect that MADV_FREE pages are being
>> >> >> > prematurely discarded leading to a slowdown and that is hard to detect.
>> >> >> > It could be inferred from monitoring compaction stats and checking
>> >> >> > if compaction activity is correlated with higher minor faults in the
>> >> >> > target application. Proving the correlation would require using the perf
>> >> >> > software event PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the addresses
>> >> >> > to MADV_FREE regions that were freed prematurely. That is not an obvious
>> >> >> > debugging step to take when an application detects latency spikes.
>> >> >> >
>> >> >> > Now, you could add a counter specifically for MADV_FREE pages freed for
>> >> >> > reasons other than memory pressure and hope the administrator knows about
>> >> >> > the counter and what it means. That type of knowledge could take a long
>> >> >> > time to spread so it's really very important that there is evidence of
>> >> >> > an application that suffers due to the current MADV_FREE and migration
>> >> >> > behaviour.
>> >> >> 
>> >> >> OK.  I understand that this patchset isn't a universal win, so we need
>> >> >> some way to justify it.  I will try to find some application for that.
>> >> >> 
>> >> >> Another thought, as proposed by David Hildenbrand, it's may be a
>> >> >> universal win to discard clean MADV_FREE pages when migrating if there are
>> >> >> already memory pressure on the target node.  For example, if the free
>> >> >> memory on the target node is lower than high watermark?
>> >> >
>> >> > This is already happening because if the target node is short on memory
>> >> > it will start to reclaim and if MADV_FREE pages are at the tail of
>> >> > inactive file LRU list then they will be dropped. Please note how that
>> >> > follows proper aging and doesn't introduce any special casing. Really
>> >> > MADV_FREE is an inactive cache for anonymous memory and we treat it like
>> >> > inactive page cache. This is not carved in stone of course but it really
>> >> > requires very good justification to change.
>> >> 
>> >> If my understanding were correct, the newly migrated clean MADV_FREE
>> >> pages will be put at the head of inactive file LRU list instead of the
>> >> tail.  So it's possible that some useful file cache pages will be
>> >> reclaimed.
>> >
>> > This is the case also when you migrate other pages, right? We simply
>> > cannot preserve the aging.
>> 
>> So you consider the priority of the clean MADV_FREE pages is same as
>> that of page cache pages?
>
> This is how MADV_FREE has been implemented, yes. See f7ad2a6cb9f7 ("mm:
> move MADV_FREE pages into LRU_INACTIVE_FILE list") for the
> justification.

Thanks for the information.  It's really helpful!

>> Because the penalty difference is so large, I
>> think it may be a good idea to always put clean MADV_FREE pages at the
>> tail of the inactive file LRU list?
>
> You are again making assumptions without giving any actual real
> examples. Reconstructing MADV_FREE pages cost can differ a lot.

In which situation the cost to reconstruct MADV_FREE pages can be higher
than the cost to allocate file cache page and read from disk?  Heavy
contention on mmap_sem?

> This really depends on the specific usecase. Moving pages to the tail
> of LRU would make them the primary candidate for the reclaim with a
> strange LIFO semantic. Adding them to the head might be not the
> universal win but it will at least provide a reasonable FIFO
> semantic. I also find it much more easier to reason about MADV_FREE as
> an inactive cache.

Yes.  FIFO is more reasonable than LIFO.

Best Regards,
Huang, Ying
Michal Hocko March 4, 2020, 11:26 a.m. UTC | #28
On Wed 04-03-20 19:15:20, Huang, Ying wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Tue 03-03-20 19:49:53, Huang, Ying wrote:
[...]
> >> Because the penalty difference is so large, I
> >> think it may be a good idea to always put clean MADV_FREE pages at the
> >> tail of the inactive file LRU list?
> >
> > You are again making assumptions without giving any actual real
> > examples. Reconstructing MADV_FREE pages cost can differ a lot.
> 
> In which situation the cost to reconstruct MADV_FREE pages can be higher
> than the cost to allocate file cache page and read from disk?  Heavy
> contention on mmap_sem?

Allocating a page might be really costly, but even without that, the
cost of reconstructing the content of the page can be really high, and
actually much larger than an IO to get a page from the storage. Consider
decompression or some other transformations to get the content.
Userspace has means to detect that the content is still up-to-date, as
has already been mentioned, so all those steps can be avoided.
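
A minimal sketch of that kind of usage, for illustration only: a cache
of expensively reconstructed (say, decompressed) objects.  The marker
byte convention is the application's own, not anything the kernel
provides; 'obj' is assumed page-aligned, and one marker per page would
be needed for multi-page objects.

#include <string.h>
#include <sys/mman.h>           /* MADV_FREE needs Linux >= 4.5 */

#define CACHE_MAGIC     0x5a    /* any non-zero byte works as a marker */

/* Give the pages back lazily; the marker was written at fill time. */
static void cache_release(void *obj, size_t size)
{
        madvise(obj, size, MADV_FREE);
}

/*
 * On reuse, read (do not write!) the marker first.  A read does not
 * cancel MADV_FREE; if the kernel has already reclaimed the page it
 * reads back as zero, and the content must be reconstructed, e.g. by
 * decompressing again.  Only then start writing, which turns the pages
 * back into ordinary dirty anonymous memory.
 */
static int cache_reuse(unsigned char *obj, size_t size)
{
        if (obj[0] != CACHE_MAGIC) {
                /* content was discarded: rebuild it here */
                memset(obj, 0, size);           /* placeholder rebuild */
                obj[0] = CACHE_MAGIC;
                return 1;                       /* had to reconstruct */
        }
        return 0;                               /* content survived */
}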
Huang, Ying March 5, 2020, 1:42 a.m. UTC | #29
Mel Gorman <mgorman@suse.de> writes:

> On Wed, Mar 04, 2020 at 10:58:02AM +0100, Michal Hocko wrote:
>> > >> If my understanding were correct, the newly migrated clean MADV_FREE
>> > >> pages will be put at the head of inactive file LRU list instead of the
>> > >> tail.  So it's possible that some useful file cache pages will be
>> > >> reclaimed.
>> > >
>> > > This is the case also when you migrate other pages, right? We simply
>> > > cannot preserve the aging.
>> > 
>> > So you consider the priority of the clean MADV_FREE pages is same as
>> > that of page cache pages?
>> 
>> This is how MADV_FREE has been implemented, yes. See f7ad2a6cb9f7 ("mm:
>> move MADV_FREE pages into LRU_INACTIVE_FILE list") for the
>> justification.
>> 
>> > Because the penalty difference is so large, I
>> > think it may be a good idea to always put clean MADV_FREE pages at the
>> > tail of the inactive file LRU list?
>> 
>> You are again making assumptions without giving any actual real
>> examples. Reconstructing MADV_FREE pages cost can differ a lot. This
>> really depends on the specific usecase. Moving pages to the tail of LRU
>> would make them the primary candidate for the reclaim with a strange
>> LIFO semantic. Adding them to the head might be not the universal win
>> but it will at least provide a reasonable FIFO semantic. I also find
>> it much more easier to reason about MADV_FREE as an inactive cache.
>
> I tend to agree, that would make MADV_FREE behave more like a
> PageReclaim page that gets tagged for immediate reclaim when writeback
> completes. Immediate reclaim is in response to heavy memory pressure where
> there is trouble finding clean file pages to reclaim and dirty/writeback
> pages are getting artifically preserved over hot-but-clean file
> pages. That is a clear inversion of the order pages should be reclaimed
> and is justified.  While there *might* be a basis for reclaiming MADV_FREE
> sooner rather than later, there would have to be some evidence of a Page
> inversion problem where a known hot page was getting reclaimed before
> MADV_FREE pages. For example, it could easily be considered a bug to free
> MADV_FREE pages over a page that was last touched at boot time.

Yes.  This sounds reasonable.  Although the current solution isn't
perfect, it can avoid the really bad situation described above.

Best Regards,
Huang, Ying
Huang, Ying March 5, 2020, 1:45 a.m. UTC | #30
Michal Hocko <mhocko@kernel.org> writes:

> On Wed 04-03-20 19:15:20, Huang, Ying wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > On Tue 03-03-20 19:49:53, Huang, Ying wrote:
> [...]
>> >> Because the penalty difference is so large, I
>> >> think it may be a good idea to always put clean MADV_FREE pages at the
>> >> tail of the inactive file LRU list?
>> >
>> > You are again making assumptions without giving any actual real
>> > examples. Reconstructing MADV_FREE pages cost can differ a lot.
>> 
>> In which situation the cost to reconstruct MADV_FREE pages can be higher
>> than the cost to allocate file cache page and read from disk?  Heavy
>> contention on mmap_sem?
>
> Allocating a page might be really costly but even without that.

A file cache page needs to be allocated too.

> Reconstructing the content of the page can be really high and actually
> much larger than an IO to get a page from the storage. Consider
> decompression or some other transformations to get the content.
> Userspace has means to detect that the content is still up-to-date as
> already has been mentioned so all those steps can be avoided.

Although I don't believe anyone has really done this, it's possible, at
least in theory.

Best Regards,
Huang, Ying
Mel Gorman March 5, 2020, 10:48 a.m. UTC | #31
On Wed, Mar 04, 2020 at 07:15:20PM +0800, Huang, Ying wrote:
> In which situation the cost to reconstruct MADV_FREE pages can be higher
> than the cost to allocate file cache page and read from disk?  Heavy
> contention on mmap_sem?
> 

MADV_FREE should be anonymous only

if (behavior == MADV_FREE)
                return madvise_free_single_vma(vma, start, end);

.....

static int madvise_free_single_vma(struct vm_area_struct *vma,
                        unsigned long start_addr, unsigned long end_addr)
{
        struct mm_struct *mm = vma->vm_mm;
        struct mmu_notifier_range range;
        struct mmu_gather tlb;

        /* MADV_FREE works for only anon vma at the moment */
        if (!vma_is_anonymous(vma))
                return -EINVAL;

So the question is not applicable. For anonymous memory, the cost of
updating a PTE is lower than allocating a page, zeroing it and updating
the PTE.
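
A crude way to see that difference, if anyone wants numbers, is to time
re-touching a populated region after MADV_FREE versus after
MADV_DONTNEED; an untested sketch, with an arbitrarily chosen size:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define SZ      (256UL << 20)           /* 256 MB, arbitrary */

static double touch_after(int advice)
{
        char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct timespec a, b;
        size_t i;

        if (p == MAP_FAILED)
                return -1.0;
        memset(p, 1, SZ);               /* populate */
        madvise(p, SZ, advice);

        /*
         * MADV_FREE leaves the pages mapped, so this loop only
         * re-dirties them; after MADV_DONTNEED every touch faults in a
         * new zeroed page first.
         */
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (i = 0; i < SZ; i += 4096)
                p[i] = 2;
        clock_gettime(CLOCK_MONOTONIC, &b);
        munmap(p, SZ);
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
        printf("MADV_FREE:     %.3f s\n", touch_after(MADV_FREE));
        printf("MADV_DONTNEED: %.3f s\n", touch_after(MADV_DONTNEED));
        return 0;
}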

It has been repeatedly stated now for almost a week that a semantic
change to MADV_FREE should be based on a problem encountered by a real
application that can benefit from the new semantics. I think the only
concrete outcome has been that userspace potentially benefits if the total
number of MADV_FREE pages is reported globally. Even that is marginal as
smaps has the information to tell the difference between high RSS due to
a memory leak and high RSS usage due to MADV_FREE. The /proc/vmstats for
MADV_FREE are of marginal benefit given that they do not tell us much
about the current number of MADV_FREE pages in the system.
Huang, Ying March 6, 2020, 4:05 a.m. UTC | #32
Mel Gorman <mgorman@suse.de> writes:

> On Wed, Mar 04, 2020 at 07:15:20PM +0800, Huang, Ying wrote:
>> In which situation the cost to reconstruct MADV_FREE pages can be higher
>> than the cost to allocate file cache page and read from disk?  Heavy
>> contention on mmap_sem?
>> 
>
> MADV_FREE should be anonymous only
>
> if (behavior == MADV_FREE)
>                 return madvise_free_single_vma(vma, start, end);
>
> .....
>
> static int madvise_free_single_vma(struct vm_area_struct *vma,
>                         unsigned long start_addr, unsigned long end_addr)
> {
>         struct mm_struct *mm = vma->vm_mm;
>         struct mmu_notifier_range range;
>         struct mmu_gather tlb;
>
>         /* MADV_FREE works for only anon vma at the moment */
>         if (!vma_is_anonymous(vma))
>                 return -EINVAL
>
> So the question is not applicable. For anonymous memory, the cost of
> updating a PTE is lower than allocating a page, zeroing it and updating
> the PTE.

Sorry for the confusion.  The original question is NOT about comparing the
reconstruction cost between MADV_FREE anon pages and MADV_FREE file
pages, BUT about comparing the reconstruction cost between MADV_FREE
anon pages and ordinary clean file cache pages.  You can find more
details in the conversation between Michal and me.

> It has been repeatedly stated now for almost a week that a semantic
> change to MADV_FREE should be based on a problem encountered by a real
> application that can benefit from the new semantics. I think the only
> concrete outcome has been that userspace potentially benefits if the total
> number of MADV_FREE pages is reported globally. Even that is marginal as
> smaps has the information to tell the difference between high RSS due to
> a memory leak and high RSS usage due to MADV_FREE. The /proc/vmstats for
> MADV_FREE are of marginal benefit given that they do not tell us much
> about the current number of MADV_FREE pages in the system.

Got it!  Thanks a lot for your patience and sharing.  That's very
helpful.

Best Regards,
Huang, Ying
Huang, Ying March 9, 2020, 5:26 a.m. UTC | #33
Mel Gorman <mgorman@suse.de> writes:

> I think the only concrete outcome has been that userspace potentially
> benefits if the total number of MADV_FREE pages is reported
> globally. Even that is marginal as smaps has the information to tell
> the difference between high RSS due to a memory leak and high RSS
> usage due to MADV_FREE. The /proc/vmstats for MADV_FREE are of
> marginal benefit given that they do not tell us much about the current
> number of MADV_FREE pages in the system.

We can implement a counter for MADV_FREE that increases when we
ClearPageSwapBacked() and decreases when we SetPageSwapBacked() for an
anonymous page.  But this cannot count lazily freed pages correctly,
because when a clean MADV_FREE page becomes dirty there's no page fault,
so we will not be notified.  And you have never run into MADV_FREE
issues other than the memory-leak debugging case...
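
Something along these lines, purely as a sketch: NR_LAZYFREE is a
made-up node_stat_item, locking is ignored, and the helpers would have
to be called from wherever the kernel clears or sets PG_swapbacked for
anonymous pages today.  As said above, it still cannot see a clean
MADV_FREE page being re-dirtied, because no fault happens then.

#include <linux/mm.h>
#include <linux/page-flags.h>
#include <linux/vmstat.h>

/* Hypothetical: NR_LAZYFREE would be a new enum node_stat_item. */
static inline void account_lazyfree(struct page *page)
{
        if (PageSwapBacked(page)) {
                ClearPageSwapBacked(page);
                mod_node_page_state(page_pgdat(page), NR_LAZYFREE, 1);
        }
}

static inline void unaccount_lazyfree(struct page *page)
{
        if (!PageSwapBacked(page)) {
                SetPageSwapBacked(page);
                mod_node_page_state(page_pgdat(page), NR_LAZYFREE, -1);
        }
}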

Best Regards,
Huang, Ying