
[v1,09/14] mm: multigenerational lru: mm_struct list

Message ID 20210313075747.3781593-10-yuzhao@google.com (mailing list archive)
State New, archived
Series: Multigenerational LRU

Commit Message

Yu Zhao March 13, 2021, 7:57 a.m. UTC
Add an infrastructure that maintains either a system-wide mm_struct
list or per-memcg mm_struct lists. Multiple threads can concurrently
work on the same mm_struct list, and each of them will be given a
different mm_struct. Those who finish early can optionally wait on the
rest after the iterator has reached the end of the list.

This infrastructure also tracks whether an mm_struct is being used on
any CPUs or has been used since the last time a worker looked at it.
In other words, workers will not be given an mm_struct that belongs to
a process that has been sleeping.

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 fs/exec.c                  |   2 +
 include/linux/memcontrol.h |   4 +
 include/linux/mm_types.h   | 135 +++++++++++++++++++
 include/linux/mmzone.h     |   2 -
 kernel/exit.c              |   1 +
 kernel/fork.c              |  10 ++
 kernel/kthread.c           |   1 +
 kernel/sched/core.c        |   2 +
 mm/memcontrol.c            |  28 ++++
 mm/vmscan.c                | 263 +++++++++++++++++++++++++++++++++++++
 10 files changed, 446 insertions(+), 2 deletions(-)
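
For illustration, here is a minimal sketch of the usage pattern the
commit message describes. Every name below is hypothetical and does not
match the interface added by this patch; it only shows worker threads
sharing one mm_struct list, each being handed a different mm_struct,
with early finishers optionally waiting for the rest:

  /*
   * Hypothetical consumer of the mm_struct list infrastructure.
   * example_next_mm() is assumed to hand out a different mm_struct to
   * each caller, to skip mm_structs whose owners have been sleeping
   * since the last walk, and to return NULL at the end of the list.
   */
  static void example_worker(struct example_mm_list *list)
  {
          struct mm_struct *mm = NULL;

          while ((mm = example_next_mm(list, mm)))
                  example_scan_mm(mm);    /* per-mm aging work */

          /* Workers that finish early can wait for the others here. */
          example_wait_for_walkers(list);
  }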

Comments

Rik van Riel March 15, 2021, 7:40 p.m. UTC | #1
On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:

> +/*
> + * After pages are faulted in, they become the youngest generation.
> They must
> + * go through aging process twice before they can be evicted. After
> first scan,
> + * their accessed bit set during initial faults are cleared and they
> become the
> + * second youngest generation. And second scan makes sure they
> haven't been used
> + * since the first.
> + */

I have to wonder if the reductions in OOM kills and
low-memory tab discards are due to this aging policy
change, rather than to the switch to virtual scanning.
Huang, Ying March 16, 2021, 2:07 a.m. UTC | #2
Rik van Riel <riel@surriel.com> writes:

> On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:
>
>> +/*
>> + * After pages are faulted in, they become the youngest generation.
>> They must
>> + * go through aging process twice before they can be evicted. After
>> first scan,
>> + * their accessed bit set during initial faults are cleared and they
>> become the
>> + * second youngest generation. And second scan makes sure they
>> haven't been used
>> + * since the first.
>> + */
>
> I have to wonder if the reductions in OOM kills and 
> low-memory tab discards is due to this aging policy
> change, rather than from the switch to virtual scanning.

If my understanding is correct, the temperature of the processes is
considered in addition to that of the individual pages.  That is, the
pages of the processes that haven't been scheduled since the previous
scan will not be scanned.  I guess that this helps reduce OOM kills?

If so, how about just taking advantage of that information for OOM
killing and page reclaiming?  For example, if a process hasn't been
scheduled for a long time, just reclaim its private pages.

Best Regards,
Huang, Ying
Yu Zhao March 16, 2021, 3:57 a.m. UTC | #3
On Tue, Mar 16, 2021 at 10:07:36AM +0800, Huang, Ying wrote:
> Rik van Riel <riel@surriel.com> writes:
> 
> > On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:
> >
> >> +/*
> >> + * After pages are faulted in, they become the youngest generation.
> >> They must
> >> + * go through aging process twice before they can be evicted. After
> >> first scan,
> >> + * their accessed bit set during initial faults are cleared and they
> >> become the
> >> + * second youngest generation. And second scan makes sure they
> >> haven't been used
> >> + * since the first.
> >> + */
> >
> > I have to wonder if the reductions in OOM kills and 
> > low-memory tab discards is due to this aging policy
> > change, rather than from the switch to virtual scanning.

There are no policy changes per se. The current page reclaim also
scans a faulted-in page at least twice before it can reclaim it.
That said, the new aging yields a better overall result because it
discovers every page that has been referenced since the last scan,
in addition to what Ying has mentioned. The current page scan stops
once it finds enough candidates, which may seem more efficient, but
actually pays the price for not finding the best.

> If my understanding were correct, the temperature of the processes is
> considered in addition to that of the individual pages.  That is, the
> pages of the processes that haven't been scheduled after the previous
> scanning will not be scanned.  I guess that this helps OOM kills?

Yes, that's correct.

> If so, how about just take advantage of that information for OOM killing
> and page reclaiming?  For example, if a process hasn't been scheduled
> for long time, just reclaim its private pages.

This is how it works. Pages that haven't been scanned grow older
automatically because those that have been scanned will be tagged with
younger generation numbers. Eviction does a bucket sort based on
generation numbers and attacks the oldest.
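
For illustration, a minimal sketch of what the bucket sort looks like
(the names and the number of generations are made up, not the data
structures from this series): pages are binned into per-generation
lists, and eviction always takes from the oldest non-empty bin.

  #define EXAMPLE_NR_GENS 4       /* illustrative only */

  struct example_lruvec {
          unsigned long max_seq;  /* youngest generation number */
          unsigned long min_seq;  /* oldest generation number */
          struct list_head lists[EXAMPLE_NR_GENS];  /* one bucket per gen */
  };

  /* Eviction attacks the oldest non-empty generation. */
  static struct list_head *example_oldest_bucket(struct example_lruvec *l)
  {
          while (l->min_seq < l->max_seq &&
                 list_empty(&l->lists[l->min_seq % EXAMPLE_NR_GENS]))
                  l->min_seq++;   /* that generation has emptied out */

          return &l->lists[l->min_seq % EXAMPLE_NR_GENS];
  }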
Huang, Ying March 16, 2021, 6:44 a.m. UTC | #4
Yu Zhao <yuzhao@google.com> writes:

> On Tue, Mar 16, 2021 at 10:07:36AM +0800, Huang, Ying wrote:
>> Rik van Riel <riel@surriel.com> writes:
>> 
>> > On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:
>> >
>> >> +/*
>> >> + * After pages are faulted in, they become the youngest generation.
>> >> They must
>> >> + * go through aging process twice before they can be evicted. After
>> >> first scan,
>> >> + * their accessed bit set during initial faults are cleared and they
>> >> become the
>> >> + * second youngest generation. And second scan makes sure they
>> >> haven't been used
>> >> + * since the first.
>> >> + */
>> >
>> > I have to wonder if the reductions in OOM kills and 
>> > low-memory tab discards is due to this aging policy
>> > change, rather than from the switch to virtual scanning.
>
> There are no policy changes per se. The current page reclaim also
> scans a faulted-in page at least twice before it can reclaim it.
> That said, the new aging yields a better overall result because it
> discovers every page that has been referenced since the last scan,
> in addition to what Ying has mentioned. The current page scan stops
> stops once it finds enough candidates, which may seem more
> efficiently, but actually pays the price for not finding the best.
>
>> If my understanding were correct, the temperature of the processes is
>> considered in addition to that of the individual pages.  That is, the
>> pages of the processes that haven't been scheduled after the previous
>> scanning will not be scanned.  I guess that this helps OOM kills?
>
> Yes, that's correct.
>
>> If so, how about just take advantage of that information for OOM killing
>> and page reclaiming?  For example, if a process hasn't been scheduled
>> for long time, just reclaim its private pages.
>
> This is how it works. Pages that haven't been scanned grow older
> automatically because those that have been scanned will be tagged with
> younger generation numbers. Eviction does bucket sort based on
> generation numbers and attacks the oldest.

Sorry, my original words were misleading.  What I wanted to ask is:
would it be good enough to

- Not change the core algorithm of the current page reclaim.

- Add some new logic to reclaim the private pages of processes that
  have not been scheduled for a long enough time, regardless of their
  Accessed bits (sketched below).  This can be done before the normal
  page reclaim.
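
  A hypothetical sketch of that second step (task_idle_for() and
  reclaim_private_pages() do not exist in the kernel; they only stand
  in for "hasn't been scheduled for long enough" and "reclaim its
  private pages", and locking/refcounting details are omitted):

    static void example_reclaim_idle_tasks(unsigned long min_idle)
    {
            struct task_struct *p;

            rcu_read_lock();
            for_each_process(p) {
                    if (task_idle_for(p) < min_idle)  /* hypothetical */
                            continue;
                    reclaim_private_pages(p);         /* hypothetical */
            }
            rcu_read_unlock();
    }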

So this is one small-step improvement to the current page reclaim
algorithm that takes advantage of the scheduler information.  It's
clearly not as sophisticated as your new algorithm; for example, the
cold pages in hot processes will not be reclaimed in this stage.  But
it can reduce the overhead of scanning too.

All in all, some of your ideas may help the original LRU algorithm too.
Or some can be experimented without replacing the original algorithm.

But from another point of view, your solution can be seen as a kind of
improvement on top of the original LRU algorithm too.  It moves the
recently accessed pages to kind of multiple active lists based on
scanning page tables directly (instead of reversely).

Best Regards,
Huang, Ying
Yu Zhao March 16, 2021, 7:56 a.m. UTC | #5
On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
> Yu Zhao <yuzhao@google.com> writes:
> 
> > On Tue, Mar 16, 2021 at 10:07:36AM +0800, Huang, Ying wrote:
> >> Rik van Riel <riel@surriel.com> writes:
> >> 
> >> > On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:
> >> >
> >> >> +/*
> >> >> + * After pages are faulted in, they become the youngest generation.
> >> >> They must
> >> >> + * go through aging process twice before they can be evicted. After
> >> >> first scan,
> >> >> + * their accessed bit set during initial faults are cleared and they
> >> >> become the
> >> >> + * second youngest generation. And second scan makes sure they
> >> >> haven't been used
> >> >> + * since the first.
> >> >> + */
> >> >
> >> > I have to wonder if the reductions in OOM kills and 
> >> > low-memory tab discards is due to this aging policy
> >> > change, rather than from the switch to virtual scanning.
> >
> > There are no policy changes per se. The current page reclaim also
> > scans a faulted-in page at least twice before it can reclaim it.
> > That said, the new aging yields a better overall result because it
> > discovers every page that has been referenced since the last scan,
> > in addition to what Ying has mentioned. The current page scan stops
> > stops once it finds enough candidates, which may seem more
> > efficiently, but actually pays the price for not finding the best.
> >
> >> If my understanding were correct, the temperature of the processes is
> >> considered in addition to that of the individual pages.  That is, the
> >> pages of the processes that haven't been scheduled after the previous
> >> scanning will not be scanned.  I guess that this helps OOM kills?
> >
> > Yes, that's correct.
> >
> >> If so, how about just take advantage of that information for OOM killing
> >> and page reclaiming?  For example, if a process hasn't been scheduled
> >> for long time, just reclaim its private pages.
> >
> > This is how it works. Pages that haven't been scanned grow older
> > automatically because those that have been scanned will be tagged with
> > younger generation numbers. Eviction does bucket sort based on
> > generation numbers and attacks the oldest.
> 
> Sorry, my original words are misleading.  What I wanted to say was that
> is it good enough that
> 
> - Do not change the core algorithm of current page reclaiming.
> 
> - Add some new logic to reclaim the process private pages regardless of
>   the Accessed bits if the processes are not scheduled for some long
>   enough time.  This can be done before the normal page reclaiming.

This is a good idea, which is being used on Android and Chrome OS. We
call it per-process reclaim, and I've mentioned it here:
https://lore.kernel.org/linux-mm/YBkT6175GmMWBvw3@google.com/
  On Android, our most advanced simulation that generates memory
  pressure from realistic user behavior shows 18% fewer low-memory
  kills, which in turn reduces cold starts by 16%. This is on top of
  per-process reclaim, a predecessor of ``MADV_COLD`` and
  ``MADV_PAGEOUT``, against background apps.

The patches landed not long ago :) See mm/madvise.c
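
As a minimal user-space illustration of those hints (MADV_COLD
deactivates a range, MADV_PAGEOUT reclaims it; reclaiming another
process's memory needs something like process_madvise() instead, and
the fallback defines use the values from the Linux uapi headers):

  #include <stdio.h>
  #include <sys/mman.h>

  #ifndef MADV_COLD
  #define MADV_COLD 20      /* include/uapi/asm-generic/mman-common.h */
  #endif
  #ifndef MADV_PAGEOUT
  #define MADV_PAGEOUT 21
  #endif

  /* Hint that [addr, addr + len) is cold; optionally reclaim it now. */
  static void hint_cold_range(void *addr, size_t len, int reclaim_now)
  {
          if (madvise(addr, len, reclaim_now ? MADV_PAGEOUT : MADV_COLD))
                  perror("madvise");
  }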

> So this is an one small step improvement to the current page reclaiming
> algorithm via taking advantage of the scheduler information.  It's
> clearly not sophisticated as your new algorithm, for example, the cold
> pages in the hot processes will not be reclaimed in this stage.  But it
> can reduce the overhead of scanning too.

The general problems with the direction of per-process reclaim:
  1) we can't find the coldest pages, as you have mentioned.
  2) we can't reach file pages accessed via file descriptors only,
  especially those caching config files that were read only once.
  3) we can't reclaim lru pages and slab objects proportionally and
  therefore we leave many stale slab objects behind.
  4) we have to be proactive, as you suggested (once again, you were
  right), and this has a serious problem: clients' battery life can
  be affected.

The scanning overhead is only one of the two major problems of the
current page reclaim. The other problem is the granularity of the
active/inactive lists (their sizes). We stopped using them in making
job scheduling decisions a long time ago. I know another large
internet company adopted a similar approach to ours, and I'm wondering
how everybody else is coping with the discrepancy from those counters.

> All in all, some of your ideas may help the original LRU algorithm too.
> Or some can be experimented without replacing the original algorithm.
> 
> But from another point of view, your solution can be seen as a kind of
> improvement on top of the original LRU algorithm too.  It moves the
> recently accessed pages to kind of multiple active lists based on
> scanning page tables directly (instead of reversely).

We hope this series can be a framework or an infrastructure flexible
enough that people can build their complex use cases upon, e.g.,
proactive reclaim (machine-wide, not per process), cold memory
estimation (for job scheduling) and AEP demotion. Specifically, we
want people to use it with what you and Dave are working on here:
https://patchwork.kernel.org/project/linux-mm/cover/20210304235949.7922C1C3@viggo.jf.intel.com/
Huang, Ying March 17, 2021, 3:37 a.m. UTC | #6
Yu Zhao <yuzhao@google.com> writes:

> On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
>> Yu Zhao <yuzhao@google.com> writes:
>> 
>> > On Tue, Mar 16, 2021 at 10:07:36AM +0800, Huang, Ying wrote:
>> >> Rik van Riel <riel@surriel.com> writes:
>> >> 
>> >> > On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:
>> >> >
>> >> >> +/*
>> >> >> + * After pages are faulted in, they become the youngest generation.
>> >> >> They must
>> >> >> + * go through aging process twice before they can be evicted. After
>> >> >> first scan,
>> >> >> + * their accessed bit set during initial faults are cleared and they
>> >> >> become the
>> >> >> + * second youngest generation. And second scan makes sure they
>> >> >> haven't been used
>> >> >> + * since the first.
>> >> >> + */
>> >> >
>> >> > I have to wonder if the reductions in OOM kills and 
>> >> > low-memory tab discards is due to this aging policy
>> >> > change, rather than from the switch to virtual scanning.
>> >
>> > There are no policy changes per se. The current page reclaim also
>> > scans a faulted-in page at least twice before it can reclaim it.
>> > That said, the new aging yields a better overall result because it
>> > discovers every page that has been referenced since the last scan,
>> > in addition to what Ying has mentioned. The current page scan stops
>> > stops once it finds enough candidates, which may seem more
>> > efficiently, but actually pays the price for not finding the best.
>> >
>> >> If my understanding were correct, the temperature of the processes is
>> >> considered in addition to that of the individual pages.  That is, the
>> >> pages of the processes that haven't been scheduled after the previous
>> >> scanning will not be scanned.  I guess that this helps OOM kills?
>> >
>> > Yes, that's correct.
>> >
>> >> If so, how about just take advantage of that information for OOM killing
>> >> and page reclaiming?  For example, if a process hasn't been scheduled
>> >> for long time, just reclaim its private pages.
>> >
>> > This is how it works. Pages that haven't been scanned grow older
>> > automatically because those that have been scanned will be tagged with
>> > younger generation numbers. Eviction does bucket sort based on
>> > generation numbers and attacks the oldest.
>> 
>> Sorry, my original words are misleading.  What I wanted to say was that
>> is it good enough that
>> 
>> - Do not change the core algorithm of current page reclaiming.
>> 
>> - Add some new logic to reclaim the process private pages regardless of
>>   the Accessed bits if the processes are not scheduled for some long
>>   enough time.  This can be done before the normal page reclaiming.
>
> This is a good idea, which being used on Android and Chrome OS. We
> call it per-process reclaim, and I've mentioned here:
> https://lore.kernel.org/linux-mm/YBkT6175GmMWBvw3@google.com/
>   On Android, our most advanced simulation that generates memory
>   pressure from realistic user behavior shows 18% fewer low-memory
>   kills, which in turn reduces cold starts by 16%. This is on top of
>   per-process reclaim, a predecessor of ``MADV_COLD`` and
>   ``MADV_PAGEOUT``, against background apps.

Thanks, now I see your improvement compared with the per-process
reclaim.  How about the per-process reclaim compared with the normal
page reclaim for similar test cases?

My intention behind this is that your solution includes several
improvements:

a) take advantage of scheduler information
b) more fine-grained active/inactive dividing
c) page table scanning instead of rmap

Is it possible to evaluate the benefit of each step?  And is there
still some potential to optimize the current LRU-based algorithm
before adopting a totally new algorithm?

> The patches landed not long a ago :) See mm/madvise.c

:) I'm too out-dated.

>> So this is an one small step improvement to the current page reclaiming
>> algorithm via taking advantage of the scheduler information.  It's
>> clearly not sophisticated as your new algorithm, for example, the cold
>> pages in the hot processes will not be reclaimed in this stage.  But it
>> can reduce the overhead of scanning too.
>
> The general problems with the direction of per-process reclaim:
>   1) we can't find the coldest pages, as you have mentioned.
>   2) we can't reach file pages accessed via file descriptors only,
>   especially those caching config files that were read only once.

In theory this is possible: we can build an inode list based on
access time too.  Although this may not be necessary -- we can reclaim
the read-once file cache before the per-process reclaim in theory.

>   3) we can't reclaim lru pages and slab objects proportionally and
>   therefore we leave many stale slab objects behind.
>   4) we have to be proactive, as you suggested (once again, you were
>   right), and this has a serious problem: client's battery life can
>   be affected.

Why can this not be done reactively?  We can start per-process reclaim
under memory pressure.  This has been used on phones and laptops now,
so presumably there's a solution for this issue?

> The scanning overhead is only one of the two major problems of the
> current page reclaim. The other problem is the granularity of the
> active/inactive (sizes). We stopped using them in making job
> scheduling decision a long time ago. I know another large internet
> company adopted a similar approach as ours, and I'm wondering how
> everybody else is coping with the discrepancy from those counters.

Intuitively, the overhead of full page table scanning appears higher
than that of rmap scanning for a small portion of system memory.  But
from your words, you think the reality is the reverse?  If others are
concerned about the overhead too, I think you will eventually need to
prove, with more data and theory, that the overhead of page table
scanning isn't higher, or is even lower.

>> All in all, some of your ideas may help the original LRU algorithm too.
>> Or some can be experimented without replacing the original algorithm.
>> 
>> But from another point of view, your solution can be seen as a kind of
>> improvement on top of the original LRU algorithm too.  It moves the
>> recently accessed pages to kind of multiple active lists based on
>> scanning page tables directly (instead of reversely).
>
> We hope this series can be a framework or an infrastructure flexible
> enough that people can build their complex use cases upon, e.g.,
> proactive reclaim (machine-wide, not per process), cold memory
> estimation (for job scheduling), AEP demotion, specifically, we want
> people to use it with what you and Dave are working on here:
> https://patchwork.kernel.org/project/linux-mm/cover/20210304235949.7922C1C3@viggo.jf.intel.com/

Yes.  A better page reclaiming algorithm will help DRAM->PMEM demotion
a lot!

Best Regards,
Huang, Ying
Yu Zhao March 17, 2021, 10:46 a.m. UTC | #7
On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
> Yu Zhao <yuzhao@google.com> writes:
> 
> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
> >> Yu Zhao <yuzhao@google.com> writes:
> >> 
> >> > On Tue, Mar 16, 2021 at 10:07:36AM +0800, Huang, Ying wrote:
> >> >> Rik van Riel <riel@surriel.com> writes:
> >> >> 
> >> >> > On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:
> >> >> >
> >> >> >> +/*
> >> >> >> + * After pages are faulted in, they become the youngest generation.
> >> >> >> They must
> >> >> >> + * go through aging process twice before they can be evicted. After
> >> >> >> first scan,
> >> >> >> + * their accessed bit set during initial faults are cleared and they
> >> >> >> become the
> >> >> >> + * second youngest generation. And second scan makes sure they
> >> >> >> haven't been used
> >> >> >> + * since the first.
> >> >> >> + */
> >> >> >
> >> >> > I have to wonder if the reductions in OOM kills and 
> >> >> > low-memory tab discards is due to this aging policy
> >> >> > change, rather than from the switch to virtual scanning.
> >> >
> >> > There are no policy changes per se. The current page reclaim also
> >> > scans a faulted-in page at least twice before it can reclaim it.
> >> > That said, the new aging yields a better overall result because it
> >> > discovers every page that has been referenced since the last scan,
> >> > in addition to what Ying has mentioned. The current page scan stops
> >> > stops once it finds enough candidates, which may seem more
> >> > efficiently, but actually pays the price for not finding the best.
> >> >
> >> >> If my understanding were correct, the temperature of the processes is
> >> >> considered in addition to that of the individual pages.  That is, the
> >> >> pages of the processes that haven't been scheduled after the previous
> >> >> scanning will not be scanned.  I guess that this helps OOM kills?
> >> >
> >> > Yes, that's correct.
> >> >
> >> >> If so, how about just take advantage of that information for OOM killing
> >> >> and page reclaiming?  For example, if a process hasn't been scheduled
> >> >> for long time, just reclaim its private pages.
> >> >
> >> > This is how it works. Pages that haven't been scanned grow older
> >> > automatically because those that have been scanned will be tagged with
> >> > younger generation numbers. Eviction does bucket sort based on
> >> > generation numbers and attacks the oldest.
> >> 
> >> Sorry, my original words are misleading.  What I wanted to say was that
> >> is it good enough that
> >> 
> >> - Do not change the core algorithm of current page reclaiming.
> >> 
> >> - Add some new logic to reclaim the process private pages regardless of
> >>   the Accessed bits if the processes are not scheduled for some long
> >>   enough time.  This can be done before the normal page reclaiming.
> >
> > This is a good idea, which being used on Android and Chrome OS. We
> > call it per-process reclaim, and I've mentioned here:
> > https://lore.kernel.org/linux-mm/YBkT6175GmMWBvw3@google.com/
> >   On Android, our most advanced simulation that generates memory
> >   pressure from realistic user behavior shows 18% fewer low-memory
> >   kills, which in turn reduces cold starts by 16%. This is on top of
> >   per-process reclaim, a predecessor of ``MADV_COLD`` and
> >   ``MADV_PAGEOUT``, against background apps.
> 
> Thanks, now I see your improvement compared with the per-process
> reclaim.  How about the per-process reclaim compared with the normal
> page reclaiming for the similar test cases?
> 
> My intention behind this is that your solution includes several
> improvements,
> 
> a) take advantage of scheduler information
> b) more fine-grained active/inactive dividing
> c) page table scanning instead of rmap
> 
> Is it possible to evaluate the benefit of the each step?  And is there
> still some potential to optimize the current LRU based algorithm before
> adopting a totally new algorithm?

Well, there isn't really any new algorithm -- it's still the LRU
(algorithm). But I do see your point. In another survey we posted
here:
https://lore.kernel.org/linux-mm/YBkT6175GmMWBvw3@google.com/
we stated:
  Why not try to improve the existing code?
  -----------------------------------------
  We have tried but concluded the two limiting factors [note]_ in the
  existing code are fundamental, and therefore changes made atop them
  will not result in substantial gains on any of the aspects above.

We learned this the hard way.

> > The patches landed not long a ago :) See mm/madvise.c
> 
> :) I'm too out-dated.
> 
> >> So this is an one small step improvement to the current page reclaiming
> >> algorithm via taking advantage of the scheduler information.  It's
> >> clearly not sophisticated as your new algorithm, for example, the cold
> >> pages in the hot processes will not be reclaimed in this stage.  But it
> >> can reduce the overhead of scanning too.
> >
> > The general problems with the direction of per-process reclaim:
> >   1) we can't find the coldest pages, as you have mentioned.
> >   2) we can't reach file pages accessed via file descriptors only,
> >   especially those caching config files that were read only once.
> 
> In theory, this is possible, we can build a inode list based on the
> accessing time too.  Although this may not be necessary.  We can reclaim
> the read-once file cache before the per-process reclaim in theory.

You have to search for unmapped clean pages in page cache. Generally
searching page cache is a lot more expensive than walking lru lists
because page cache can be sparse but lru lists can't.

> >   3) we can't reclaim lru pages and slab objects proportionally and
> >   therefore we leave many stale slab objects behind.
> >   4) we have to be proactive, as you suggested (once again, you were
> >   right), and this has a serious problem: client's battery life can
> >   be affected.
> 
> Why can this not be done reactively? We can start per-process reclaim
> under memory pressure.

Under memory pressure, we could scan a lot of idle processes and find
nothing to reclaim, e.g., when they use mlockall(). In addition,
address spaces can be sparse too.

You are now looking at the direction of cold memory tracking using
page tables, which is not practical. Apparently this series has given
you a bad idea... Page tables are only good at discovering hot memory.
Take my word for it :)

> This has been used in phone and laptop now, so
> there's a solution for this issue?

madvise() is called based on user behavior, which we don't have in
kernel space.

> > The scanning overhead is only one of the two major problems of the
> > current page reclaim. The other problem is the granularity of the
> > active/inactive (sizes). We stopped using them in making job
> > scheduling decision a long time ago. I know another large internet
> > company adopted a similar approach as ours, and I'm wondering how
> > everybody else is coping with the discrepancy from those counters.
> 
> From intuition, the scanning overhead of the full page table scanning
> appears higher than that of the rmap scanning for a small portion of
> system memory.  But form your words, you think the reality is the
> reverse?  If others concern about the overhead too, finally, I think you
> need to prove the overhead of the page table scanning isn't too higher,
> or even lower with more data and theory.

There is a misunderstanding here. I never said anything about full
page table scanning. And this is not how it's done in this series
either. I guess the misunderstanding has something to do with the cold
memory tracking you are thinking about?

This series uses page tables to discover page accesses when a system
has run out of inactive pages. Under such a situation, the system is
very likely to have a lot of page accesses, and using the rmap is
likely to cost a lot more because of its poor memory locality compared
with page tables.

But, page tables can be sparse too, in terms of hot memory tracking.
Dave has asked me to test the worst case scenario, which I'll do.
And I'd be happy to share more data. Any specific workload you are
interested in?
Huang, Ying March 22, 2021, 3:13 a.m. UTC | #8
Yu Zhao <yuzhao@google.com> writes:

> On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
>> Yu Zhao <yuzhao@google.com> writes:
>> 
>> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
>> > The scanning overhead is only one of the two major problems of the
>> > current page reclaim. The other problem is the granularity of the
>> > active/inactive (sizes). We stopped using them in making job
>> > scheduling decision a long time ago. I know another large internet
>> > company adopted a similar approach as ours, and I'm wondering how
>> > everybody else is coping with the discrepancy from those counters.
>> 
>> From intuition, the scanning overhead of the full page table scanning
>> appears higher than that of the rmap scanning for a small portion of
>> system memory.  But form your words, you think the reality is the
>> reverse?  If others concern about the overhead too, finally, I think you
>> need to prove the overhead of the page table scanning isn't too higher,
>> or even lower with more data and theory.
>
> There is a misunderstanding here. I never said anything about full
> page table scanning. And this is not how it's done in this series
> either. I guess the misunderstanding has something to do with the cold
> memory tracking you are thinking about?

If my understanding is correct, from the following code path in your
patch 10/14,

age_active_anon
  age_lru_gens
    try_walk_mm_list
      walk_mm_list
        walk_mm

So, in kswapd(), the page tables of many processes may be scanned
fully.  If the number of active processes is high, the overhead may be
high too.

> This series uses page tables to discover page accesses when a system
> has run out of inactive pages. Under such a situation, the system is
> very likely to have a lot of page accesses, and using the rmap is
> likely to cost a lot more because its poor memory locality compared
> with page tables.

This is the theory.  Can you verify it with more data, including the
CPU cycles or time spent scanning page tables?

> But, page tables can be sparse too, in terms of hot memory tracking.
> Dave has asked me to test the worst case scenario, which I'll do.
> And I'd be happy to share more data. Any specific workload you are
> interested in?

We can start with some simple workloads that are easier to reason
about.  For example,

1. Run the workload with hot and cold pages; when the free memory
becomes lower than the low watermark, kswapd will be woken up to scan
and reclaim some cold pages.  How long will it take to do that?  It's
expected that almost all pages need to be scanned, so page table
scanning is expected to have less overhead.  We can measure how well
it does (a sketch of such a workload follows below).

2. Run the workload with hot and cold pages where the whole working
set cannot fit in DRAM, that is, the cold pages will be reclaimed and
swapped in regularly (for example tens of MB/s).  It's expected that
fewer pages may be scanned with rmap, but page table scanning is
faster.

3. Run the workload with hot and cold pages where the system is
overcommitted, that is, some cold pages will be placed in swap.  But
the cold pages are cold enough, so there's almost no thrashing.  Then
the hot working set of the workload changes, that is, some hot pages
become cold while some cold pages become hot, so page reclaiming and
swapin will be triggered.

For each case, we can use different parameters.  And we can measure
things like the number of pages scanned, the time taken to scan them,
the number of pages reclaimed and swapped in, etc.
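
If it helps, case 1 can be approximated with something as simple as the
sketch below (the sizes and the hot/cold split are arbitrary
parameters, not values from this thread): fault in a large anonymous
region once, then keep re-touching only a small hot part so the rest
becomes cold.

  #include <string.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t total = (size_t)4 << 30;  /* 4GB working set, 64-bit only */
          size_t hot = total / 10;         /* ~10% stays hot */
          char *buf = mmap(NULL, total, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (buf == MAP_FAILED)
                  return 1;

          memset(buf, 1, total);           /* fault in hot and cold pages */

          for (;;) {                       /* keep only the hot part warm */
                  for (size_t i = 0; i < hot; i += 4096)
                          buf[i]++;
          }
  }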

Best Regards,
Huang, Ying
Yu Zhao March 22, 2021, 8:08 a.m. UTC | #9
On Mon, Mar 22, 2021 at 11:13:19AM +0800, Huang, Ying wrote:
> Yu Zhao <yuzhao@google.com> writes:
> 
> > On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
> >> Yu Zhao <yuzhao@google.com> writes:
> >> 
> >> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
> >> > The scanning overhead is only one of the two major problems of the
> >> > current page reclaim. The other problem is the granularity of the
> >> > active/inactive (sizes). We stopped using them in making job
> >> > scheduling decision a long time ago. I know another large internet
> >> > company adopted a similar approach as ours, and I'm wondering how
> >> > everybody else is coping with the discrepancy from those counters.
> >> 
> >> From intuition, the scanning overhead of the full page table scanning
> >> appears higher than that of the rmap scanning for a small portion of
> >> system memory.  But form your words, you think the reality is the
> >> reverse?  If others concern about the overhead too, finally, I think you
> >> need to prove the overhead of the page table scanning isn't too higher,
> >> or even lower with more data and theory.
> >
> > There is a misunderstanding here. I never said anything about full
> > page table scanning. And this is not how it's done in this series
> > either. I guess the misunderstanding has something to do with the cold
> > memory tracking you are thinking about?
> 
> If my understanding were correct, from the following code path in your
> patch 10/14,
> 
> age_active_anon
>   age_lru_gens
>     try_walk_mm_list
>       walk_mm_list
>         walk_mm
> 
> So, in kswapd(), the page tables of many processes may be scanned
> fully.  If the number of processes that are active are high, the
> overhead may be high too.

That's correct. Just in case we have different definitions of what we
call "full":

  I understand it as the full range of the address space of a process
  that was loaded by switch_mm() at least once since the last scan.
  This is not the case because we don't scan the full range -- we skip
  holes and VMAs that are unevictable, as well as PTE tables that have
  no accessed entries on x86_64, by should_skip_vma() and
  CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG.
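
  For illustration, roughly the kind of VMA filtering meant here (this
  is not the actual should_skip_vma() from the series; the checks are
  just examples):

    static bool example_skip_vma(struct vm_area_struct *vma)
    {
            if (vma->vm_flags & (VM_LOCKED | VM_SPECIAL))
                    return true;    /* unevictable or special mappings */
            if (is_vm_hugetlb_page(vma))
                    return true;    /* managed separately */
            if (!vma->anon_vma && !vma->vm_file)
                    return true;    /* nothing faulted in yet */
            return false;
    }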

  If you are referring to the full range of PTE tables that have at
  least one accessed entry, i.e., the other 511 entries are not none but
  have not been accessed since the last scan either on x86_64, then yes,
  you are right again :) This is the worst-case scenario.
  
> > This series uses page tables to discover page accesses when a system
> > has run out of inactive pages. Under such a situation, the system is
> > very likely to have a lot of page accesses, and using the rmap is
> > likely to cost a lot more because its poor memory locality compared
> > with page tables.
> 
> This is the theory.  Can you verify this with more data?  Including the
> CPU cycles or time spent scanning page tables?

Yes, I'll be happy to do so as I should, because page table scanning
is counterintuitive. Let me add more theory in case it's still unclear
to others.

From my understanding, the two fundamental questions we need to
consider in terms of page reclaim are:

  What sizes of hot clusters (spatial locality) should we expect under
  memory pressure?

  On smaller systems with 4GB memory, our observations are that the
  average size of hot clusters found during each scan is 32KB. On
  larger systems with hundreds of gigabytes of memory, it's well
  above this value -- 512KB or larger. These values vary under
  different workloads and with different memory allocators. Unless
  done deliberately by memory allocators, e.g., Scudo as I've
  mentioned earlier, it's safe to say if a PTE entry has been
  accessed, its neighbors are likely to have been accessed too.

  What hot memory footprint (total size of hot clusters) should we
  expect when we have run out of inactive pages?

  Some numbers first: on large and heavily overcommitted systems, we
  have observed close to 90% during a scan. Those systems have
  millions of pages, and using the rmap to find out which pages to
  reclaim would just blow up kswapd. On smaller systems with less
  memory pressure (due to their weaker CPUs), this number is more
  reasonable, ~50%. Here are some kswapd profiles from a smaller
  system running 5.11:

   the rmap                                 page table scan
   ---------------------------------------------------------------------
   31.03%  page_vma_mapped_walk             49.36%  lzo1x_1_do_compress
   25.59%  lzo1x_1_do_compress               4.54%  page_vma_mapped_walk
    4.63%  do_raw_spin_lock                  4.45%  memset_erms
    3.89%  vma_interval_tree_iter_next       3.47%  walk_pte_range
    3.33%  vma_interval_tree_subtree_search  2.88%  zram_bvec_rw

  The page table scan is only twice as fast here. On larger systems,
  it's usually more than 4 times faster, without THP. With THP, both
  are negligible (<1% CPU usage). I can grab profiles from our servers
  too if you are interested in seeing them on a 4.15 kernel.

> > But, page tables can be sparse too, in terms of hot memory tracking.
> > Dave has asked me to test the worst case scenario, which I'll do.
> > And I'd be happy to share more data. Any specific workload you are
> > interested in?
> 
> We can start with some simple workloads that are easier to be reasoned.
> For example,
> 
> 1. Run the workload with hot and cold pages, when the free memory
> becomes lower than the low watermark, kswapd will be waken up to scan
> and reclaim some cold pages.  How long will it take to do that?  It's
> expected that almost all pages need to be scanned, so that page table

A typical scenario. Otherwise why would we have run out of cold pages
and still be under memory pressure? Because what's in memory is hot
and therefore most of them need to be scanned :)

> scanning is expected to have less overhead.  We can measure how well it
> is.

Sounds good to me.

> 2. Run the workload with hot and cold pages, if the whole working-set
> cannot fit in DRAM, that is, the cold pages will be reclaimed and
> swapped in regularly (for example tens MB/s).  It's expected that less
> pages may be scanned with rmap, but the speed of page table scanning is
> faster.

So IIUC, this is sustained memory pressure, i.e., servers constantly
running under memory pressure?

> 3. Run the workload with hot and cold pages, the system is
> overcommitted, that is, some cold pages will be placed in swap.  But the
> cold pages are cold enough, so there's almost no thrashing.  Then the
> hot working-set of the workload changes, that is, some hot pages become
> cold, while some cold pages becomes hot, so page reclaiming and swapin
> will be triggered.

This is usually what we see on clients, i.e., bursty workloads when
switching from an active app to an inactive one.

> For each cases, we can use some different parameters.  And we can
> measure something like the number of pages scanned, the time taken to
> scan them, the number of page reclaimed and swapped in, etc.

Thanks, I appreciate these -- very well-thought-out test cases. I'll
look into them and probably write some synthetic test cases. If you
have some already, I'd love to get my hands on them.
Huang, Ying March 24, 2021, 6:58 a.m. UTC | #10
Yu Zhao <yuzhao@google.com> writes:

> On Mon, Mar 22, 2021 at 11:13:19AM +0800, Huang, Ying wrote:
>> Yu Zhao <yuzhao@google.com> writes:
>> 
>> > On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
>> >> Yu Zhao <yuzhao@google.com> writes:
>> >> 
>> >> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
>> >> > The scanning overhead is only one of the two major problems of the
>> >> > current page reclaim. The other problem is the granularity of the
>> >> > active/inactive (sizes). We stopped using them in making job
>> >> > scheduling decision a long time ago. I know another large internet
>> >> > company adopted a similar approach as ours, and I'm wondering how
>> >> > everybody else is coping with the discrepancy from those counters.
>> >> 
>> >> From intuition, the scanning overhead of the full page table scanning
>> >> appears higher than that of the rmap scanning for a small portion of
>> >> system memory.  But form your words, you think the reality is the
>> >> reverse?  If others concern about the overhead too, finally, I think you
>> >> need to prove the overhead of the page table scanning isn't too higher,
>> >> or even lower with more data and theory.
>> >
>> > There is a misunderstanding here. I never said anything about full
>> > page table scanning. And this is not how it's done in this series
>> > either. I guess the misunderstanding has something to do with the cold
>> > memory tracking you are thinking about?
>> 
>> If my understanding were correct, from the following code path in your
>> patch 10/14,
>> 
>> age_active_anon
>>   age_lru_gens
>>     try_walk_mm_list
>>       walk_mm_list
>>         walk_mm
>> 
>> So, in kswapd(), the page tables of many processes may be scanned
>> fully.  If the number of processes that are active are high, the
>> overhead may be high too.
>
> That's correct. Just in case we have different definitions of what we
> call "full":
>
>   I understand it as the full range of the address space of a process
>   that was loaded by switch_mm() at least once since the last scan.
>   This is not the case because we don't scan the full range -- we skip
>   holes and VMAs that are unevictable, as well as PTE tables that have
>   no accessed entries on x86_64, by should_skip_vma() and
>   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG.
>
>   If you are referring to the full range of PTE tables that have at
>   least one accessed entry, i.e., other 511 are not none  but have not
>   been accessed either since the last scan on x86_64, then yes, you
>   are right again :) This is the worse case scenario.

OK.  So there's no fundamental difference between us on this.

>> > This series uses page tables to discover page accesses when a system
>> > has run out of inactive pages. Under such a situation, the system is
>> > very likely to have a lot of page accesses, and using the rmap is
>> > likely to cost a lot more because its poor memory locality compared
>> > with page tables.
>> 
>> This is the theory.  Can you verify this with more data?  Including the
>> CPU cycles or time spent scanning page tables?
>
> Yes, I'll be happy to do so as I should, because page table scanning
> is counterintuitive. Let me add more theory in case it's still unclear
> to others.
>
> From my understanding, the two fundamental questions we need to
> consider in terms of page reclaim are:
>
>   What are the sizes of hot clusters (spatial locality) should we
>   expect under memory pressure?
>
>   On smaller systems with 4GB memory, our observations are that the
>   average size of hot clusters found during each scan is 32KB. On
>   larger systems with hundreds of gigabytes of memory, it's well
>   above this value -- 512KB or larger. These values vary under
>   different workloads and with different memory allocators. Unless
>   done deliberately by memory allocators, e.g., Scudo as I've
>   mentioned earlier, it's safe to say if a PTE entry has been
>   accessed, its neighbors are likely to have been accessed too.
>
>   What's hot memory footprint (total size of hot clusters) should we
>   expect when we have run out of inactive pages?
>
>   Some numbers first: on large and heavily overcommitted systems, we
>   have observed close to 90% during a scan. Those systems have
>   millions of pages and using the rmap to find out which pages to
>   reclaim will just blow kswapd. On smaller systems with less memory
>   pressure (due to their weaker CPUs), this number is more reasonable,
>   ~50%. Here is some kswapd profiles from a smaller systems running
>   5.11:
>
>    the rmap                                 page table scan
>    ---------------------------------------------------------------------
>    31.03%  page_vma_mapped_walk             49.36%  lzo1x_1_do_compress
>    25.59%  lzo1x_1_do_compress               4.54%  page_vma_mapped_walk
>     4.63%  do_raw_spin_lock                  4.45%  memset_erms
>     3.89%  vma_interval_tree_iter_next       3.47%  walk_pte_range
>     3.33%  vma_interval_tree_subtree_search  2.88%  zram_bvec_rw
>
>   The page table scan is only twice as fast. Only larger systems,
>   it's usually more than 4 times, without THP. With THP, both are
>   negligible (<1% CPU usage). I can grab profiles from our servers
>   too if you are interested in seeing them on 4.15 kernel.

Yes.  On heavily overcommitted systems with a high percentage of hot
pages, page table scanning works much better, because almost all pages
(and their mappings) will be scanned eventually.

But on a not-so-heavily overcommitted system with a low percentage of
hot pages, it's possible that rmap scanning works better.  That is,
only a small fraction of the pages need to be scanned.  I know that
page table scanning may still work better in many cases.

And another possibility: on a system with cool instead of completely
cold pages, that is, some pages are accessed at quite a low frequency
but not 0, there will always be some low-bandwidth memory reclaiming.
That is, it's impossible to find a perfect solution with one or two
full scans; we need to reclaim some pages periodically.  And I guess
there are no perfect (or very good) page reclaiming solutions for some
other situations either.  What we can do there is,

- Avoid OOM, that is, reclaim some pages if possible.

- Control the overhead of the page reclaiming.

But this is theory only.  If anyone can point out that these
situations are not realistic at all, that's good too :-)

>> > But, page tables can be sparse too, in terms of hot memory tracking.
>> > Dave has asked me to test the worst case scenario, which I'll do.
>> > And I'd be happy to share more data. Any specific workload you are
>> > interested in?
>> 
>> We can start with some simple workloads that are easier to be reasoned.
>> For example,
>> 
>> 1. Run the workload with hot and cold pages, when the free memory
>> becomes lower than the low watermark, kswapd will be waken up to scan
>> and reclaim some cold pages.  How long will it take to do that?  It's
>> expected that almost all pages need to be scanned, so that page table
>
> A typical scenario. Otherwise why would we have run out of cold pages
> and still be under memory? Because what's in memory is hot and
> therefore most of the them need to be scanned :)
>
>> scanning is expected to have less overhead.  We can measure how well it
>> is.
>
> Sounds good to me.
>
>> 2. Run the workload with hot and cold pages, if the whole working-set
>> cannot fit in DRAM, that is, the cold pages will be reclaimed and
>> swapped in regularly (for example tens MB/s).  It's expected that less
>> pages may be scanned with rmap, but the speed of page table scanning is
>> faster.
>
> So IIUC, this is a sustained memory pressure, i.e., servers constantly
> running under memory pressure?

Yes.  The system can accommodate more workloads at the cost of
performance, as long as the end-user latency isn't unacceptable.  Or
we need some time to schedule more computing resources, so we need to
run in this condition for a while.

But again, this is theory only.  I'd be glad if people could tell me
that this is unrealistic.

>> 3. Run the workload with hot and cold pages, the system is
>> overcommitted, that is, some cold pages will be placed in swap.  But the
>> cold pages are cold enough, so there's almost no thrashing.  Then the
>> hot working-set of the workload changes, that is, some hot pages become
>> cold, while some cold pages becomes hot, so page reclaiming and swapin
>> will be triggered.
>
> This is usually what we see on clients, i.e., bursty workloads when
> switching from an active app to an inactive one.

Thanks for your information.  Now I know a typical realistic use case :-)

>> For each cases, we can use some different parameters.  And we can
>> measure something like the number of pages scanned, the time taken to
>> scan them, the number of page reclaimed and swapped in, etc.
>
> Thanks, I appreciate these -- very well thought test cases. I'll look
> into them and probably write some synthetic test cases. If you have
> some already, I'd love to get my hands one them.

Sorry.  I have no test cases in hand.  Maybe we can add some to
Fengguang's vm-scalability test suite:

https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/

Best Regards,
Huang, Ying
Yu Zhao April 10, 2021, 6:48 p.m. UTC | #11
On Wed, Mar 24, 2021 at 12:58 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > On Mon, Mar 22, 2021 at 11:13:19AM +0800, Huang, Ying wrote:
> >> Yu Zhao <yuzhao@google.com> writes:
> >>
> >> > On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
> >> >> Yu Zhao <yuzhao@google.com> writes:
> >> >>
> >> >> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
> >> >> > The scanning overhead is only one of the two major problems of the
> >> >> > current page reclaim. The other problem is the granularity of the
> >> >> > active/inactive (sizes). We stopped using them in making job
> >> >> > scheduling decision a long time ago. I know another large internet
> >> >> > company adopted a similar approach as ours, and I'm wondering how
> >> >> > everybody else is coping with the discrepancy from those counters.
> >> >>
> >> >> From intuition, the scanning overhead of the full page table scanning
> >> >> appears higher than that of the rmap scanning for a small portion of
> >> >> system memory.  But form your words, you think the reality is the
> >> >> reverse?  If others concern about the overhead too, finally, I think you
> >> >> need to prove the overhead of the page table scanning isn't too higher,
> >> >> or even lower with more data and theory.
> >> >
> >> > There is a misunderstanding here. I never said anything about full
> >> > page table scanning. And this is not how it's done in this series
> >> > either. I guess the misunderstanding has something to do with the cold
> >> > memory tracking you are thinking about?
> >>
> >> If my understanding were correct, from the following code path in your
> >> patch 10/14,
> >>
> >> age_active_anon
> >>   age_lru_gens
> >>     try_walk_mm_list
> >>       walk_mm_list
> >>         walk_mm
> >>
> >> So, in kswapd(), the page tables of many processes may be scanned
> >> fully.  If the number of processes that are active are high, the
> >> overhead may be high too.
> >
> > That's correct. Just in case we have different definitions of what we
> > call "full":
> >
> >   I understand it as the full range of the address space of a process
> >   that was loaded by switch_mm() at least once since the last scan.
> >   This is not the case because we don't scan the full range -- we skip
> >   holes and VMAs that are unevictable, as well as PTE tables that have
> >   no accessed entries on x86_64, by should_skip_vma() and
> >   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG.
> >
> >   If you are referring to the full range of PTE tables that have at
> >   least one accessed entry, i.e., other 511 are not none  but have not
> >   been accessed either since the last scan on x86_64, then yes, you
> >   are right again :) This is the worse case scenario.
>
> OK.  So there's no fundamental difference between us on this.
>
> >> > This series uses page tables to discover page accesses when a system
> >> > has run out of inactive pages. Under such a situation, the system is
> >> > very likely to have a lot of page accesses, and using the rmap is
> >> > likely to cost a lot more because its poor memory locality compared
> >> > with page tables.
> >>
> >> This is the theory.  Can you verify this with more data?  Including the
> >> CPU cycles or time spent scanning page tables?
> >
> > Yes, I'll be happy to do so as I should, because page table scanning
> > is counterintuitive. Let me add more theory in case it's still unclear
> > to others.
> >
> > From my understanding, the two fundamental questions we need to
> > consider in terms of page reclaim are:
> >
> >   What are the sizes of hot clusters (spatial locality) should we
> >   expect under memory pressure?
> >
> >   On smaller systems with 4GB memory, our observations are that the
> >   average size of hot clusters found during each scan is 32KB. On
> >   larger systems with hundreds of gigabytes of memory, it's well
> >   above this value -- 512KB or larger. These values vary under
> >   different workloads and with different memory allocators. Unless
> >   done deliberately by memory allocators, e.g., Scudo as I've
> >   mentioned earlier, it's safe to say if a PTE entry has been
> >   accessed, its neighbors are likely to have been accessed too.
> >
> >   What's hot memory footprint (total size of hot clusters) should we
> >   expect when we have run out of inactive pages?
> >
> >   Some numbers first: on large and heavily overcommitted systems, we
> >   have observed close to 90% during a scan. Those systems have
> >   millions of pages and using the rmap to find out which pages to
> >   reclaim will just blow kswapd. On smaller systems with less memory
> >   pressure (due to their weaker CPUs), this number is more reasonable,
> >   ~50%. Here is some kswapd profiles from a smaller systems running
> >   5.11:
> >
> >    the rmap                                 page table scan
> >    ---------------------------------------------------------------------
> >    31.03%  page_vma_mapped_walk             49.36%  lzo1x_1_do_compress
> >    25.59%  lzo1x_1_do_compress               4.54%  page_vma_mapped_walk
> >     4.63%  do_raw_spin_lock                  4.45%  memset_erms
> >     3.89%  vma_interval_tree_iter_next       3.47%  walk_pte_range
> >     3.33%  vma_interval_tree_subtree_search  2.88%  zram_bvec_rw
> >
> >   The page table scan is only twice as fast. Only larger systems,
> >   it's usually more than 4 times, without THP. With THP, both are
> >   negligible (<1% CPU usage). I can grab profiles from our servers
> >   too if you are interested in seeing them on 4.15 kernel.
>
> Yes.  On a heavily overcommitted systems with high-percent hot pages,
> the page table scanning works much better.  Because almost all pages
> (and their mappings) will be scanned finally.
>
> But on a not-so-heavily overcommitted system with low-percent hot pages,
> it's possible that rmap scanning works better.  That is, only a small
> fraction of the pages need to be scanned.  I know that the page table
> scanning may still work better in many cases.
>
> And another possibility, on a system with cool instead of completely
> cold pages, that is, some pages are accessed at quite low frequency, but
> not 0, there will be always some low-bandwidth memory reclaiming.  That
> is, it's impossible to find a perfect solution with one or two full
> scanning.  But we need to reclaim some pages periodically.  And I guess
> there are no perfect (or very good) page reclaiming solutions for some
> other situations too. Where what we can do are,
>
> - Avoid OOM, that is, reclaim some pages if possible.
>
> - Control the overhead of the page reclaiming.
>
> But this is theory only.  If anyone can point out that they are not
> realistic at all, it's good too :-)
>
> >> > But, page tables can be sparse too, in terms of hot memory tracking.
> >> > Dave has asked me to test the worst case scenario, which I'll do.
> >> > And I'd be happy to share more data. Any specific workload you are
> >> > interested in?
> >>
> >> We can start with some simple workloads that are easier to be reasoned.
> >> For example,
> >>
> >> 1. Run the workload with hot and cold pages, when the free memory
> >> becomes lower than the low watermark, kswapd will be waken up to scan
> >> and reclaim some cold pages.  How long will it take to do that?  It's
> >> expected that almost all pages need to be scanned, so that page table
> >
> > A typical scenario. Otherwise why would we have run out of cold pages
> > and still be under memory? Because what's in memory is hot and
> > therefore most of the them need to be scanned :)
> >
> >> scanning is expected to have less overhead.  We can measure how well
> >> it performs.
> >
> > Sounds good to me.
> >
> >> 2. Run the workload with hot and cold pages, if the whole working-set
> >> cannot fit in DRAM, that is, the cold pages will be reclaimed and
> >> swapped in regularly (for example tens of MB/s).  It's expected that
> >> fewer pages may be scanned with the rmap, but page table scanning is
> >> faster.
> >
> > So IIUC, this is a sustained memory pressure, i.e., servers constantly
> > running under memory pressure?
>
> Yes.  The system can accommodate more workloads at the cost of
> performance, as long as the end-user latency isn't unacceptable.  Or we
> need some time to schedule more computing resources, so we have to run
> in this condition for a while.
>
> But again, this is theory only.  I am glad if people can tell me that
> this is unrealistic.
>
> >> 3. Run the workload with hot and cold pages, the system is
> >> overcommitted, that is, some cold pages will be placed in swap.  But the
> >> cold pages are cold enough, so there's almost no thrashing.  Then the
> >> hot working-set of the workload changes, that is, some hot pages become
> >> cold, while some cold pages become hot, so page reclaiming and swapin
> >> will be triggered.
> >
> > This is usually what we see on clients, i.e., bursty workloads when
> > switching from an active app to an inactive one.
>
> Thanks for your information.  Now I know a typical realistic use case :-)
>
> >> For each case, we can use some different parameters.  And we can
> >> measure something like the number of pages scanned, the time taken to
> >> scan them, the number of page reclaimed and swapped in, etc.
> >
> > Thanks, I appreciate these -- very well-thought-out test cases. I'll
> > look into them and probably write some synthetic test cases. If you
> > have some already, I'd love to get my hands on them.
>
> Sorry.  I have no test cases in hand.  Maybe we can add some into
> Fengguang's vm-scalability test suite as follows.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/

Hi Ying,

I'm still investigating the test cases you suggested. I'm also
wondering if it's possible to test the next version, which I'll post
soon, with Intel's 0-Day infra.
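
For illustration, test case 1 above (a hot plus a cold anonymous
working set that pushes free memory below the low watermark, so kswapd
has to find and reclaim the cold half) could boil down to something
like the minimal sketch below. HOT_BYTES and COLD_BYTES are
hypothetical placeholders to be tuned per machine, and this isn't code
from the series or from vm-scalability; kswapd activity (e.g.
pgscan_kswapd and pgsteal_kswapd in /proc/vmstat) could then be sampled
around the hot loop.

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* hypothetical sizes; pick them so hot + cold exceeds available DRAM */
#define HOT_BYTES	(1UL << 30)
#define COLD_BYTES	(4UL << 30)

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *hot = mmap(NULL, HOT_BYTES, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *cold = mmap(NULL, COLD_BYTES, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (hot == MAP_FAILED || cold == MAP_FAILED)
		return 1;

	/* fault in the cold half once; it is what kswapd should reclaim */
	for (size_t i = 0; i < COLD_BYTES; i += page)
		cold[i] = 1;

	/* keep re-referencing the hot half so its accessed bits stay set */
	for (int loop = 0; loop < 100; loop++)
		for (size_t i = 0; i < HOT_BYTES; i += page)
			hot[i]++;

	return 0;
}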

Thanks.
Huang, Ying April 13, 2021, 3:06 a.m. UTC | #12
Yu Zhao <yuzhao@google.com> writes:

> On Wed, Mar 24, 2021 at 12:58 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yu Zhao <yuzhao@google.com> writes:
>>
>> > On Mon, Mar 22, 2021 at 11:13:19AM +0800, Huang, Ying wrote:
>> >> Yu Zhao <yuzhao@google.com> writes:
>> >>
>> >> > On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
>> >> >> Yu Zhao <yuzhao@google.com> writes:
>> >> >>
>> >> >> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
>> >> >> > The scanning overhead is only one of the two major problems of the
>> >> >> > current page reclaim. The other problem is the granularity of the
>> >> >> > active/inactive (sizes). We stopped using them in making job
>> >> >> > scheduling decisions a long time ago. I know another large internet
>> >> >> > company adopted a similar approach to ours, and I'm wondering how
>> >> >> > everybody else is coping with the discrepancy from those counters.
>> >> >>
>> >> >> Intuitively, the scanning overhead of full page table scanning
>> >> >> appears higher than that of rmap scanning for a small portion of
>> >> >> system memory.  But from your words, you think the reality is the
>> >> >> reverse?  If others are concerned about the overhead too, then I
>> >> >> think you eventually need to prove, with more data and theory, that
>> >> >> the overhead of page table scanning isn't too high, or is even lower.
>> >> >
>> >> > There is a misunderstanding here. I never said anything about full
>> >> > page table scanning. And this is not how it's done in this series
>> >> > either. I guess the misunderstanding has something to do with the cold
>> >> > memory tracking you are thinking about?
>> >>
>> >> If my understanding were correct, from the following code path in your
>> >> patch 10/14,
>> >>
>> >> age_active_anon
>> >>   age_lru_gens
>> >>     try_walk_mm_list
>> >>       walk_mm_list
>> >>         walk_mm
>> >>
>> >> So, in kswapd(), the page tables of many processes may be scanned
>> >> fully.  If the number of processes that are active is high, the
>> >> overhead may be high too.
>> >
>> > That's correct. Just in case we have different definitions of what we
>> > call "full":
>> >
>> >   I understand it as the full range of the address space of a process
>> >   that was loaded by switch_mm() at least once since the last scan.
>> >   This is not the case because we don't scan the full range -- we skip
>> >   holes and VMAs that are unevictable, as well as PTE tables that have
>> >   no accessed entries on x86_64, by should_skip_vma() and
>> >   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG.
>> >
>> >   If you are referring to the full range of PTE tables that have at
>> >   least one accessed entry, i.e., the other 511 are not none but have
>> >   not been accessed either since the last scan on x86_64, then yes,
>> >   you are right again :) This is the worst-case scenario.
>>
>> OK.  So there's no fundamental difference between us on this.
>>
>> >> > This series uses page tables to discover page accesses when a system
>> >> > has run out of inactive pages. Under such a situation, the system is
>> >> > very likely to have a lot of page accesses, and using the rmap is
>> >> > likely to cost a lot more because of its poor memory locality compared
>> >> > with page tables.
>> >>
>> >> This is the theory.  Can you verify it with more data, including the
>> >> CPU cycles or time spent scanning page tables?
>> >
>> > Yes, I'll be happy to do so as I should, because page table scanning
>> > is counterintuitive. Let me add more theory in case it's still unclear
>> > to others.
>> >
>> > From my understanding, the two fundamental questions we need to
>> > consider in terms of page reclaim are:
>> >
>> >   What sizes of hot clusters (spatial locality) should we
>> >   expect under memory pressure?
>> >
>> >   On smaller systems with 4GB memory, our observations are that the
>> >   average size of hot clusters found during each scan is 32KB. On
>> >   larger systems with hundreds of gigabytes of memory, it's well
>> >   above this value -- 512KB or larger. These values vary under
>> >   different workloads and with different memory allocators. Unless
>> >   done deliberately by memory allocators, e.g., Scudo as I've
>> >   mentioned earlier, it's safe to say if a PTE entry has been
>> >   accessed, its neighbors are likely to have been accessed too.
>> >
>> >   What hot memory footprint (total size of hot clusters) should we
>> >   expect when we have run out of inactive pages?
>> >
>> >   Some numbers first: on large and heavily overcommitted systems, we
>> >   have observed close to 90% during a scan. Those systems have
>> >   millions of pages and using the rmap to find out which pages to
>> >   reclaim will just overwhelm kswapd. On smaller systems with less
>> >   memory pressure (due to their weaker CPUs), this number is more
>> >   reasonable, ~50%. Here are some kswapd profiles from a smaller
>> >   system running 5.11:
>> >
>> >    the rmap                                 page table scan
>> >    ---------------------------------------------------------------------
>> >    31.03%  page_vma_mapped_walk             49.36%  lzo1x_1_do_compress
>> >    25.59%  lzo1x_1_do_compress               4.54%  page_vma_mapped_walk
>> >     4.63%  do_raw_spin_lock                  4.45%  memset_erms
>> >     3.89%  vma_interval_tree_iter_next       3.47%  walk_pte_range
>> >     3.33%  vma_interval_tree_subtree_search  2.88%  zram_bvec_rw
>> >
>> >   The page table scan is only twice as fast. On larger systems, it's
>> >   usually more than 4 times faster, without THP. With THP, both are
>> >   negligible (<1% CPU usage). I can grab profiles from our servers
>> >   too if you are interested in seeing them on the 4.15 kernel.
>>
>> Yes.  On heavily overcommitted systems with a high percentage of hot
>> pages, page table scanning works much better, because almost all pages
>> (and their mappings) will eventually be scanned.
>>
>> But on a not-so-heavily overcommitted system with a low percentage of
>> hot pages, it's possible that rmap scanning works better.  That is,
>> only a small fraction of the pages need to be scanned.  I know that
>> page table scanning may still work better in many cases.
>>
>> And another possibility: on a system with cool instead of completely
>> cold pages, that is, some pages are accessed at quite a low frequency
>> but not zero, there will always be some low-bandwidth memory
>> reclaiming.  That is, it's impossible to find a perfect solution with
>> one or two full scans, but we still need to reclaim some pages
>> periodically.  And I guess there are no perfect (or even very good)
>> page reclaiming solutions for some other situations too.  What we can
>> do is:
>>
>> - Avoid OOM, that is, reclaim some pages if possible.
>>
>> - Control the overhead of the page reclaiming.
>>
>> But this is theory only.  If anyone can point out that they are not
>> realistic at all, it's good too :-)
>>
>> >> > But, page tables can be sparse too, in terms of hot memory tracking.
>> >> > Dave has asked me to test the worst case scenario, which I'll do.
>> >> > And I'd be happy to share more data. Any specific workload you are
>> >> > interested in?
>> >>
>> >> We can start with some simple workloads that are easier to reason about.
>> >> For example,
>> >>
>> >> 1. Run the workload with hot and cold pages, when the free memory
>> >> becomes lower than the low watermark, kswapd will be woken up to scan
>> >> and reclaim some cold pages.  How long will it take to do that?  It's
>> >> expected that almost all pages need to be scanned, so that page table
>> >
>> > A typical scenario. Otherwise why would we have run out of cold pages
>> > and still be under memory pressure? Because what's in memory is hot
>> > and therefore most of them need to be scanned :)
>> >
>> >> scanning is expected to have less overhead.  We can measure how well
>> >> it performs.
>> >
>> > Sounds good to me.
>> >
>> >> 2. Run the workload with hot and cold pages, if the whole working-set
>> >> cannot fit in DRAM, that is, the cold pages will be reclaimed and
>> >> swapped in regularly (for example tens of MB/s).  It's expected that
>> >> fewer pages may be scanned with the rmap, but page table scanning is
>> >> faster.
>> >
>> > So IIUC, this is a sustained memory pressure, i.e., servers constantly
>> > running under memory pressure?
>>
>> Yes.  The system can accommodate more workloads at the cost of
>> performance, as long as the end-user latency isn't unacceptable.  Or we
>> need some time to schedule more computing resources, so we have to run
>> in this condition for a while.
>>
>> But again, this is theory only.  I am glad if people can tell me that
>> this is unrealistic.
>>
>> >> 3. Run the workload with hot and cold pages, the system is
>> >> overcommitted, that is, some cold pages will be placed in swap.  But the
>> >> cold pages are cold enough, so there's almost no thrashing.  Then the
>> >> hot working-set of the workload changes, that is, some hot pages become
>> >> cold, while some cold pages become hot, so page reclaiming and swapin
>> >> will be triggered.
>> >
>> > This is usually what we see on clients, i.e., bursty workloads when
>> > switching from an active app to an inactive one.
>>
>> Thanks for your information.  Now I know a typical realistic use case :-)
>>
>> >> For each case, we can use some different parameters.  And we can
>> >> measure something like the number of pages scanned, the time taken to
>> >> scan them, the number of page reclaimed and swapped in, etc.
>> >
>> > Thanks, I appreciate these -- very well-thought-out test cases. I'll
>> > look into them and probably write some synthetic test cases. If you
>> > have some already, I'd love to get my hands on them.
>>
>> Sorry.  I have no test cases in hand.  Maybe we can add some into
>> Fengguang's vm-scalability test suite as follows.
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/
>
> Hi Ying,
>
> I'm still investigating the test cases you suggested. I'm also
> wondering if it's possible to test the next version, which I'll post
> soon, with Intel's 0-Day infra.

Sure.  But for now 0-Day has only quite limited coverage for swap
testing: the swap test in vm-scalability.git and several test cases
with pmbench.  I think it would be good to improve 0-Day's coverage for
swap, but that will take some time.

Best Regards,
Huang, Ying

Patch

diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..c691d4d7720c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1008,6 +1008,7 @@  static int exec_mmap(struct mm_struct *mm)
 	active_mm = tsk->active_mm;
 	tsk->active_mm = mm;
 	tsk->mm = mm;
+	lru_gen_add_mm(mm);
 	/*
 	 * This prevents preemption while active_mm is being loaded and
 	 * it and mm are being updated, which could cause problems for
@@ -1018,6 +1019,7 @@  static int exec_mmap(struct mm_struct *mm)
 	if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
 		local_irq_enable();
 	activate_mm(active_mm, mm);
+	lru_gen_switch_mm(active_mm, mm);
 	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
 		local_irq_enable();
 	tsk->mm->vmacache_seqnum = 0;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f325aeb4b4e8..591557c5b7e2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -335,6 +335,10 @@  struct mem_cgroup {
 	struct deferred_split deferred_split_queue;
 #endif
 
+#ifdef CONFIG_LRU_GEN
+	struct lru_gen_mm_list *mm_list;
+#endif
+
 	struct mem_cgroup_per_node *nodeinfo[0];
 	/* WARNING: nodeinfo must be the last member here */
 };
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0974ad501a47..b8a038a016f2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -15,6 +15,8 @@ 
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
+#include <linux/nodemask.h>
+#include <linux/mmdebug.h>
 
 #include <asm/mmu.h>
 
@@ -382,6 +384,8 @@  struct core_state {
 	struct completion startup;
 };
 
+#define ANON_AND_FILE 2
+
 struct kioctx_table;
 struct mm_struct {
 	struct {
@@ -560,6 +564,22 @@  struct mm_struct {
 
 #ifdef CONFIG_IOMMU_SUPPORT
 		u32 pasid;
+#endif
+#ifdef CONFIG_LRU_GEN
+		struct {
+			/* node of a global or per-memcg mm list */
+			struct list_head list;
+#ifdef CONFIG_MEMCG
+			/* points to memcg of the owner task above */
+			struct mem_cgroup *memcg;
+#endif
+			/* indicates this mm has been used since last walk */
+			nodemask_t nodes[ANON_AND_FILE];
+#ifndef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+			/* number of cpus that are using this mm */
+			atomic_t nr_cpus;
+#endif
+		} lru_gen;
 #endif
 	} __randomize_layout;
 
@@ -587,6 +607,121 @@  static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 	return (struct cpumask *)&mm->cpu_bitmap;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+struct lru_gen_mm_list {
+	/* head of a global or per-memcg mm list */
+	struct list_head head;
+	/* protects the list */
+	spinlock_t lock;
+	struct {
+		/* set to max_seq after each round of walk */
+		unsigned long cur_seq;
+		/* next mm on the list to walk */
+		struct list_head *iter;
+		/* to wait for last worker to finish */
+		struct wait_queue_head wait;
+		/* number of concurrent workers */
+		int nr_workers;
+	} nodes[0];
+};
+
+void lru_gen_init_mm(struct mm_struct *mm);
+void lru_gen_add_mm(struct mm_struct *mm);
+void lru_gen_del_mm(struct mm_struct *mm);
+#ifdef CONFIG_MEMCG
+int lru_gen_alloc_mm_list(struct mem_cgroup *memcg);
+void lru_gen_free_mm_list(struct mem_cgroup *memcg);
+void lru_gen_migrate_mm(struct mm_struct *mm);
+#endif
+
+/*
+ * Track usage so mms that haven't been used since last walk can be skipped.
+ *
+ * This function introduces a theoretical overhead for each mm switch, but it
+ * hasn't been measurable.
+ */
+static inline void lru_gen_switch_mm(struct mm_struct *old, struct mm_struct *new)
+{
+	int file;
+
+	/* exclude init_mm, efi_mm, etc. */
+	if (!core_kernel_data((unsigned long)old)) {
+		VM_BUG_ON(old == &init_mm);
+
+		for (file = 0; file < ANON_AND_FILE; file++)
+			nodes_setall(old->lru_gen.nodes[file]);
+
+#ifndef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+		atomic_dec(&old->lru_gen.nr_cpus);
+		VM_BUG_ON_MM(atomic_read(&old->lru_gen.nr_cpus) < 0, old);
+#endif
+	} else
+		VM_BUG_ON_MM(READ_ONCE(old->lru_gen.list.prev) ||
+			     READ_ONCE(old->lru_gen.list.next), old);
+
+	if (!core_kernel_data((unsigned long)new)) {
+		VM_BUG_ON(new == &init_mm);
+
+#ifndef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+		atomic_inc(&new->lru_gen.nr_cpus);
+		VM_BUG_ON_MM(atomic_read(&new->lru_gen.nr_cpus) < 0, new);
+#endif
+	} else
+		VM_BUG_ON_MM(READ_ONCE(new->lru_gen.list.prev) ||
+			     READ_ONCE(new->lru_gen.list.next), new);
+}
+
+/* Returns whether the mm is being used on any cpus. */
+static inline bool lru_gen_mm_is_active(struct mm_struct *mm)
+{
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+	return !cpumask_empty(mm_cpumask(mm));
+#else
+	return atomic_read(&mm->lru_gen.nr_cpus);
+#endif
+}
+
+#else /* CONFIG_LRU_GEN */
+
+static inline void lru_gen_init_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_add_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_del_mm(struct mm_struct *mm)
+{
+}
+
+#ifdef CONFIG_MEMCG
+static inline int lru_gen_alloc_mm_list(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
+static inline void lru_gen_free_mm_list(struct mem_cgroup *memcg)
+{
+}
+
+static inline void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+}
+#endif
+
+static inline void lru_gen_switch_mm(struct mm_struct *old, struct mm_struct *new)
+{
+}
+
+static inline bool lru_gen_mm_is_active(struct mm_struct *mm)
+{
+	return false;
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 47946cec7584..a99a1050565a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -285,8 +285,6 @@  static inline bool is_active_lru(enum lru_list lru)
 	return (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE);
 }
 
-#define ANON_AND_FILE 2
-
 enum lruvec_flags {
 	LRUVEC_CONGESTED,		/* lruvec has many dirty pages
 					 * backed by a congested BDI
diff --git a/kernel/exit.c b/kernel/exit.c
index 04029e35e69a..e4292717ce37 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -422,6 +422,7 @@  void mm_update_next_owner(struct mm_struct *mm)
 		goto retry;
 	}
 	WRITE_ONCE(mm->owner, c);
+	lru_gen_migrate_mm(mm);
 	task_unlock(c);
 	put_task_struct(c);
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index d3171e8e88e5..e261b797955d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -665,6 +665,7 @@  static void check_mm(struct mm_struct *mm)
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	VM_BUG_ON_MM(mm->pmd_huge_pte, mm);
 #endif
+	VM_BUG_ON_MM(lru_gen_mm_is_active(mm), mm);
 }
 
 #define allocate_mm()	(kmem_cache_alloc(mm_cachep, GFP_KERNEL))
@@ -1047,6 +1048,7 @@  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 		goto fail_nocontext;
 
 	mm->user_ns = get_user_ns(user_ns);
+	lru_gen_init_mm(mm);
 	return mm;
 
 fail_nocontext:
@@ -1089,6 +1091,7 @@  static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+	lru_gen_del_mm(mm);
 	mmdrop(mm);
 }
 
@@ -2513,6 +2516,13 @@  pid_t kernel_clone(struct kernel_clone_args *args)
 		get_task_struct(p);
 	}
 
+	if (IS_ENABLED(CONFIG_LRU_GEN) && !(clone_flags & CLONE_VM)) {
+		/* lock p to synchronize with memcg migration */
+		task_lock(p);
+		lru_gen_add_mm(p->mm);
+		task_unlock(p);
+	}
+
 	wake_up_new_task(p);
 
 	/* forking complete and child started to run, tell ptracer */
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 1578973c5740..8da7767bb06a 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1303,6 +1303,7 @@  void kthread_use_mm(struct mm_struct *mm)
 	tsk->mm = mm;
 	membarrier_update_current_mm(mm);
 	switch_mm_irqs_off(active_mm, mm, tsk);
+	lru_gen_switch_mm(active_mm, mm);
 	local_irq_enable();
 	task_unlock(tsk);
 #ifdef finish_arch_post_lock_switch
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca2bb629595f..56274a14ce09 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4308,6 +4308,7 @@  context_switch(struct rq *rq, struct task_struct *prev,
 		 * finish_task_switch()'s mmdrop().
 		 */
 		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+		lru_gen_switch_mm(prev->active_mm, next->mm);
 
 		if (!prev->mm) {                        // from kernel
 			/* will mmdrop() in finish_task_switch(). */
@@ -7599,6 +7600,7 @@  void idle_task_exit(void)
 
 	if (mm != &init_mm) {
 		switch_mm(mm, &init_mm, current);
+		lru_gen_switch_mm(mm, &init_mm);
 		finish_arch_post_lock_switch();
 	}
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 845eec01ef9d..5836780fe138 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5209,6 +5209,7 @@  static void __mem_cgroup_free(struct mem_cgroup *memcg)
 		free_mem_cgroup_per_node_info(memcg, node);
 	free_percpu(memcg->vmstats_percpu);
 	free_percpu(memcg->vmstats_local);
+	lru_gen_free_mm_list(memcg);
 	kfree(memcg);
 }
 
@@ -5261,6 +5262,9 @@  static struct mem_cgroup *mem_cgroup_alloc(void)
 		if (alloc_mem_cgroup_per_node_info(memcg, node))
 			goto fail;
 
+	if (lru_gen_alloc_mm_list(memcg))
+		goto fail;
+
 	if (memcg_wb_domain_init(memcg, GFP_KERNEL))
 		goto fail;
 
@@ -6165,6 +6169,29 @@  static void mem_cgroup_move_task(void)
 }
 #endif
 
+#ifdef CONFIG_LRU_GEN
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *css;
+	struct task_struct *task = NULL;
+
+	cgroup_taskset_for_each_leader(task, css, tset)
+		;
+
+	if (!task)
+		return;
+
+	task_lock(task);
+	if (task->mm && task->mm->owner == task)
+		lru_gen_migrate_mm(task->mm);
+	task_unlock(task);
+}
+#else
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+}
+#endif
+
 static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
 {
 	if (value == PAGE_COUNTER_MAX)
@@ -6505,6 +6532,7 @@  struct cgroup_subsys memory_cgrp_subsys = {
 	.css_free = mem_cgroup_css_free,
 	.css_reset = mem_cgroup_css_reset,
 	.can_attach = mem_cgroup_can_attach,
+	.attach = mem_cgroup_attach,
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.post_attach = mem_cgroup_move_task,
 	.dfl_cftypes = memory_files,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1a24d2e0a4cb..f7657ab0d4b7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4314,3 +4314,266 @@  void check_move_unevictable_pages(struct pagevec *pvec)
 	}
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
+
+#ifdef CONFIG_LRU_GEN
+
+/******************************************************************************
+ *                           global and per-memcg mm list
+ ******************************************************************************/
+
+/*
+ * After pages are faulted in, they become the youngest generation. They must
+ * go through the aging process twice before they can be evicted. After the
+ * first scan, the accessed bit set during the initial faults is cleared and
+ * they become the second youngest generation. The second scan then makes sure
+ * they haven't been used since the first.
+ */
+#define MIN_NR_GENS 2
+
+static struct lru_gen_mm_list *global_mm_list;
+
+static struct lru_gen_mm_list *alloc_mm_list(void)
+{
+	int nid;
+	struct lru_gen_mm_list *mm_list;
+
+	mm_list = kzalloc(struct_size(mm_list, nodes, nr_node_ids), GFP_KERNEL);
+	if (!mm_list)
+		return NULL;
+
+	INIT_LIST_HEAD(&mm_list->head);
+	spin_lock_init(&mm_list->lock);
+
+	for_each_node(nid) {
+		mm_list->nodes[nid].cur_seq = MIN_NR_GENS - 1;
+		mm_list->nodes[nid].iter = &mm_list->head;
+		init_waitqueue_head(&mm_list->nodes[nid].wait);
+	}
+
+	return mm_list;
+}
+
+static struct lru_gen_mm_list *get_mm_list(struct mem_cgroup *memcg)
+{
+#ifdef CONFIG_MEMCG
+	if (!mem_cgroup_disabled())
+		return memcg ? memcg->mm_list : root_mem_cgroup->mm_list;
+#endif
+	VM_BUG_ON(memcg);
+
+	return global_mm_list;
+}
+
+void lru_gen_init_mm(struct mm_struct *mm)
+{
+	int file;
+
+	INIT_LIST_HEAD(&mm->lru_gen.list);
+#ifdef CONFIG_MEMCG
+	mm->lru_gen.memcg = NULL;
+#endif
+#ifndef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+	atomic_set(&mm->lru_gen.nr_cpus, 0);
+#endif
+	for (file = 0; file < ANON_AND_FILE; file++)
+		nodes_clear(mm->lru_gen.nodes[file]);
+}
+
+void lru_gen_add_mm(struct mm_struct *mm)
+{
+	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+
+	VM_BUG_ON_MM(!list_empty(&mm->lru_gen.list), mm);
+#ifdef CONFIG_MEMCG
+	VM_BUG_ON_MM(mm->lru_gen.memcg, mm);
+	WRITE_ONCE(mm->lru_gen.memcg, memcg);
+#endif
+	spin_lock(&mm_list->lock);
+	list_add_tail(&mm->lru_gen.list, &mm_list->head);
+	spin_unlock(&mm_list->lock);
+}
+
+void lru_gen_del_mm(struct mm_struct *mm)
+{
+	int nid;
+#ifdef CONFIG_MEMCG
+	struct lru_gen_mm_list *mm_list = get_mm_list(mm->lru_gen.memcg);
+#else
+	struct lru_gen_mm_list *mm_list = get_mm_list(NULL);
+#endif
+
+	spin_lock(&mm_list->lock);
+
+	for_each_node(nid) {
+		if (mm_list->nodes[nid].iter != &mm->lru_gen.list)
+			continue;
+
+		mm_list->nodes[nid].iter = mm_list->nodes[nid].iter->next;
+		if (mm_list->nodes[nid].iter == &mm_list->head)
+			WRITE_ONCE(mm_list->nodes[nid].cur_seq,
+				   mm_list->nodes[nid].cur_seq + 1);
+	}
+
+	list_del_init(&mm->lru_gen.list);
+
+	spin_unlock(&mm_list->lock);
+
+#ifdef CONFIG_MEMCG
+	mem_cgroup_put(mm->lru_gen.memcg);
+	WRITE_ONCE(mm->lru_gen.memcg, NULL);
+#endif
+}
+
+#ifdef CONFIG_MEMCG
+int lru_gen_alloc_mm_list(struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_disabled())
+		return 0;
+
+	memcg->mm_list = alloc_mm_list();
+
+	return memcg->mm_list ? 0 : -ENOMEM;
+}
+
+void lru_gen_free_mm_list(struct mem_cgroup *memcg)
+{
+	kfree(memcg->mm_list);
+	memcg->mm_list = NULL;
+}
+
+void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+	struct mem_cgroup *memcg;
+
+	lockdep_assert_held(&mm->owner->alloc_lock);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(mm->owner);
+	rcu_read_unlock();
+	if (memcg == mm->lru_gen.memcg)
+		return;
+
+	VM_BUG_ON_MM(!mm->lru_gen.memcg, mm);
+	VM_BUG_ON_MM(list_empty(&mm->lru_gen.list), mm);
+
+	lru_gen_del_mm(mm);
+	lru_gen_add_mm(mm);
+}
+
+static bool mm_has_migrated(struct mm_struct *mm, struct mem_cgroup *memcg)
+{
+	return READ_ONCE(mm->lru_gen.memcg) != memcg;
+}
+#else
+static bool mm_has_migrated(struct mm_struct *mm, struct mem_cgroup *memcg)
+{
+	return false;
+}
+#endif
+
+static bool should_skip_mm(struct mm_struct *mm, int nid, int swappiness)
+{
+	int file;
+	unsigned long size = 0;
+
+	if (mm_is_oom_victim(mm))
+		return true;
+
+	for (file = !swappiness; file < ANON_AND_FILE; file++) {
+		if (lru_gen_mm_is_active(mm) || node_isset(nid, mm->lru_gen.nodes[file]))
+			size += file ? get_mm_counter(mm, MM_FILEPAGES) :
+				       get_mm_counter(mm, MM_ANONPAGES) +
+				       get_mm_counter(mm, MM_SHMEMPAGES);
+	}
+
+	if (size < SWAP_CLUSTER_MAX)
+		return true;
+
+	return !mmget_not_zero(mm);
+}
+
+/* To support multiple workers that concurrently walk mm list. */
+static bool get_next_mm(struct lruvec *lruvec, unsigned long next_seq,
+			int swappiness, struct mm_struct **iter)
+{
+	bool last = true;
+	struct mm_struct *mm = NULL;
+	int nid = lruvec_pgdat(lruvec)->node_id;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+
+	if (*iter)
+		mmput_async(*iter);
+	else if (next_seq <= READ_ONCE(mm_list->nodes[nid].cur_seq))
+		return false;
+
+	spin_lock(&mm_list->lock);
+
+	VM_BUG_ON(next_seq > mm_list->nodes[nid].cur_seq + 1);
+	VM_BUG_ON(*iter && next_seq < mm_list->nodes[nid].cur_seq);
+	VM_BUG_ON(*iter && !mm_list->nodes[nid].nr_workers);
+
+	if (next_seq <= mm_list->nodes[nid].cur_seq) {
+		last = *iter;
+		goto done;
+	}
+
+	if (mm_list->nodes[nid].iter == &mm_list->head) {
+		VM_BUG_ON(*iter || mm_list->nodes[nid].nr_workers);
+		mm_list->nodes[nid].iter = mm_list->nodes[nid].iter->next;
+	}
+
+	while (!mm && mm_list->nodes[nid].iter != &mm_list->head) {
+		mm = list_entry(mm_list->nodes[nid].iter, struct mm_struct, lru_gen.list);
+		mm_list->nodes[nid].iter = mm_list->nodes[nid].iter->next;
+		if (should_skip_mm(mm, nid, swappiness))
+			mm = NULL;
+	}
+
+	if (mm_list->nodes[nid].iter == &mm_list->head)
+		WRITE_ONCE(mm_list->nodes[nid].cur_seq,
+			   mm_list->nodes[nid].cur_seq + 1);
+done:
+	if (*iter && !mm)
+		mm_list->nodes[nid].nr_workers--;
+	if (!*iter && mm)
+		mm_list->nodes[nid].nr_workers++;
+
+	last = last && !mm_list->nodes[nid].nr_workers &&
+	       mm_list->nodes[nid].iter == &mm_list->head;
+
+	spin_unlock(&mm_list->lock);
+
+	*iter = mm;
+
+	return last;
+}
+
+/******************************************************************************
+ *                          initialization
+ ******************************************************************************/
+
+static int __init init_lru_gen(void)
+{
+	if (mem_cgroup_disabled()) {
+		global_mm_list = alloc_mm_list();
+		if (!global_mm_list) {
+			pr_err("lru_gen: failed to allocate global mm list\n");
+			return -ENOMEM;
+		}
+	}
+
+	return 0;
+}
+/*
+ * We want to run as early as possible because some debug code, e.g.,
+ * dma_resv_lockdep(), calls mm_alloc() and mmput(). We only depend on mm_kobj,
+ * which is initialized one stage earlier by postcore_initcall().
+ */
+arch_initcall(init_lru_gen);
+
+#endif /* CONFIG_LRU_GEN */
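
For reference, based on the call chain quoted above (age_lru_gens ->
try_walk_mm_list -> walk_mm_list -> walk_mm in patch 10/14), a worker
is expected to consume get_next_mm() roughly as in the sketch below.
This is illustrative only: walk_mm_one() is a hypothetical stand-in for
the real page table walk, the wait condition is simplified, and the
actual caller in the later patch may differ in detail.

/* Sketch of one worker draining the per-node mm list; not part of this patch. */
static void walk_mm_list_sketch(struct lruvec *lruvec, unsigned long next_seq,
				int swappiness)
{
	bool last;
	struct mm_struct *mm = NULL;
	int nid = lruvec_pgdat(lruvec)->node_id;
	struct lru_gen_mm_list *mm_list = get_mm_list(lruvec_memcg(lruvec));

	do {
		/* hands this worker the next eligible mm; drops the previous one */
		last = get_next_mm(lruvec, next_seq, swappiness, &mm);
		if (mm)
			walk_mm_one(lruvec, mm, swappiness);	/* hypothetical walk */
	} while (mm);

	/*
	 * Workers that finish early can wait; the last worker, which saw the
	 * iterator reach the end of the list, wakes everybody up.
	 */
	if (last)
		wake_up_all(&mm_list->nodes[nid].wait);
	else
		wait_event(mm_list->nodes[nid].wait,
			   next_seq <= READ_ONCE(mm_list->nodes[nid].cur_seq));
}

should_skip_mm() is what keeps sleeping processes out of the walk: an
mm is handed out only if it is currently in use on some CPU or has been
used on the node in question since the last walk, and only if it maps
enough pages of the relevant type to be worth walking.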