[v9,05/14] mm: multi-gen LRU: groundwork

Message ID	20220309021230.721028-6-yuzhao@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org> Date: Tue, 8 Mar 2022 19:12:22 -0700 In-Reply-To: <20220309021230.721028-1-yuzhao@google.com> Message-Id: <20220309021230.721028-6-yuzhao@google.com> Mime-Version: 1.0 References: <20220309021230.721028-1-yuzhao@google.com> Subject: [PATCH v9 05/14] mm: multi-gen LRU: groundwork From: Yu Zhao <yuzhao@google.com> To: Andrew Morton <akpm@linux-foundation.org>, Linus Torvalds <torvalds@linux-foundation.org> Cc: Andi Kleen <ak@linux.intel.com>, Aneesh Kumar <aneesh.kumar@linux.ibm.com>, Catalin Marinas <catalin.marinas@arm.com>, Dave Hansen <dave.hansen@linux.intel.com>, Hillf Danton <hdanton@sina.com>, Jens Axboe <axboe@kernel.dk>, Jesse Barnes <jsbarnes@google.com>, Johannes Weiner <hannes@cmpxchg.org>, Jonathan Corbet <corbet@lwn.net>, Matthew Wilcox <willy@infradead.org>, Mel Gorman <mgorman@suse.de>, Michael Larabel <Michael@michaellarabel.com>, Michal Hocko <mhocko@kernel.org>, Mike Rapoport <rppt@kernel.org>, Rik van Riel <riel@surriel.com>, Vlastimil Babka <vbabka@suse.cz>, Will Deacon <will@kernel.org>, Ying Huang <ying.huang@intel.com>, linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, page-reclaim@google.com, x86@kernel.org, Yu Zhao <yuzhao@google.com>, Brian Geffon <bgeffon@google.com>, Jan Alexander Steffens <heftig@archlinux.org>, Oleksandr Natalenko <oleksandr@natalenko.name>, Steven Barrett <steven@liquorix.net>, Suleiman Souhlal <suleiman@google.com>, Daniel Byrne <djbyrne@mtu.edu>, Donald Carr <d@chaos-reins.com>, " =?utf-8?q?Holger_Hoffst=C3=A4tte?= " <holger@applied-asynchrony.com>, Konstantin Kharlamov <Hi-Angel@yandex.ru>, Shuang Zhai <szhai2@cs.rochester.edu>, Sofia Trinh <sofia.trinh@edi.works>, Vaibhav Jain <vaibhav@linux.ibm.com> Precedence: list Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
Series	Multi-Gen LRU Framework \| expand [v9,00/14] Multi-Gen LRU Framework [v9,01/14] mm: x86, arm64: add arch_has_hw_pte_young() [v9,02/14] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG [v9,03/14] mm/vmscan.c: refactor shrink_node() [v9,04/14] Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller" [v9,05/14] mm: multi-gen LRU: groundwork [v9,06/14] mm: multi-gen LRU: minimal implementation [v9,07/14] mm: multi-gen LRU: exploit locality in rmap [v9,08/14] mm: multi-gen LRU: support page table walks [v9,09/14] mm: multi-gen LRU: optimize multiple memcgs [v9,10/14] mm: multi-gen LRU: kill switch [v9,11/14] mm: multi-gen LRU: thrashing prevention [v9,12/14] mm: multi-gen LRU: debugfs interface [v9,13/14] mm: multi-gen LRU: admin guide [v9,14/14] mm: multi-gen LRU: design doc

Yu Zhao March 9, 2022, 2:12 a.m. UTC

Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in folio->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.

There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations. They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of
working set estimation and proactive reclaim. These features are
required to optimize job scheduling (bin packing) in data centers. The
variable size of the sliding window is designed for such use cases
[1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive"
will be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
   channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
   flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
   applications usually do not prepare themselves for major page
   faults like they do for blocked I/O. E.g., GUI applications
   commonly use dedicated I/O threads to avoid blocking the rendering
   threads.
There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or
VM_RAND_READ is present; the latter channel is assumed to follow the
latter pattern unless outlying refaults have been observed.

The next patch will address the "outlying refaults". A few macros,
i.e., LRU_REFS_*, used later are added in this patch to make the
patchset less diffy.

A page is added to the youngest generation on faulting. The aging
needs to check the accessed bit at least twice before handing this
page over to the eviction. The first check takes care of the accessed
bit set on the initial fault; the second check makes sure this page
has not been used since then. This protocol, AKA second chance,
requires a minimum of two generations, hence MIN_NR_GENS.

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 fs/fuse/dev.c                     |   3 +-
 include/linux/mm.h                |   2 +
 include/linux/mm_inline.h         | 176 ++++++++++++++++++++++++++++++
 include/linux/mmzone.h            |  94 ++++++++++++++++
 include/linux/page-flags-layout.h |  11 +-
 include/linux/page-flags.h        |   4 +-
 include/linux/sched.h             |   4 +
 kernel/bounds.c                   |   7 ++
 mm/Kconfig                        |  10 ++
 mm/huge_memory.c                  |   3 +-
 mm/memcontrol.c                   |   2 +
 mm/memory.c                       |  25 +++++
 mm/mm_init.c                      |   6 +-
 mm/mmzone.c                       |   2 +
 mm/swap.c                         |   9 +-
 mm/vmscan.c                       |  73 +++++++++++++
 16 files changed, 418 insertions(+), 13 deletions(-)

Huang, Ying March 14, 2022, 8:08 a.m. UTC | #1

Hi, Yu,

Yu Zhao <yuzhao@google.com> writes:
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 3326ee3903f3..747ab1690bcf 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -892,6 +892,16 @@ config ANON_VMA_NAME
>  	  area from being merged with adjacent virtual memory areas due to the
>  	  difference in their name.
>  
> +# the multi-gen LRU {
> +config LRU_GEN
> +	bool "Multi-Gen LRU"
> +	depends on MMU
> +	# the following options can use up the spare bits in page flags
> +	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)

LRU_GEN depends on !MAXSMP.  So, What is the maximum NR_CPUS supported
by LRU_GEN?

> +	help
> +	  A high performance LRU implementation for memory overcommit.
> +# }
> +
>  source "mm/damon/Kconfig"
>  
>  endmenu

Best Regards,
Huang, Ying

[snip]

Yu Zhao March 14, 2022, 9:30 a.m. UTC | #2

On Mon, Mar 14, 2022 at 2:09 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Yu,
>
> Yu Zhao <yuzhao@google.com> writes:
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 3326ee3903f3..747ab1690bcf 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -892,6 +892,16 @@ config ANON_VMA_NAME
> >         area from being merged with adjacent virtual memory areas due to the
> >         difference in their name.
> >
> > +# the multi-gen LRU {
> > +config LRU_GEN
> > +     bool "Multi-Gen LRU"
> > +     depends on MMU
> > +     # the following options can use up the spare bits in page flags
> > +     depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
>
> LRU_GEN depends on !MAXSMP.  So, What is the maximum NR_CPUS supported
> by LRU_GEN?

LRU_GEN doesn't really care about NR_CPUS. IOW, it doesn't impose a
max number. The dependency is with NODES_SHIFT selected by MAXSMP:
    default "10" if MAXSMP
This combined with LAST_CPUPID_SHIFT can exhaust the spare bits in page flags.

MAXSMP is meant for kernel developers to test their code, and it
should not be used in production [1]. But some distros unfortunately
ship kernels built with this option, e.g., Fedora and Ubuntu. And
their users reported build errors to me after they applied MGLRU on
those kernels ("Not enough bits in page flags"). Let me add Fedora and
Ubuntu to this thread.

Fedora and Ubuntu,

Could you please clarify if there is a reason to ship kernels built
with MAXSMP? Otherwise, please consider disabling this option. Thanks.

As per above, MAXSMP enables ridiculously large numbers of CPUs and
NUMA nodes for testing purposes. It is detrimental to performance,
e.g., CPUMASK_OFFSTACK.

[1] https://lore.kernel.org/lkml/20131106055634.GA24044@gmail.com/

Huang, Ying March 15, 2022, 12:34 a.m. UTC | #3

Yu Zhao <yuzhao@google.com> writes:

> On Mon, Mar 14, 2022 at 2:09 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Hi, Yu,
>>
>> Yu Zhao <yuzhao@google.com> writes:
>> > diff --git a/mm/Kconfig b/mm/Kconfig
>> > index 3326ee3903f3..747ab1690bcf 100644
>> > --- a/mm/Kconfig
>> > +++ b/mm/Kconfig
>> > @@ -892,6 +892,16 @@ config ANON_VMA_NAME
>> >         area from being merged with adjacent virtual memory areas due to the
>> >         difference in their name.
>> >
>> > +# the multi-gen LRU {
>> > +config LRU_GEN
>> > +     bool "Multi-Gen LRU"
>> > +     depends on MMU
>> > +     # the following options can use up the spare bits in page flags
>> > +     depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
>>
>> LRU_GEN depends on !MAXSMP.  So, What is the maximum NR_CPUS supported
>> by LRU_GEN?
>
> LRU_GEN doesn't really care about NR_CPUS. IOW, it doesn't impose a
> max number. The dependency is with NODES_SHIFT selected by MAXSMP:
>     default "10" if MAXSMP
> This combined with LAST_CPUPID_SHIFT can exhaust the spare bits in page flags.

From the following code snippets from page-flags-layout.h,
LAST_CPUPID_SHIFT is related to NR_CPUS instead of NODES_SHIFT.

#define LAST__PID_SHIFT 8
#define LAST__PID_MASK  ((1 << LAST__PID_SHIFT)-1)

#define LAST__CPU_SHIFT NR_CPUS_BITS
#define LAST__CPU_MASK  ((1 << LAST__CPU_SHIFT)-1)

#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT+LAST__CPU_SHIFT)

Best Regards,
Huang, Ying

[snip]

Yu Zhao March 15, 2022, 12:50 a.m. UTC | #4

On Mon, Mar 14, 2022 at 6:34 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > On Mon, Mar 14, 2022 at 2:09 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Hi, Yu,
> >>
> >> Yu Zhao <yuzhao@google.com> writes:
> >> > diff --git a/mm/Kconfig b/mm/Kconfig
> >> > index 3326ee3903f3..747ab1690bcf 100644
> >> > --- a/mm/Kconfig
> >> > +++ b/mm/Kconfig
> >> > @@ -892,6 +892,16 @@ config ANON_VMA_NAME
> >> >         area from being merged with adjacent virtual memory areas due to the
> >> >         difference in their name.
> >> >
> >> > +# the multi-gen LRU {
> >> > +config LRU_GEN
> >> > +     bool "Multi-Gen LRU"
> >> > +     depends on MMU
> >> > +     # the following options can use up the spare bits in page flags
> >> > +     depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
> >>
> >> LRU_GEN depends on !MAXSMP.  So, What is the maximum NR_CPUS supported
> >> by LRU_GEN?
> >
> > LRU_GEN doesn't really care about NR_CPUS. IOW, it doesn't impose a
> > max number. The dependency is with NODES_SHIFT selected by MAXSMP:
> >     default "10" if MAXSMP
> > This combined with LAST_CPUPID_SHIFT can exhaust the spare bits in page flags.
>
> From the following code snippets from page-flags-layout.h,
> LAST_CPUPID_SHIFT is related to NR_CPUS instead of NODES_SHIFT.

It is. But LAST_CPUPID_NOT_IN_PAGE_FLAGS should always work but
NODE_NOT_IN_PAGE_FLAGS doesn't.

Barry Song March 16, 2022, 11:25 p.m. UTC | #5

On Wed, Mar 9, 2022 at 3:48 PM Yu Zhao <yuzhao@google.com> wrote:
>
> Evictable pages are divided into multiple generations for each lruvec.
> The youngest generation number is stored in lrugen->max_seq for both
> anon and file types as they are aged on an equal footing. The oldest
> generation numbers are stored in lrugen->min_seq[] separately for anon
> and file types as clean file pages can be evicted regardless of swap
> constraints. These three variables are monotonically increasing.
>
> Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
> in order to fit into the gen counter in folio->flags. Each truncated
> generation number is an index to lrugen->lists[]. The sliding window
> technique is used to track at least MIN_NR_GENS and at most
> MAX_NR_GENS generations. The gen counter stores a value within [1,
> MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
> stores 0.
>
> There are two conceptually independent procedures: "the aging", which
> produces young generations, and "the eviction", which consumes old
> generations. They form a closed-loop system, i.e., "the page reclaim".
> Both procedures can be invoked from userspace for the purposes of
> working set estimation and proactive reclaim. These features are
> required to optimize job scheduling (bin packing) in data centers. The
> variable size of the sliding window is designed for such use cases
> [1][2].
>
> To avoid confusion, the terms "hot" and "cold" will be applied to the
> multi-gen LRU, as a new convention; the terms "active" and "inactive"
> will be applied to the active/inactive LRU, as usual.
>
> The protection of hot pages and the selection of cold pages are based
> on page access channels and patterns. There are two access channels:
> one through page tables and the other through file descriptors. The
> protection of the former channel is by design stronger because:
> 1. The uncertainty in determining the access patterns of the former
>    channel is higher due to the approximation of the accessed bit.
> 2. The cost of evicting the former channel is higher due to the TLB
>    flushes required and the likelihood of encountering the dirty bit.
> 3. The penalty of underprotecting the former channel is higher because
>    applications usually do not prepare themselves for major page
>    faults like they do for blocked I/O. E.g., GUI applications
>    commonly use dedicated I/O threads to avoid blocking the rendering
>    threads.
> There are also two access patterns: one with temporal locality and the
> other without. For the reasons listed above, the former channel is
> assumed to follow the former pattern unless VM_SEQ_READ or
> VM_RAND_READ is present; the latter channel is assumed to follow the
> latter pattern unless outlying refaults have been observed.
>
> The next patch will address the "outlying refaults". A few macros,
> i.e., LRU_REFS_*, used later are added in this patch to make the
> patchset less diffy.
>
> A page is added to the youngest generation on faulting. The aging
> needs to check the accessed bit at least twice before handing this
> page over to the eviction. The first check takes care of the accessed
> bit set on the initial fault; the second check makes sure this page
> has not been used since then. This protocol, AKA second chance,
> requires a minimum of two generations, hence MIN_NR_GENS.
>
> [1] https://dl.acm.org/doi/10.1145/3297858.3304053
> [2] https://dl.acm.org/doi/10.1145/3503222.3507731
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
>  fs/fuse/dev.c                     |   3 +-
>  include/linux/mm.h                |   2 +
>  include/linux/mm_inline.h         | 176 ++++++++++++++++++++++++++++++
>  include/linux/mmzone.h            |  94 ++++++++++++++++
>  include/linux/page-flags-layout.h |  11 +-
>  include/linux/page-flags.h        |   4 +-
>  include/linux/sched.h             |   4 +
>  kernel/bounds.c                   |   7 ++
>  mm/Kconfig                        |  10 ++
>  mm/huge_memory.c                  |   3 +-
>  mm/memcontrol.c                   |   2 +
>  mm/memory.c                       |  25 +++++
>  mm/mm_init.c                      |   6 +-
>  mm/mmzone.c                       |   2 +
>  mm/swap.c                         |   9 +-
>  mm/vmscan.c                       |  73 +++++++++++++
>  16 files changed, 418 insertions(+), 13 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 592730fd6e42..e7c0aa6d61ce 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -785,7 +785,8 @@ static int fuse_check_page(struct page *page)
>                1 << PG_active |
>                1 << PG_workingset |
>                1 << PG_reclaim |
> -              1 << PG_waiters))) {
> +              1 << PG_waiters |
> +              LRU_GEN_MASK | LRU_REFS_MASK))) {
>                 dump_page(page, "fuse: trying to steal weird page");
>                 return 1;
>         }
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5744a3fc4716..c1162659d824 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1032,6 +1032,8 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
>  #define ZONES_PGOFF            (NODES_PGOFF - ZONES_WIDTH)
>  #define LAST_CPUPID_PGOFF      (ZONES_PGOFF - LAST_CPUPID_WIDTH)
>  #define KASAN_TAG_PGOFF                (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
> +#define LRU_GEN_PGOFF          (KASAN_TAG_PGOFF - LRU_GEN_WIDTH)
> +#define LRU_REFS_PGOFF         (LRU_GEN_PGOFF - LRU_REFS_WIDTH)
>
>  /*
>   * Define the bit shifts to access each section.  For non-existent
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 2c24f5ac3e2a..e3594171b421 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -38,6 +38,9 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
>  {
>         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>
> +       lockdep_assert_held(&lruvec->lru_lock);
> +       WARN_ON_ONCE(nr_pages != (int)nr_pages);
> +
>         __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
>         __mod_zone_page_state(&pgdat->node_zones[zid],
>                                 NR_ZONE_LRU_BASE + lru, nr_pages);
> @@ -99,11 +102,178 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
>         return lru;
>  }
>
> +#ifdef CONFIG_LRU_GEN
> +
> +static inline bool lru_gen_enabled(void)
> +{
> +       return true;
> +}
> +
> +static inline bool lru_gen_in_fault(void)
> +{
> +       return current->in_lru_fault;
> +}
> +
> +static inline int lru_gen_from_seq(unsigned long seq)
> +{
> +       return seq % MAX_NR_GENS;
> +}
> +
> +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> +{
> +       unsigned long max_seq = lruvec->lrugen.max_seq;
> +
> +       VM_BUG_ON(gen >= MAX_NR_GENS);
> +
> +       /* see the comment on MIN_NR_GENS */
> +       return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> +}
> +
> +static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *folio,
> +                                      int old_gen, int new_gen)
> +{
> +       int type = folio_is_file_lru(folio);
> +       int zone = folio_zonenum(folio);
> +       int delta = folio_nr_pages(folio);
> +       enum lru_list lru = type * LRU_INACTIVE_FILE;
> +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +       VM_BUG_ON(old_gen != -1 && old_gen >= MAX_NR_GENS);
> +       VM_BUG_ON(new_gen != -1 && new_gen >= MAX_NR_GENS);
> +       VM_BUG_ON(old_gen == -1 && new_gen == -1);
> +
> +       if (old_gen >= 0)
> +               WRITE_ONCE(lrugen->nr_pages[old_gen][type][zone],
> +                          lrugen->nr_pages[old_gen][type][zone] - delta);
> +       if (new_gen >= 0)
> +               WRITE_ONCE(lrugen->nr_pages[new_gen][type][zone],
> +                          lrugen->nr_pages[new_gen][type][zone] + delta);
> +
> +       /* addition */
> +       if (old_gen < 0) {
> +               if (lru_gen_is_active(lruvec, new_gen))
> +                       lru += LRU_ACTIVE;
> +               __update_lru_size(lruvec, lru, zone, delta);
> +               return;
> +       }
> +
> +       /* deletion */
> +       if (new_gen < 0) {
> +               if (lru_gen_is_active(lruvec, old_gen))
> +                       lru += LRU_ACTIVE;
> +               __update_lru_size(lruvec, lru, zone, -delta);
> +               return;
> +       }
> +}
> +
> +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +       int gen;
> +       unsigned long old_flags, new_flags;
> +       int type = folio_is_file_lru(folio);
> +       int zone = folio_zonenum(folio);
> +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +       if (folio_test_unevictable(folio))
> +               return false;
> +       /*
> +        * There are three common cases for this page:
> +        * 1. If it's hot, e.g., freshly faulted in or previously hot and
> +        *    migrated, add it to the youngest generation.

usually, one page is not active when it is faulted in. till its second
access is detected, it can be active.


> +        * 2. If it's cold but can't be evicted immediately, i.e., an anon page
> +        *    not in swapcache or a dirty page pending writeback, add it to the
> +        *    second oldest generation.
> +        * 3. Everything else (clean, cold) is added to the oldest generation.
> +        */
> +       if (folio_test_active(folio))
> +               gen = lru_gen_from_seq(lrugen->max_seq);
> +       else if ((type == LRU_GEN_ANON && !folio_test_swapcache(folio)) ||
> +                (folio_test_reclaim(folio) &&
> +                 (folio_test_dirty(folio) || folio_test_writeback(folio))))
> +               gen = lru_gen_from_seq(lrugen->min_seq[type] + 1);
> +       else
> +               gen = lru_gen_from_seq(lrugen->min_seq[type]);
> +
> +       do {
> +               new_flags = old_flags = READ_ONCE(folio->flags);
> +               VM_BUG_ON_FOLIO(new_flags & LRU_GEN_MASK, folio);
> +
> +               /* see the comment on MIN_NR_GENS */
> +               new_flags &= ~(LRU_GEN_MASK | BIT(PG_active));
> +               new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
> +       } while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
> +
> +       lru_gen_update_size(lruvec, folio, -1, gen);
> +       /* for folio_rotate_reclaimable() */
> +       if (reclaiming)
> +               list_add_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
> +       else
> +               list_add(&folio->lru, &lrugen->lists[gen][type][zone]);
> +
> +       return true;
> +}
> +
> +static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +       int gen;
> +       unsigned long old_flags, new_flags;
> +
> +       do {
> +               new_flags = old_flags = READ_ONCE(folio->flags);
> +               if (!(new_flags & LRU_GEN_MASK))
> +                       return false;
> +
> +               VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
> +               VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
> +
> +               gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> +
> +               new_flags &= ~LRU_GEN_MASK;
> +               /* for shrink_page_list() */
> +               if (reclaiming)
> +                       new_flags &= ~(BIT(PG_referenced) | BIT(PG_reclaim));
> +               else if (lru_gen_is_active(lruvec, gen))
> +                       new_flags |= BIT(PG_active);
> +       } while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
> +
> +       lru_gen_update_size(lruvec, folio, gen, -1);
> +       list_del(&folio->lru);
> +
> +       return true;
> +}
> +
> +#else
> +
> +static inline bool lru_gen_enabled(void)
> +{
> +       return false;
> +}
> +
> +static inline bool lru_gen_in_fault(void)
> +{
> +       return false;
> +}
> +
> +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +       return false;
> +}
> +
> +static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +       return false;
> +}
> +
> +#endif /* CONFIG_LRU_GEN */
> +
>  static __always_inline
>  void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
>  {
>         enum lru_list lru = folio_lru_list(folio);
>
> +       if (lru_gen_add_folio(lruvec, folio, false))
> +               return;
> +
>         update_lru_size(lruvec, lru, folio_zonenum(folio),
>                         folio_nr_pages(folio));
>         list_add(&folio->lru, &lruvec->lists[lru]);
> @@ -120,6 +290,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
>  {
>         enum lru_list lru = folio_lru_list(folio);
>
> +       if (lru_gen_add_folio(lruvec, folio, true))
> +               return;
> +
>         update_lru_size(lruvec, lru, folio_zonenum(folio),
>                         folio_nr_pages(folio));
>         list_add_tail(&folio->lru, &lruvec->lists[lru]);
> @@ -134,6 +307,9 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
>  static __always_inline
>  void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
>  {
> +       if (lru_gen_del_folio(lruvec, folio, false))
> +               return;
> +
>         list_del(&folio->lru);
>         update_lru_size(lruvec, folio_lru_list(folio), folio_zonenum(folio),
>                         -folio_nr_pages(folio));
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index aed44e9b5d89..a88e27d85693 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -303,6 +303,96 @@ enum lruvec_flags {
>                                          */
>  };
>
> +#endif /* !__GENERATING_BOUNDS_H */
> +
> +/*
> + * Evictable pages are divided into multiple generations. The youngest and the
> + * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
> + * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
> + * offset within MAX_NR_GENS, gen, indexes the LRU list of the corresponding
> + * generation. The gen counter in folio->flags stores gen+1 while a page is on
> + * one of lrugen->lists[]. Otherwise it stores 0.
> + *
> + * A page is added to the youngest generation on faulting. The aging needs to
> + * check the accessed bit at least twice before handing this page over to the
> + * eviction. The first check takes care of the accessed bit set on the initial
> + * fault; the second check makes sure this page hasn't been used since then.
> + * This process, AKA second chance, requires a minimum of two generations,
> + * hence MIN_NR_GENS. And to maintain ABI compatibility with the active/inactive
> + * LRU, these two generations are considered active; the rest of generations, if
> + * they exist, are considered inactive. See lru_gen_is_active(). PG_active is
> + * always cleared while a page is on one of lrugen->lists[] so that the aging
> + * needs not to worry about it. And it's set again when a page considered active
> + * is isolated for non-reclaiming purposes, e.g., migration. See
> + * lru_gen_add_folio() and lru_gen_del_folio().
> + *
> + * MAX_NR_GENS is set to 4 so that the multi-gen LRU has twice of the categories
> + * of the active/inactive LRU.
> + *
> + */
> +#define MIN_NR_GENS            2U
> +#define MAX_NR_GENS            4U
> +
> +#ifndef __GENERATING_BOUNDS_H
> +
> +struct lruvec;
> +
> +#define LRU_GEN_MASK           ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
> +#define LRU_REFS_MASK          ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)

The commit log said nothing about REFS flags and tiers.
but the code is here. either the commit log lacks something
or the code should belong to the next patch?

> +
> +#ifdef CONFIG_LRU_GEN
> +
> +enum {
> +       LRU_GEN_ANON,
> +       LRU_GEN_FILE,
> +};
> +
> +/*
> + * The youngest generation number is stored in max_seq for both anon and file
> + * types as they are aged on an equal footing. The oldest generation numbers are
> + * stored in min_seq[] separately for anon and file types as clean file pages
> + * can be evicted regardless of swap constraints.
> + *
> + * Normally anon and file min_seq are in sync. But if swapping is constrained,
> + * e.g., out of swap space, file min_seq is allowed to advance and leave anon
> + * min_seq behind.
> + */
> +struct lru_gen_struct {
> +       /* the aging increments the youngest generation number */
> +       unsigned long max_seq;
> +       /* the eviction increments the oldest generation numbers */
> +       unsigned long min_seq[ANON_AND_FILE];
> +       /* the multi-gen LRU lists */
> +       struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
> +       /* the sizes of the above lists */
> +       unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
> +};
> +
> +void lru_gen_init_lruvec(struct lruvec *lruvec);
> +
> +#ifdef CONFIG_MEMCG
> +void lru_gen_init_memcg(struct mem_cgroup *memcg);
> +void lru_gen_exit_memcg(struct mem_cgroup *memcg);
> +#endif
> +
> +#else /* !CONFIG_LRU_GEN */
> +
> +static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
> +{
> +}
> +
> +#ifdef CONFIG_MEMCG
> +static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
> +{
> +}
> +
> +static inline void lru_gen_exit_memcg(struct mem_cgroup *memcg)
> +{
> +}
> +#endif
> +
> +#endif /* CONFIG_LRU_GEN */
> +
>  struct lruvec {
>         struct list_head                lists[NR_LRU_LISTS];
>         /* per lruvec lru_lock for memcg */
> @@ -320,6 +410,10 @@ struct lruvec {
>         unsigned long                   refaults[ANON_AND_FILE];
>         /* Various lruvec state flags (enum lruvec_flags) */
>         unsigned long                   flags;
> +#ifdef CONFIG_LRU_GEN
> +       /* evictable pages divided into generations */
> +       struct lru_gen_struct           lrugen;
> +#endif
>  #ifdef CONFIG_MEMCG
>         struct pglist_data *pgdat;
>  #endif
> diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
> index ef1e3e736e14..c1946cdb845f 100644
> --- a/include/linux/page-flags-layout.h
> +++ b/include/linux/page-flags-layout.h
> @@ -55,7 +55,8 @@
>  #define SECTIONS_WIDTH         0
>  #endif
>
> -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
> +#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \
> +       <= BITS_PER_LONG - NR_PAGEFLAGS
>  #define NODES_WIDTH            NODES_SHIFT
>  #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
>  #error "Vmemmap: No space for nodes field in page flags"
> @@ -89,8 +90,8 @@
>  #define LAST_CPUPID_SHIFT 0
>  #endif
>
> -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \
> -       <= BITS_PER_LONG - NR_PAGEFLAGS
> +#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
> +       KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
>  #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
>  #else
>  #define LAST_CPUPID_WIDTH 0
> @@ -100,8 +101,8 @@
>  #define LAST_CPUPID_NOT_IN_PAGE_FLAGS
>  #endif
>
> -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \
> -       > BITS_PER_LONG - NR_PAGEFLAGS
> +#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
> +       KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
>  #error "Not enough bits in page flags"
>  #endif
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 1c3b6e5c8bfd..a95518ca98eb 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -935,7 +935,7 @@ __PAGEFLAG(Isolated, isolated, PF_ANY);
>          1UL << PG_private      | 1UL << PG_private_2   |       \
>          1UL << PG_writeback    | 1UL << PG_reserved    |       \
>          1UL << PG_slab         | 1UL << PG_active      |       \
> -        1UL << PG_unevictable  | __PG_MLOCKED)
> +        1UL << PG_unevictable  | __PG_MLOCKED | LRU_GEN_MASK)
>
>  /*
>   * Flags checked when a page is prepped for return by the page allocator.
> @@ -946,7 +946,7 @@ __PAGEFLAG(Isolated, isolated, PF_ANY);
>   * alloc-free cycle to prevent from reusing the page.
>   */
>  #define PAGE_FLAGS_CHECK_AT_PREP       \
> -       (PAGEFLAGS_MASK & ~__PG_HWPOISON)
> +       ((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK)
>
>  #define PAGE_FLAGS_PRIVATE                             \
>         (1UL << PG_private | 1UL << PG_private_2)
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 75ba8aa60248..e7fe784b11aa 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -914,6 +914,10 @@ struct task_struct {
>  #ifdef CONFIG_MEMCG
>         unsigned                        in_user_fault:1;
>  #endif
> +#ifdef CONFIG_LRU_GEN
> +       /* whether the LRU algorithm may apply to this access */
> +       unsigned                        in_lru_fault:1;
> +#endif
>  #ifdef CONFIG_COMPAT_BRK
>         unsigned                        brk_randomized:1;
>  #endif
> diff --git a/kernel/bounds.c b/kernel/bounds.c
> index 9795d75b09b2..e08fb89f87f4 100644
> --- a/kernel/bounds.c
> +++ b/kernel/bounds.c
> @@ -22,6 +22,13 @@ int main(void)
>         DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
>  #endif
>         DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
> +#ifdef CONFIG_LRU_GEN
> +       DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1));
> +       DEFINE(LRU_REFS_WIDTH, 0);
> +#else
> +       DEFINE(LRU_GEN_WIDTH, 0);
> +       DEFINE(LRU_REFS_WIDTH, 0);
> +#endif
>         /* End of constants */
>
>         return 0;
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 3326ee3903f3..747ab1690bcf 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -892,6 +892,16 @@ config ANON_VMA_NAME
>           area from being merged with adjacent virtual memory areas due to the
>           difference in their name.
>
> +# the multi-gen LRU {
> +config LRU_GEN
> +       bool "Multi-Gen LRU"
> +       depends on MMU
> +       # the following options can use up the spare bits in page flags
> +       depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
> +       help
> +         A high performance LRU implementation for memory overcommit.
> +# }
> +
>  source "mm/damon/Kconfig"
>
>  endmenu
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 406a3c28c026..3df389fd307f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2364,7 +2364,8 @@ static void __split_huge_page_tail(struct page *head, int tail,
>  #ifdef CONFIG_64BIT
>                          (1L << PG_arch_2) |
>  #endif
> -                        (1L << PG_dirty)));
> +                        (1L << PG_dirty) |
> +                        LRU_GEN_MASK | LRU_REFS_MASK));
>
>         /* ->mapping in first tail page is compound_mapcount */
>         VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 36e9f38c919d..3fcbfeda259b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5121,6 +5121,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
>
>  static void mem_cgroup_free(struct mem_cgroup *memcg)
>  {
> +       lru_gen_exit_memcg(memcg);
>         memcg_wb_domain_exit(memcg);
>         __mem_cgroup_free(memcg);
>  }
> @@ -5180,6 +5181,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>         memcg->deferred_split_queue.split_queue_len = 0;
>  #endif
>         idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
> +       lru_gen_init_memcg(memcg);
>         return memcg;
>  fail:
>         mem_cgroup_id_remove(memcg);
> diff --git a/mm/memory.c b/mm/memory.c
> index a7379196a47e..d27e5f1a2533 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4754,6 +4754,27 @@ static inline void mm_account_fault(struct pt_regs *regs,
>                 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
>  }
>
> +#ifdef CONFIG_LRU_GEN
> +static void lru_gen_enter_fault(struct vm_area_struct *vma)
> +{
> +       /* the LRU algorithm doesn't apply to sequential or random reads */
> +       current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));
> +}
> +
> +static void lru_gen_exit_fault(void)
> +{
> +       current->in_lru_fault = false;
> +}
> +#else
> +static void lru_gen_enter_fault(struct vm_area_struct *vma)
> +{
> +}
> +
> +static void lru_gen_exit_fault(void)
> +{
> +}
> +#endif /* CONFIG_LRU_GEN */
> +
>  /*
>   * By the time we get here, we already hold the mm semaphore
>   *
> @@ -4785,11 +4806,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>         if (flags & FAULT_FLAG_USER)
>                 mem_cgroup_enter_user_fault();
>
> +       lru_gen_enter_fault(vma);
> +
>         if (unlikely(is_vm_hugetlb_page(vma)))
>                 ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
>         else
>                 ret = __handle_mm_fault(vma, address, flags);
>
> +       lru_gen_exit_fault();
> +
>         if (flags & FAULT_FLAG_USER) {
>                 mem_cgroup_exit_user_fault();
>                 /*
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 9ddaf0e1b0ab..0d7b2bd2454a 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -65,14 +65,16 @@ void __init mminit_verify_pageflags_layout(void)
>
>         shift = 8 * sizeof(unsigned long);
>         width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH
> -               - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH;
> +               - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_REFS_WIDTH;
>         mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
> -               "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n",
> +               "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags %d\n",
>                 SECTIONS_WIDTH,
>                 NODES_WIDTH,
>                 ZONES_WIDTH,
>                 LAST_CPUPID_WIDTH,
>                 KASAN_TAG_WIDTH,
> +               LRU_GEN_WIDTH,
> +               LRU_REFS_WIDTH,
>                 NR_PAGEFLAGS);
>         mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
>                 "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n",
> diff --git a/mm/mmzone.c b/mm/mmzone.c
> index eb89d6e018e2..2ec0d7793424 100644
> --- a/mm/mmzone.c
> +++ b/mm/mmzone.c
> @@ -81,6 +81,8 @@ void lruvec_init(struct lruvec *lruvec)
>
>         for_each_lru(lru)
>                 INIT_LIST_HEAD(&lruvec->lists[lru]);
> +
> +       lru_gen_init_lruvec(lruvec);
>  }
>
>  #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)
> diff --git a/mm/swap.c b/mm/swap.c
> index bcf3ac288b56..e5f2ab3dab4a 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -462,6 +462,11 @@ void folio_add_lru(struct folio *folio)
>         VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);
>         VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>
> +       /* see the comment in lru_gen_add_folio() */
> +       if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> +           lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> +               folio_set_active(folio);

So here is our magic to make folio active as long as it is
faulted in? i really don't think the below comment is good,
could we say our purpose directly and explicitly?

 /* see the comment in lru_gen_add_folio() */

> +
>         folio_get(folio);
>         local_lock(&lru_pvecs.lock);
>         pvec = this_cpu_ptr(&lru_pvecs.lru_add);
> @@ -563,7 +568,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
>
>  static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
>  {
> -       if (PageActive(page) && !PageUnevictable(page)) {
> +       if (!PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
>                 int nr_pages = thp_nr_pages(page);
>
>                 del_page_from_lru_list(page, lruvec);
> @@ -677,7 +682,7 @@ void deactivate_file_page(struct page *page)
>   */
>  void deactivate_page(struct page *page)
>  {
> -       if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> +       if (PageLRU(page) && !PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
>                 struct pagevec *pvec;
>
>                 local_lock(&lru_pvecs.lock);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8e744cdf802f..65eb668abf2d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3042,6 +3042,79 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
>         return can_demote(pgdat->node_id, sc);
>  }
>
> +#ifdef CONFIG_LRU_GEN
> +
> +/******************************************************************************
> + *                          shorthand helpers
> + ******************************************************************************/
> +
> +#define for_each_gen_type_zone(gen, type, zone)                                \
> +       for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)                   \
> +               for ((type) = 0; (type) < ANON_AND_FILE; (type)++)      \
> +                       for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
> +
> +static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
> +{
> +       struct pglist_data *pgdat = NODE_DATA(nid);
> +
> +#ifdef CONFIG_MEMCG
> +       if (memcg) {
> +               struct lruvec *lruvec = &memcg->nodeinfo[nid]->lruvec;
> +
> +               /* for hotadd_new_pgdat() */
> +               if (!lruvec->pgdat)
> +                       lruvec->pgdat = pgdat;
> +
> +               return lruvec;
> +       }
> +#endif
> +       return pgdat ? &pgdat->__lruvec : NULL;
> +}
> +
> +/******************************************************************************
> + *                          initialization
> + ******************************************************************************/
> +
> +void lru_gen_init_lruvec(struct lruvec *lruvec)
> +{
> +       int gen, type, zone;
> +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +       lrugen->max_seq = MIN_NR_GENS + 1;
> +
> +       for_each_gen_type_zone(gen, type, zone)
> +               INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
> +}
> +
> +#ifdef CONFIG_MEMCG
> +void lru_gen_init_memcg(struct mem_cgroup *memcg)
> +{
> +}
> +
> +void lru_gen_exit_memcg(struct mem_cgroup *memcg)
> +{
> +       int nid;
> +
> +       for_each_node(nid) {
> +               struct lruvec *lruvec = get_lruvec(memcg, nid);
> +
> +               VM_BUG_ON(memchr_inv(lruvec->lrugen.nr_pages, 0,
> +                                    sizeof(lruvec->lrugen.nr_pages)));
> +       }
> +}
> +#endif
> +
> +static int __init init_lru_gen(void)
> +{
> +       BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
> +       BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
> +
> +       return 0;
> +};
> +late_initcall(init_lru_gen);
> +
> +#endif /* CONFIG_LRU_GEN */
> +
>  static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  {
>         unsigned long nr[NR_LRU_LISTS];
> --
> 2.35.1.616.g0bdcbb4464-goog
>

Thanks
Barry

Yu Zhao March 21, 2022, 9:04 a.m. UTC | #6

On Wed, Mar 16, 2022 at 5:25 PM Barry Song <21cnbao@gmail.com> wrote:
>
...
> > +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > +{
> > +       int gen;
> > +       unsigned long old_flags, new_flags;
> > +       int type = folio_is_file_lru(folio);
> > +       int zone = folio_zonenum(folio);
> > +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > +
> > +       if (folio_test_unevictable(folio))
> > +               return false;
> > +       /*
> > +        * There are three common cases for this page:
> > +        * 1. If it's hot, e.g., freshly faulted in or previously hot and
> > +        *    migrated, add it to the youngest generation.
>
> usually, one page is not active when it is faulted in. till its second
> access is detected, it can be active.

The active/inactive LRU *assumes* this; MGLRU *assumes* the opposite,
and there is no "active" in MGLRU -- we call it hot to avoid confusion
:)

> > +        * 2. If it's cold but can't be evicted immediately, i.e., an anon page
> > +        *    not in swapcache or a dirty page pending writeback, add it to the
> > +        *    second oldest generation.
> > +        * 3. Everything else (clean, cold) is added to the oldest generation.
> > +        */
...
> > +#define LRU_GEN_MASK           ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
> > +#define LRU_REFS_MASK          ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
>
> The commit log said nothing about REFS flags and tiers.
> but the code is here. either the commit log lacks something
> or the code should belong to the next patch?

It did:
  A few macros, i.e., LRU_REFS_*, used later are added in this patch
to make the patchset less diffy.

> > @@ -462,6 +462,11 @@ void folio_add_lru(struct folio *folio)
> >         VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);
> >         VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> >
> > +       /* see the comment in lru_gen_add_folio() */
> > +       if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> > +           lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> > +               folio_set_active(folio);
>
> So here is our magic to make folio active as long as it is
> faulted in? i really don't think the below comment is good,
> could we say our purpose directly and explicitly?
>
>  /* see the comment in lru_gen_add_folio() */

I generally keep comments in a few major locations and reference them
from many other minior locations so that it's easier to manage in the
long run. It is a hassle for reviews but once in the tree you can jump
to lru_gen_add_folio() with ctags/cscope or find all places that
reference it by grepping. Assuming we state the purpose, which is to
make lru_gen_add_folio() add the page to the youngest generation, you
still want to go to lru_gen_add_folio() to check if this is really the
case. So why bother :)

Barry Song March 21, 2022, 11:47 a.m. UTC | #7

On Mon, Mar 21, 2022 at 10:04 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Wed, Mar 16, 2022 at 5:25 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> ...
> > > +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > > +{
> > > +       int gen;
> > > +       unsigned long old_flags, new_flags;
> > > +       int type = folio_is_file_lru(folio);
> > > +       int zone = folio_zonenum(folio);
> > > +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > > +
> > > +       if (folio_test_unevictable(folio))
> > > +               return false;
> > > +       /*
> > > +        * There are three common cases for this page:
> > > +        * 1. If it's hot, e.g., freshly faulted in or previously hot and
> > > +        *    migrated, add it to the youngest generation.
> >
> > usually, one page is not active when it is faulted in. till its second
> > access is detected, it can be active.
>
> The active/inactive LRU *assumes* this; MGLRU *assumes* the opposite,
> and there is no "active" in MGLRU -- we call it hot to avoid confusion
> :)

yep.

>
> > > +        * 2. If it's cold but can't be evicted immediately, i.e., an anon page
> > > +        *    not in swapcache or a dirty page pending writeback, add it to the
> > > +        *    second oldest generation.
> > > +        * 3. Everything else (clean, cold) is added to the oldest generation.
> > > +        */
> ...
> > > +#define LRU_GEN_MASK           ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
> > > +#define LRU_REFS_MASK          ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
> >
> > The commit log said nothing about REFS flags and tiers.
> > but the code is here. either the commit log lacks something
> > or the code should belong to the next patch?
>
> It did:
>   A few macros, i.e., LRU_REFS_*, used later are added in this patch
> to make the patchset less diffy.

sorry for missing that.

>
> > > @@ -462,6 +462,11 @@ void folio_add_lru(struct folio *folio)
> > >         VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);
> > >         VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> > >
> > > +       /* see the comment in lru_gen_add_folio() */
> > > +       if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> > > +           lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> > > +               folio_set_active(folio);
> >
> > So here is our magic to make folio active as long as it is
> > faulted in? i really don't think the below comment is good,
> > could we say our purpose directly and explicitly?
> >
> >  /* see the comment in lru_gen_add_folio() */
>
> I generally keep comments in a few major locations and reference them
> from many other minior locations so that it's easier to manage in the
> long run. It is a hassle for reviews but once in the tree you can jump
> to lru_gen_add_folio() with ctags/cscope or find all places that
> reference it by grepping. Assuming we state the purpose, which is to
> make lru_gen_add_folio() add the page to the youngest generation, you
> still want to go to lru_gen_add_folio() to check if this is really the
> case. So why bother :)

well understood though my pain was that I needed to email you to get
confirmed this is really the case.

Thanks
Barry

Justin Forbes March 21, 2022, 6:58 p.m. UTC | #8

On Mon, Mar 14, 2022 at 4:30 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Mar 14, 2022 at 2:09 AM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Hi, Yu,
> >
> > Yu Zhao <yuzhao@google.com> writes:
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index 3326ee3903f3..747ab1690bcf 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -892,6 +892,16 @@ config ANON_VMA_NAME
> > >         area from being merged with adjacent virtual memory areas due to the
> > >         difference in their name.
> > >
> > > +# the multi-gen LRU {
> > > +config LRU_GEN
> > > +     bool "Multi-Gen LRU"
> > > +     depends on MMU
> > > +     # the following options can use up the spare bits in page flags
> > > +     depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
> >
> > LRU_GEN depends on !MAXSMP.  So, What is the maximum NR_CPUS supported
> > by LRU_GEN?
>
> LRU_GEN doesn't really care about NR_CPUS. IOW, it doesn't impose a
> max number. The dependency is with NODES_SHIFT selected by MAXSMP:
>     default "10" if MAXSMP
> This combined with LAST_CPUPID_SHIFT can exhaust the spare bits in page flags.
>
> MAXSMP is meant for kernel developers to test their code, and it
> should not be used in production [1]. But some distros unfortunately
> ship kernels built with this option, e.g., Fedora and Ubuntu. And
> their users reported build errors to me after they applied MGLRU on
> those kernels ("Not enough bits in page flags"). Let me add Fedora and
> Ubuntu to this thread.
>
> Fedora and Ubuntu,
>
> Could you please clarify if there is a reason to ship kernels built
> with MAXSMP? Otherwise, please consider disabling this option. Thanks.
>
> As per above, MAXSMP enables ridiculously large numbers of CPUs and
> NUMA nodes for testing purposes. It is detrimental to performance,
> e.g., CPUMASK_OFFSTACK.

It was enabled for Fedora, and RHEL because we did need more than 512
CPUs, originally only in RHEL until SGI (years ago) complained that
they were testing very large machines with Fedora.  The testing done
on RHEL showed that the performance impact was minimal.   For a very
long time we had MAXSMP off and carried a patch which allowed us to
turn on CPUMASK_OFFSTACK without debugging because there was supposed
to be "something else" coming.  In 2019 we gave up, dropped that patch
and just turned on MAXSMP.

I do not have any metrics for how often someone runs Fedora on a
ridiculously large machine these days, but I would guess that number
is not 0.

Justin

> [1] https://lore.kernel.org/lkml/20131106055634.GA24044@gmail.com/
>

Prarit Bhargava March 21, 2022, 7:17 p.m. UTC | #9

On 3/21/22 14:58, Justin Forbes wrote:
> On Mon, Mar 14, 2022 at 4:30 AM Yu Zhao <yuzhao@google.com> wrote:
>>
>> On Mon, Mar 14, 2022 at 2:09 AM Huang, Ying <ying.huang@intel.com> wrote:
>>>
>>> Hi, Yu,
>>>
>>> Yu Zhao <yuzhao@google.com> writes:
>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>> index 3326ee3903f3..747ab1690bcf 100644
>>>> --- a/mm/Kconfig
>>>> +++ b/mm/Kconfig
>>>> @@ -892,6 +892,16 @@ config ANON_VMA_NAME
>>>>          area from being merged with adjacent virtual memory areas due to the
>>>>          difference in their name.
>>>>
>>>> +# the multi-gen LRU {
>>>> +config LRU_GEN
>>>> +     bool "Multi-Gen LRU"
>>>> +     depends on MMU
>>>> +     # the following options can use up the spare bits in page flags
>>>> +     depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
>>>
>>> LRU_GEN depends on !MAXSMP.  So, What is the maximum NR_CPUS supported
>>> by LRU_GEN?
>>
>> LRU_GEN doesn't really care about NR_CPUS. IOW, it doesn't impose a
>> max number. The dependency is with NODES_SHIFT selected by MAXSMP:
>>      default "10" if MAXSMP
>> This combined with LAST_CPUPID_SHIFT can exhaust the spare bits in page flags.
>>
>> MAXSMP is meant for kernel developers to test their code, and it
>> should not be used in production [1]. But some distros unfortunately
>> ship kernels built with this option, e.g., Fedora and Ubuntu. And
>> their users reported build errors to me after they applied MGLRU on
>> those kernels ("Not enough bits in page flags"). Let me add Fedora and
>> Ubuntu to this thread.
>>
>> Fedora and Ubuntu,
>>
>> Could you please clarify if there is a reason to ship kernels built
>> with MAXSMP? Otherwise, please consider disabling this option. Thanks.
>>
>> As per above, MAXSMP enables ridiculously large numbers of CPUs and
>> NUMA nodes for testing purposes. It is detrimental to performance,
>> e.g., CPUMASK_OFFSTACK.
> 
> It was enabled for Fedora, and RHEL because we did need more than 512
> CPUs, originally only in RHEL until SGI (years ago) complained that
> they were testing very large machines with Fedora.  The testing done
> on RHEL showed that the performance impact was minimal.   For a very
> long time we had MAXSMP off and carried a patch which allowed us to
> turn on CPUMASK_OFFSTACK without debugging because there was supposed
> to be "something else" coming.  In 2019 we gave up, dropped that patch
> and just turned on MAXSMP.
> 
> I do not have any metrics for how often someone runs Fedora on a
> ridiculously large machine these days, but I would guess that number
> is not 0.

It is not 0.  I've seen data from large systems (1000+ logical threads) 
that are running Fedora albeit with a modified Fedora kernel.

Additionally the max limit for CPUS in RHEL is 1792, however, we have 
recently had a request to *double* that to 3584.  You should just assume 
that number will continue to increase.

P.


> 
> Justin
> 
>> [1] https://lore.kernel.org/lkml/20131106055634.GA24044@gmail.com/
>>
> _______________________________________________
> kernel mailing list -- kernel@lists.fedoraproject.org
> To unsubscribe send an email to kernel-leave@lists.fedoraproject.org
> Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: https://lists.fedoraproject.org/archives/list/kernel@lists.fedoraproject.org
> Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure

Yu Zhao March 22, 2022, 4:52 a.m. UTC | #10

On Mon, Mar 21, 2022 at 1:18 PM Prarit Bhargava <prarit@redhat.com> wrote:
>
> On 3/21/22 14:58, Justin Forbes wrote:
> > On Mon, Mar 14, 2022 at 4:30 AM Yu Zhao <yuzhao@google.com> wrote:
> >>
> >> On Mon, Mar 14, 2022 at 2:09 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>>
> >>> Hi, Yu,
> >>>
> >>> Yu Zhao <yuzhao@google.com> writes:
> >>>> diff --git a/mm/Kconfig b/mm/Kconfig
> >>>> index 3326ee3903f3..747ab1690bcf 100644
> >>>> --- a/mm/Kconfig
> >>>> +++ b/mm/Kconfig
> >>>> @@ -892,6 +892,16 @@ config ANON_VMA_NAME
> >>>>          area from being merged with adjacent virtual memory areas due to the
> >>>>          difference in their name.
> >>>>
> >>>> +# the multi-gen LRU {
> >>>> +config LRU_GEN
> >>>> +     bool "Multi-Gen LRU"
> >>>> +     depends on MMU
> >>>> +     # the following options can use up the spare bits in page flags
> >>>> +     depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
> >>>
> >>> LRU_GEN depends on !MAXSMP.  So, What is the maximum NR_CPUS supported
> >>> by LRU_GEN?
> >>
> >> LRU_GEN doesn't really care about NR_CPUS. IOW, it doesn't impose a
> >> max number. The dependency is with NODES_SHIFT selected by MAXSMP:
> >>      default "10" if MAXSMP
> >> This combined with LAST_CPUPID_SHIFT can exhaust the spare bits in page flags.
> >>
> >> MAXSMP is meant for kernel developers to test their code, and it
> >> should not be used in production [1]. But some distros unfortunately
> >> ship kernels built with this option, e.g., Fedora and Ubuntu. And
> >> their users reported build errors to me after they applied MGLRU on
> >> those kernels ("Not enough bits in page flags"). Let me add Fedora and
> >> Ubuntu to this thread.
> >>
> >> Fedora and Ubuntu,
> >>
> >> Could you please clarify if there is a reason to ship kernels built
> >> with MAXSMP? Otherwise, please consider disabling this option. Thanks.
> >>
> >> As per above, MAXSMP enables ridiculously large numbers of CPUs and
> >> NUMA nodes for testing purposes. It is detrimental to performance,
> >> e.g., CPUMASK_OFFSTACK.
> >
> > It was enabled for Fedora, and RHEL because we did need more than 512
> > CPUs, originally only in RHEL until SGI (years ago) complained that
> > they were testing very large machines with Fedora.  The testing done
> > on RHEL showed that the performance impact was minimal.   For a very
> > long time we had MAXSMP off and carried a patch which allowed us to
> > turn on CPUMASK_OFFSTACK without debugging because there was supposed
> > to be "something else" coming.  In 2019 we gave up, dropped that patch
> > and just turned on MAXSMP.
> >
> > I do not have any metrics for how often someone runs Fedora on a
> > ridiculously large machine these days, but I would guess that number
> > is not 0.
>
> It is not 0.  I've seen data from large systems (1000+ logical threads)
> that are running Fedora albeit with a modified Fedora kernel.
>
> Additionally the max limit for CPUS in RHEL is 1792, however, we have
> recently had a request to *double* that to 3584.  You should just assume
> that number will continue to increase.

Good to know. Thanks.

From the standpoint of overhead, I'd consider NR_CPUS=4096 and
NODES_SHIFT=7 as the next step, before going with MAXSMP.

[v9,05/14] mm: multi-gen LRU: groundwork

Commit Message

Comments

Patch