
[v7,05/12] mm: multigenerational LRU: minimal implementation

Message ID: 20220208081902.3550911-6-yuzhao@google.com
Series: Multigenerational LRU Framework

Commit Message

Yu Zhao Feb. 8, 2022, 8:18 a.m. UTC
To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multigenerational LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.

The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; the demotion of cold pages happens
consequently when it increments max_seq. Since the aging is only
interested in hot pages, its complexity is O(nr_hot_pages). Promotion
in the aging path doesn't require any LRU list operations, only the
updates of the gen counter and lrugen->nr_pages[]; demotion, unless it
is the result of the increment of max_seq, requires LRU list
operations, e.g., lru_deactivate_fn().
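
For illustration, the aging trigger can be modeled in plain C as below.
This is a standalone sketch, not the kernel code; the helper name is
made up here, and MIN_NR_GENS is assumed to be 2 as defined earlier in
the series:

  #define MIN_NR_GENS 2UL

  /* run the aging once the number of generations shrinks to the minimum */
  static int need_aging(unsigned long max_seq, unsigned long min_seq)
  {
          return max_seq - min_seq + 1 <= MIN_NR_GENS;
  }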

The eviction consumes old generations. Given an lruvec, it increments
min_seq when the list indexed by min_seq%MAX_NR_GENS becomes empty. A
feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both are
available from the same generation.
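
For illustration, the type selection can be sketched in plain C as
below. This is not the kernel code: the helper name is made up, and the
running averages and gains of the PID-like feedback loop are collapsed
into a simple refaulted/evicted comparison:

  /* 0 == anon, 1 == file; returns the type to evict from */
  static int select_type(const unsigned long min_seq[2],
                         const unsigned long refaulted[2],
                         const unsigned long evicted[2])
  {
          /* evict from the older type first */
          if (min_seq[0] != min_seq[1])
                  return min_seq[1] < min_seq[0];

          /* otherwise pick the type with fewer refaults per evicted page */
          return refaulted[1] * (evicted[0] + 1) <=
                 refaulted[0] * (evicted[1] + 1);
  }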

Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses through file descriptors. A
page accessed N times through file descriptors is in tier
order_base_2(N). Tiers don't have dedicated lrugen->lists[], only bits
in folio->flags. In contrast to moving across generations, which
requires the LRU lock, moving between tiers only involves operations on
folio->flags. The feedback loop also monitors refaults over all tiers
and decides from which tiers (N>1) pages should be promoted, using the
first tier (N=0,1) as the baseline. The first tier contains single-use
unmapped clean pages, which are most likely the best eviction
candidates. The eviction promotes a page to the next generation, i.e.,
min_seq+1 rather than max_seq, if the feedback loop decides so. This
approach has the following advantages:
1) It removes the cost of activation in the buffered access path by
   inferring whether pages accessed multiple times through file
   descriptors are statistically hot and thus worth promoting in the
   eviction path.
2) It takes pages accessed through page tables into account and avoids
   overprotecting pages accessed multiple times through file
   descriptors. (Pages accessed through page tables are in the first
   tier since N=0.)
3) More tiers provide better protection for pages accessed more than
   twice through file descriptors, when under heavy buffered I/O
   workloads.
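
Below is a small standalone model of the tier arithmetic above (the
helpers are open-coded/made up for this sketch; in the kernel the
access count lives in spare bits of folio->flags and the number of
tiers comes from CONFIG_TIERS_PER_GEN):

  #include <stdio.h>

  #define MAX_NR_TIERS 4 /* CONFIG_TIERS_PER_GEN default */

  /* open-coded order_base_2(): smallest order with (1 << order) >= n */
  static int order_base_2(unsigned int n)
  {
          int order = 0;

          while ((1u << order) < n)
                  order++;
          return order;
  }

  /* a page accessed N times through file descriptors is in tier order_base_2(N) */
  static int tier_from_refs(unsigned int refs)
  {
          int tier = order_base_2(refs);

          /* saturate at the last tier (an assumption of this sketch) */
          return tier < MAX_NR_TIERS ? tier : MAX_NR_TIERS - 1;
  }

  int main(void)
  {
          for (unsigned int refs = 0; refs <= 8; refs++)
                  printf("N=%u -> tier %d\n", refs, tier_from_refs(refs));
          return 0;
  }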

Server benchmark results:
  Single workload:
    fio (buffered I/O): +[47, 49]%
                IOPS         BW
      5.17-rc2: 2242k        8759MiB/s
      patch1-5: 3321k        12.7GiB/s

  Single workload:
    memcached (anon): +[101, 105]%
                Ops/sec      KB/sec
      5.17-rc2: 476771.79    18544.31
      patch1-5: 972526.07    37826.95

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was used as a ram disk only to reduce the variance in the
    results.

    patch drivers/block/brd.c <<EOF
    99,100c99,100
    < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    < 	page = alloc_page(gfp_flags);
    ---
    > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > 	page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    EOF

    cat >>/etc/memcached.conf <<EOF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    EOF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.17-rc2
      38.05%  page_vma_mapped_walk
      20.86%  lzo1x_1_do_compress (real work)
       6.16%  do_raw_spin_lock
       4.61%  _raw_spin_unlock_irq
       2.20%  vma_interval_tree_iter_next
       2.19%  vma_interval_tree_subtree_search
       2.15%  page_referenced_one
       1.93%  anon_vma_interval_tree_iter_first
       1.65%  ptep_clear_flush
       1.00%  __zram_bvec_write

    patch1-5
      39.73%  lzo1x_1_do_compress (real work)
      14.96%  page_vma_mapped_walk
       6.97%  _raw_spin_unlock_irq
       3.07%  do_raw_spin_lock
       2.53%  anon_vma_interval_tree_iter_first
       2.04%  ptep_clear_flush
       1.82%  __zram_bvec_write
       1.76%  __anon_vma_interval_tree_subtree_search
       1.57%  memmove
       1.45%  free_unref_page_list

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    Chrome OS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
---
 include/linux/mm.h        |   1 +
 include/linux/mm_inline.h |  15 +
 include/linux/mmzone.h    |  35 ++
 mm/Kconfig                |  44 +++
 mm/swap.c                 |  46 ++-
 mm/vmscan.c               | 784 +++++++++++++++++++++++++++++++++++++-
 mm/workingset.c           | 119 +++++-
 7 files changed, 1039 insertions(+), 5 deletions(-)

Comments

Yu Zhao Feb. 8, 2022, 8:33 a.m. UTC | #1
On Tue, Feb 08, 2022 at 01:18:55AM -0700, Yu Zhao wrote:

<snipped>

> diff --git a/mm/Kconfig b/mm/Kconfig
> index 3326ee3903f3..e899623d5df0 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -892,6 +892,50 @@ config ANON_VMA_NAME
>  	  area from being merged with adjacent virtual memory areas due to the
>  	  difference in their name.
>  
> +# multigenerational LRU {
> +config LRU_GEN
> +	bool "Multigenerational LRU"
> +	depends on MMU
> +	# the following options can use up the spare bits in page flags
> +	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
> +	help
> +	  A high performance LRU implementation for memory overcommit. See
> +	  Documentation/admin-guide/mm/multigen_lru.rst and
> +	  Documentation/vm/multigen_lru.rst for details.
> +
> +config NR_LRU_GENS
> +	int "Max number of generations"
> +	depends on LRU_GEN
> +	range 4 31
> +	default 4
> +	help
> +	  Do not increase this value unless you plan to use working set
> +	  estimation and proactive reclaim to optimize job scheduling in data
> +	  centers.
> +
> +	  This option uses order_base_2(N+1) bits in page flags.
> +
> +config TIERS_PER_GEN
> +	int "Number of tiers per generation"
> +	depends on LRU_GEN
> +	range 2 4
> +	default 4
> +	help
> +	  Do not decrease this value unless you run out of spare bits in page
> +	  flags, i.e., you see the "Not enough bits in page flags" build error.
> +
> +	  This option uses N-2 bits in page flags.

Moved Kconfig to this patch as suggested by:
https://lore.kernel.org/linux-mm/Yd6uHYtjGfgqjDpw@dhcp22.suse.cz/

Added two new macros as requested here:
https://lore.kernel.org/linux-mm/87czkyzhfe.fsf@linux.ibm.com/

<snipped>

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d75a5738d1dc..5f0d92838712 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1285,9 +1285,11 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>  
>  	if (PageSwapCache(page)) {
>  		swp_entry_t swap = { .val = page_private(page) };
> -		mem_cgroup_swapout(page, swap);
> +
> +		/* get a shadow entry before mem_cgroup_swapout() clears folio_memcg() */
>  		if (reclaimed && !mapping_exiting(mapping))
>  			shadow = workingset_eviction(page, target_memcg);
> +		mem_cgroup_swapout(page, swap);
>  		__delete_from_swap_cache(page, swap, shadow);
>  		xa_unlock_irq(&mapping->i_pages);
>  		put_swap_page(page, swap);
> @@ -2721,6 +2723,9 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
>  	unsigned long file;
>  	struct lruvec *target_lruvec;
>  
> +	if (lru_gen_enabled())
> +		return;
> +
>  	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
>  
>  	/*
> @@ -3042,15 +3047,47 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
>  
>  #ifdef CONFIG_LRU_GEN
>  
> +enum {
> +	TYPE_ANON,
> +	TYPE_FILE,
> +};

Added two new macros as requested here:
https://lore.kernel.org/linux-mm/87czkyzhfe.fsf@linux.ibm.com/

<snipped>

> +static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> +{
> +	bool need_aging;
> +	long nr_to_scan;
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +	int swappiness = get_swappiness(memcg);
> +	DEFINE_MAX_SEQ(lruvec);
> +	DEFINE_MIN_SEQ(lruvec);
> +
> +	mem_cgroup_calculate_protection(NULL, memcg);
> +
> +	if (mem_cgroup_below_min(memcg))
> +		return;

Added mem_cgroup_calculate_protection() for readability as requested here:
https://lore.kernel.org/linux-mm/Ydf9RXPch5ddg%2FWC@dhcp22.suse.cz/

<snipped>
Johannes Weiner Feb. 8, 2022, 4:50 p.m. UTC | #2
Hi Yu,

Thanks for restructuring this from the last version. It's easier to
learn the new model when you start out with the bare bones, then let
optimizations and self-contained features follow later.

On Tue, Feb 08, 2022 at 01:18:55AM -0700, Yu Zhao wrote:
> To avoid confusions, the terms "promotion" and "demotion" will be
> applied to the multigenerational LRU, as a new convention; the terms
> "activation" and "deactivation" will be applied to the active/inactive
> LRU, as usual.
> 
> The aging produces young generations. Given an lruvec, it increments
> max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
> promotes hot pages to the youngest generation when it finds them
> accessed through page tables; the demotion of cold pages happens
> consequently when it increments max_seq. Since the aging is only
> interested in hot pages, its complexity is O(nr_hot_pages). Promotion
> in the aging path doesn't require any LRU list operations, only the
> updates of the gen counter and lrugen->nr_pages[]; demotion, unless
> as the result of the increment of max_seq, requires LRU list
> operations, e.g., lru_deactivate_fn().

I'm having trouble with this changelog. It opens with a footnote and
summarizes certain aspects of the implementation whose importance to
the reader aren't entirely clear at this time.

It would be better to start with a high-level overview of the problem
and how this algorithm solves it. How the reclaim algorithm needs to
find the page that is most suitable for eviction and to signal when
it's time to give up and OOM. Then explain how grouping pages into
multiple generations accomplishes that - in particular compared to the
current two use-once/use-many lists.

Explain the problem of MMU vs syscall references, and how tiering
addresses this.

Explain the significance of refaults and how the algorithm responds to
them. Not in terms of which running averages are updated, but in terms
of user-visible behavior ("will start swapping (more)" etc.)

Express *intent*, how it's supposed to behave wrt workloads and memory
pressure. The code itself will explain the how, its complexity etc.

Most reviewers will understand the fundamental challenges of page
reclaim. The difficulty is matching individual aspects of the problem
space to your individual components and design choices you have made.

Let us in on that thinking, please ;)

> @@ -892,6 +892,50 @@ config ANON_VMA_NAME
>  	  area from being merged with adjacent virtual memory areas due to the
>  	  difference in their name.
>  
> +# multigenerational LRU {
> +config LRU_GEN
> +	bool "Multigenerational LRU"
> +	depends on MMU
> +	# the following options can use up the spare bits in page flags
> +	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
> +	help
> +	  A high performance LRU implementation for memory overcommit. See
> +	  Documentation/admin-guide/mm/multigen_lru.rst and
> +	  Documentation/vm/multigen_lru.rst for details.

These files don't exist at this time, please introduce them before or
when referencing them. If they document things introduced later in the
patchset, please start with a minimal version of the file and update
it as you extend the algorithm and add optimizations etc.

It's really important to only reference previous patches, not later
ones. This allows reviewers to read the patches linearly.  Having to
search for missing pieces in patches you haven't looked at yet is bad.

> +config NR_LRU_GENS
> +	int "Max number of generations"
> +	depends on LRU_GEN
> +	range 4 31
> +	default 4
> +	help
> +	  Do not increase this value unless you plan to use working set
> +	  estimation and proactive reclaim to optimize job scheduling in data
> +	  centers.
> +
> +	  This option uses order_base_2(N+1) bits in page flags.
> +
> +config TIERS_PER_GEN
> +	int "Number of tiers per generation"
> +	depends on LRU_GEN
> +	range 2 4
> +	default 4
> +	help
> +	  Do not decrease this value unless you run out of spare bits in page
> +	  flags, i.e., you see the "Not enough bits in page flags" build error.
> +
> +	  This option uses N-2 bits in page flags.

Linus had pointed out that we shouldn't ask these questions of the
user. How do you pick numbers here? I'm familiar with workingset
estimation and proactive reclaim usecases but I wouldn't know.

Even if we removed the config option and hardcoded the number, this is
a question for kernel developers: What does "4" mean? How would
behavior differ if it were 3 or 5 instead? Presumably there is some
sort of behavior gradient. "As you increase the number of
generations/tiers, the user-visible behavior of the kernel will..."
This should really be documented.

I'd also reiterate Mel's point: Distribution kernels need to support
the full spectrum of applications and production environments. Unless
using non-defaults it's an extremely niche usecase (like compiling out
BUG() calls) compile-time options are not the right choice. If we do
need a tunable, it could make more sense to have a compile time upper
limit (to determine page flag space) combined with a runtime knob?

Thanks!
Yu Zhao Feb. 10, 2022, 2:53 a.m. UTC | #3
On Tue, Feb 08, 2022 at 11:50:09AM -0500, Johannes Weiner wrote:

<snipped>

> On Tue, Feb 08, 2022 at 01:18:55AM -0700, Yu Zhao wrote:
> > To avoid confusions, the terms "promotion" and "demotion" will be
> > applied to the multigenerational LRU, as a new convention; the terms
> > "activation" and "deactivation" will be applied to the active/inactive
> > LRU, as usual.
> > 
> > The aging produces young generations. Given an lruvec, it increments
> > max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
> > promotes hot pages to the youngest generation when it finds them
> > accessed through page tables; the demotion of cold pages happens
> > consequently when it increments max_seq. Since the aging is only
> > interested in hot pages, its complexity is O(nr_hot_pages). Promotion
> > in the aging path doesn't require any LRU list operations, only the
> > updates of the gen counter and lrugen->nr_pages[]; demotion, unless
> > as the result of the increment of max_seq, requires LRU list
> > operations, e.g., lru_deactivate_fn().
> 
> I'm having trouble with this changelog. It opens with a footnote and
> summarizes certain aspects of the implementation whose importance to
> the reader aren't entirely clear at this time.
> 
> It would be better to start with a high-level overview of the problem
> and how this algorithm solves it. How the reclaim algorithm needs to
> find the page that is most suitable for eviction and to signal when
> it's time to give up and OOM. Then explain how grouping pages into
> multiple generations accomplishes that - in particular compared to the
> current two use-once/use-many lists.

Hi Johannes,

Thanks for reviewing!

I suspect the information you are looking for might have been in the
patchset but is scattered in a few places. Could you please glance at
the following pieces and let me know
  1. whether they cover some of the points you asked for
  2. and if so, whether there is a better order/place to present them?

The previous patch gives a quick overview of the architecture:
https://lore.kernel.org/linux-mm/20220208081902.3550911-5-yuzhao@google.com/

  Evictable pages are divided into multiple generations for each lruvec.
  The youngest generation number is stored in lrugen->max_seq for both
  anon and file types as they're aged on an equal footing. The oldest
  generation numbers are stored in lrugen->min_seq[] separately for anon
  and file types as clean file pages can be evicted regardless of swap
  constraints. These three variables are monotonically increasing.
  
  Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
  in order to fit into the gen counter in folio->flags. Each truncated
  generation number is an index to lrugen->lists[]. The sliding window
  technique is used to track at least MIN_NR_GENS and at most
  MAX_NR_GENS generations. The gen counter stores (seq%MAX_NR_GENS)+1
  while a page is on one of lrugen->lists[]. Otherwise it stores 0.
  
  There are two conceptually independent processes (as in the
  manufacturing process): "the aging", which produces young generations,
  and "the eviction", which consumes old generations. They form a
  closed-loop system, i.e., "the page reclaim". Both processes can be
  invoked from userspace for the purposes of working set estimation and
  proactive reclaim. These features are required to optimize job
  scheduling (bin packing) in data centers. The variable size of the
  sliding window is designed for such use cases...
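
(As an aside, the counter arithmetic above boils down to something like
the plain C below -- just a model using the default of 4 generations;
lru_gen_from_seq() exists in the series, the second helper is made up
here:)

  #define MAX_NR_GENS 4UL

  /* index into lrugen->lists[] */
  static unsigned long lru_gen_from_seq(unsigned long seq)
  {
          return seq % MAX_NR_GENS;
  }

  /* gen counter in folio->flags; 0 is reserved for "not on any list" */
  static unsigned long gen_counter_from_seq(unsigned long seq)
  {
          return lru_gen_from_seq(seq) + 1;
  }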

And the design doc contains a bit more details, and I'd be happy to
present it earlier, if you think doing so would help.
https://lore.kernel.org/linux-mm/20220208081902.3550911-13-yuzhao@google.com/

> Explain the problem of MMU vs syscall references, and how tiering
> addresses this.

The previous patch also touched on this point:
https://lore.kernel.org/linux-mm/20220208081902.3550911-5-yuzhao@google.com/

  The protection of hot pages and the selection of cold pages are based
  on page access channels and patterns. There are two access channels:
  one through page tables and the other through file descriptors. The
  protection of the former channel is by design stronger because:
  1) The uncertainty in determining the access patterns of the former
     channel is higher due to the approximation of the accessed bit.
  2) The cost of evicting the former channel is higher due to the TLB
     flushes required and the likelihood of encountering the dirty bit.
  3) The penalty of underprotecting the former channel is higher because
     applications usually don't prepare themselves for major page faults
     like they do for blocked I/O. E.g., GUI applications commonly use
     dedicated I/O threads to avoid blocking the rendering threads.
  There are also two access patterns: one with temporal locality and the
  other without. For the reasons listed above, the former channel is
  assumed to follow the former pattern unless VM_SEQ_READ or
  VM_RAND_READ is present, and the latter channel is assumed to follow
  the latter pattern unless outlying refaults have been observed.

> Explain the significance of refaults and how the algorithm responds to
> them. Not in terms of which running averages are updated, but in terms
> of user-visible behavior ("will start swapping (more)" etc.)

And this patch touched on how tiers would help:
  1) It removes the cost of activation in the buffered access path by
     inferring whether pages accessed multiple times through file
     descriptors are statistically hot and thus worth promoting in the
     eviction path.
  2) It takes pages accessed through page tables into account and avoids
     overprotecting pages accessed multiple times through file
     descriptors. (Pages accessed through page tables are in the first
     tier since N=0.)
  3) More tiers provide better protection for pages accessed more than
     twice through file descriptors, when under heavy buffered I/O
     workloads.

And the design doc:
https://lore.kernel.org/linux-mm/20220208081902.3550911-13-yuzhao@google.com/

  To select a type and a tier to evict from, it first compares min_seq[]
  to select the older type. If they are equal, it selects the type whose
  first tier has a lower refault percentage. The first tier contains
  single-use unmapped clean pages, which are the best bet.

> Express *intent*, how it's supposed to behave wrt workloads and memory
> pressure. The code itself will explain the how, its complexity etc.

Hmm... I'm not so sure about this part. It seems to me this is
equivalent to describing how it works.

> Most reviewers will understand the fundamental challenges of page
> reclaim. The difficulty is matching individual aspects of the problem
> space to your individual components and design choices you have made.
> 
> Let us in on that thinking, please ;)

Agreed. I'm sure I haven't covered everything. So I'm trying to figure
out what's important but missing/insufficient.

> > @@ -892,6 +892,50 @@ config ANON_VMA_NAME
> >  	  area from being merged with adjacent virtual memory areas due to the
> >  	  difference in their name.
> >  
> > +# multigenerational LRU {
> > +config LRU_GEN
> > +	bool "Multigenerational LRU"
> > +	depends on MMU
> > +	# the following options can use up the spare bits in page flags
> > +	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
> > +	help
> > +	  A high performance LRU implementation for memory overcommit. See
> > +	  Documentation/admin-guide/mm/multigen_lru.rst and
> > +	  Documentation/vm/multigen_lru.rst for details.
> 
> These files don't exist at this time, please introduce them before or
> when referencing them. If they document things introduced later in the
> patchset, please start with a minimal version of the file and update
> it as you extend the algorithm and add optimizations etc.
> 
> It's really important to only reference previous patches, not later
> ones. This allows reviewers to read the patches linearly.  Having to
> search for missing pieces in patches you haven't looked at yet is bad.

Okay, will remove this bit from this patch.

> > +config NR_LRU_GENS
> > +	int "Max number of generations"
> > +	depends on LRU_GEN
> > +	range 4 31
> > +	default 4
> > +	help
> > +	  Do not increase this value unless you plan to use working set
> > +	  estimation and proactive reclaim to optimize job scheduling in data
> > +	  centers.
> > +
> > +	  This option uses order_base_2(N+1) bits in page flags.
> > +
> > +config TIERS_PER_GEN
> > +	int "Number of tiers per generation"
> > +	depends on LRU_GEN
> > +	range 2 4
> > +	default 4
> > +	help
> > +	  Do not decrease this value unless you run out of spare bits in page
> > +	  flags, i.e., you see the "Not enough bits in page flags" build error.
> > +
> > +	  This option uses N-2 bits in page flags.
> 
> Linus had pointed out that we shouldn't ask these questions of the
> user. How do you pick numbers here? I'm familiar with workingset
> estimation and proactive reclaim usecases but I wouldn't know.
> 
> Even if we removed the config option and hardcoded the number, this is
> a question for kernel developers: What does "4" mean? How would
> behavior differ if it were 3 or 5 instead? Presumably there is some
> sort of behavior gradient. "As you increase the number of
> generations/tiers, the user-visible behavior of the kernel will..."
> This should really be documented.
> 
> I'd also reiterate Mel's point: Distribution kernels need to support
> the full spectrum of applications and production environments. Unless
> using non-defaults it's an extremely niche usecase (like compiling out
> BUG() calls) compile-time options are not the right choice. If we do
> need a tunable, it could make more sense to have a compile time upper
> limit (to determine page flag space) combined with a runtime knob?

I agree, and I think only time can answer all these questions :)

This effort is not in its final stage but at its very beginning. More
experiments and wider adoption are required to see how it's going to
evolve or where it leads. For now, there is just no way to tell whether
those values make sense for the majority or whether we need runtime knobs.

These are valid concerns, but TBH, I think they are minor ones because
most users need not worry about them -- this patchset has been used
in several downstream kernels and I haven't heard any complaints about
those options/values:
https://lore.kernel.org/linux-mm/20220208081902.3550911-1-yuzhao@google.com/

1. Android ARCVM
2. Arch Linux Zen
3. Chrome OS
4. Liquorix
5. post-factum
6. XanMod

Then why do we need these options? Because there are always exceptions,
as stated in the descriptions of those options. Sometimes we just can't
decide everything for users -- the answers lie in their use cases. The
bottom line is, if this starts bothering people or gets in somebody's
way, I'd be glad to revisit. Fair enough?

Thanks!
Hillf Danton Feb. 13, 2022, 10:04 a.m. UTC | #4
Hello Yu

On Tue,  8 Feb 2022 01:18:55 -0700 Yu Zhao wrote:
> +
> +/******************************************************************************
> + *                          the aging
> + ******************************************************************************/
> +
> +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +	unsigned long old_flags, new_flags;
> +	int type = folio_is_file_lru(folio);
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> +
> +	do {
> +		new_flags = old_flags = READ_ONCE(folio->flags);
> +		VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
> +
> +		new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;

Is the chance zero for deadloop if new_gen != old_gen?

> +		new_gen = (old_gen + 1) % MAX_NR_GENS;
> +
> +		new_flags &= ~LRU_GEN_MASK;
> +		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
> +		new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
> +		/* for folio_end_writeback() */

		/* for folio_end_writeback() and sort_folio() */ in terms of
reclaiming?

> +		if (reclaiming)
> +			new_flags |= BIT(PG_reclaim);
> +	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
> +
> +	lru_gen_balance_size(lruvec, folio, old_gen, new_gen);
> +
> +	return new_gen;
> +}

...

> +/******************************************************************************
> + *                          the eviction
> + ******************************************************************************/
> +
> +static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
> +{

Nit, the 80-column-char format is preferred.

> +	bool success;
> +	int gen = folio_lru_gen(folio);
> +	int type = folio_is_file_lru(folio);
> +	int zone = folio_zonenum(folio);
> +	int tier = folio_lru_tier(folio);
> +	int delta = folio_nr_pages(folio);
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +	VM_BUG_ON_FOLIO(gen >= MAX_NR_GENS, folio);
> +
> +	if (!folio_evictable(folio)) {
> +		success = lru_gen_del_folio(lruvec, folio, true);
> +		VM_BUG_ON_FOLIO(!success, folio);
> +		folio_set_unevictable(folio);
> +		lruvec_add_folio(lruvec, folio);
> +		__count_vm_events(UNEVICTABLE_PGCULLED, delta);
> +		return true;
> +	}
> +
> +	if (type && folio_test_anon(folio) && folio_test_dirty(folio)) {
> +		success = lru_gen_del_folio(lruvec, folio, true);
> +		VM_BUG_ON_FOLIO(!success, folio);
> +		folio_set_swapbacked(folio);
> +		lruvec_add_folio_tail(lruvec, folio);
> +		return true;
> +	}
> +
> +	if (tier > tier_idx) {
> +		int hist = lru_hist_from_seq(lrugen->min_seq[type]);
> +
> +		gen = folio_inc_gen(lruvec, folio, false);
> +		list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
> +
> +		WRITE_ONCE(lrugen->promoted[hist][type][tier - 1],
> +			   lrugen->promoted[hist][type][tier - 1] + delta);
> +		__mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
> +		return true;
> +	}
> +
> +	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> +	    (type && folio_test_dirty(folio))) {
> +		gen = folio_inc_gen(lruvec, folio, true);
> +		list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
> +		return true;

Make the cold dirty page cache younger instead of writing it out in the
background reclaimer context, and the question arising is whether laundering
is deferred until the flusher threads are woken up in the following patches.

> +	}
> +
> +	return false;
> +}

Hillf
Yu Zhao Feb. 17, 2022, 12:13 a.m. UTC | #5
On Sun, Feb 13, 2022 at 06:04:17PM +0800, Hillf Danton wrote:

Hi Hillf,

> On Tue,  8 Feb 2022 01:18:55 -0700 Yu Zhao wrote:
> > +
> > +/******************************************************************************
> > + *                          the aging
> > + ******************************************************************************/
> > +
> > +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > +{
> > +	unsigned long old_flags, new_flags;
> > +	int type = folio_is_file_lru(folio);
> > +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > +	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> > +
> > +	do {
> > +		new_flags = old_flags = READ_ONCE(folio->flags);
> > +		VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
> > +
> > +		new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> 
> Is the chance zero for deadloop if new_gen != old_gen?

No, because the counter is only cleared during isolation, and here
it's protected against isolation (under the LRU lock, which is asserted
in the lru_gen_balance_size() -> lru_gen_update_size() path).

> > +		new_gen = (old_gen + 1) % MAX_NR_GENS;
> > +
> > +		new_flags &= ~LRU_GEN_MASK;
> > +		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
> > +		new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
> > +		/* for folio_end_writeback() */
> 
> 		/* for folio_end_writeback() and sort_folio() */ in terms of
> reclaiming?

Right.

> > +		if (reclaiming)
> > +			new_flags |= BIT(PG_reclaim);
> > +	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
> > +
> > +	lru_gen_balance_size(lruvec, folio, old_gen, new_gen);
> > +
> > +	return new_gen;
> > +}
> 
> ...
> 
> > +/******************************************************************************
> > + *                          the eviction
> > + ******************************************************************************/
> > +
> > +static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
> > +{
> 
> Nit, the 80-column-char format is prefered.

Will do.

> > +	bool success;
> > +	int gen = folio_lru_gen(folio);
> > +	int type = folio_is_file_lru(folio);
> > +	int zone = folio_zonenum(folio);
> > +	int tier = folio_lru_tier(folio);
> > +	int delta = folio_nr_pages(folio);
> > +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > +
> > +	VM_BUG_ON_FOLIO(gen >= MAX_NR_GENS, folio);
> > +
> > +	if (!folio_evictable(folio)) {
> > +		success = lru_gen_del_folio(lruvec, folio, true);
> > +		VM_BUG_ON_FOLIO(!success, folio);
> > +		folio_set_unevictable(folio);
> > +		lruvec_add_folio(lruvec, folio);
> > +		__count_vm_events(UNEVICTABLE_PGCULLED, delta);
> > +		return true;
> > +	}
> > +
> > +	if (type && folio_test_anon(folio) && folio_test_dirty(folio)) {
> > +		success = lru_gen_del_folio(lruvec, folio, true);
> > +		VM_BUG_ON_FOLIO(!success, folio);
> > +		folio_set_swapbacked(folio);
> > +		lruvec_add_folio_tail(lruvec, folio);
> > +		return true;
> > +	}
> > +
> > +	if (tier > tier_idx) {
> > +		int hist = lru_hist_from_seq(lrugen->min_seq[type]);
> > +
> > +		gen = folio_inc_gen(lruvec, folio, false);
> > +		list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
> > +
> > +		WRITE_ONCE(lrugen->promoted[hist][type][tier - 1],
> > +			   lrugen->promoted[hist][type][tier - 1] + delta);
> > +		__mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
> > +		return true;
> > +	}
> > +
> > +	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > +	    (type && folio_test_dirty(folio))) {
> > +		gen = folio_inc_gen(lruvec, folio, true);
> > +		list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
> > +		return true;
> 
> Make the cold dirty page cache younger instead of writeout in the backgroungd
> reclaimer context, and the question rising is if laundry is defered until the
> flusher threads are waken up in the following patches.

This is a good point. In contrast to the active/inactive LRU, MGLRU
doesn't write out dirty file pages (kswapd or direct reclaimers) --
this is writeback's job and it should be better at doing this. In
fact, commit 21b4ee7029 ("xfs: drop ->writepage completely") has
disabled dirty file page writeouts in the reclaim path completely.

Reclaim indirectly wakes up writeback after clean file pages drop
below a threshold (dirty ratio). However, dirty pages might be
undercounted on a system that uses a large number of mmapped file pages.
MGLRU optimizes this by calling folio_mark_dirty() on pages mapped
by dirty PTEs when scanning page tables. (Why not since it's already
looking at the accessed bit.)

The commit above explained this design choice from the performance
aspect. From the implementation aspect, it also creates a boundary
between reclaim and writeback. This simplifies things, e.g., the
PageWriteback() check in shrink_page_list is no longer relevant for
MGLRU, nor is the top half of the PageDirty() check.
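
Roughly (a sketch only, not the code in this series; the page table
walker and the exact dirty/swap-backed checks are omitted):

  /* while scanning a PTE range for the accessed bit */
  if (pte_dirty(pte) && !folio_test_dirty(folio))
          folio_mark_dirty(folio);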
Huang, Ying Feb. 23, 2022, 8:27 a.m. UTC | #6
Hi, Yu,

Yu Zhao <yuzhao@google.com> writes:

> To avoid confusions, the terms "promotion" and "demotion" will be
> applied to the multigenerational LRU, as a new convention; the terms
> "activation" and "deactivation" will be applied to the active/inactive
> LRU, as usual.

In the memory tiering related commits and patchset, for example as follows,

commit 668e4147d8850df32ca41e28f52c146025ca45c6
Author: Yang Shi <yang.shi@linux.alibaba.com>
Date:   Thu Sep 2 14:59:19 2021 -0700

    mm/vmscan: add page demotion counter

https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/

"demote" and "promote" are used for migrating pages between different
types of memory.  Is it better for us to avoid overloading these words
too much, to avoid possible confusion?

> +static int get_swappiness(struct mem_cgroup *memcg)
> +{
> +	return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
> +	       mem_cgroup_swappiness(memcg) : 0;
> +}

After we introduced demotion support in the Linux kernel, anonymous
pages in the fast memory node can be demoted to the slow memory node
via the page reclaiming mechanism, as in the following commit.  Can you
consider that too?

commit a2a36488a61cefe3129295c6e75b3987b9d7fd13
Author: Keith Busch <kbusch@kernel.org>
Date:   Thu Sep 2 14:59:26 2021 -0700

    mm/vmscan: Consider anonymous pages without swap
    
    Reclaim anonymous pages if a migration path is available now that demotion
    provides a non-swap recourse for reclaiming anon pages.
    
    Note that this check is subtly different from the can_age_anon_pages()
    checks.  This mechanism checks whether a specific page in a specific
    context can actually be reclaimed, given current swap space and cgroup
    limits.
    
    can_age_anon_pages() is a much simpler and more preliminary check which
    just says whether there is a possibility of future reclaim.


Best Regards,
Huang, Ying
Yu Zhao Feb. 23, 2022, 9:36 a.m. UTC | #7
On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Yu,
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > To avoid confusions, the terms "promotion" and "demotion" will be
> > applied to the multigenerational LRU, as a new convention; the terms
> > "activation" and "deactivation" will be applied to the active/inactive
> > LRU, as usual.
>
> In the memory tiering related commits and patchset, for example as follows,
>
> commit 668e4147d8850df32ca41e28f52c146025ca45c6
> Author: Yang Shi <yang.shi@linux.alibaba.com>
> Date:   Thu Sep 2 14:59:19 2021 -0700
>
>     mm/vmscan: add page demotion counter
>
> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
>
> "demote" and "promote" is used for migrating pages between different
> types of memory.  Is it better for us to avoid overloading these words
> too much to avoid the possible confusion?

Given that LRU and migration are usually different contexts, I think
we'd be fine, unless we want a third pair of terms.

> > +static int get_swappiness(struct mem_cgroup *memcg)
> > +{
> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
> > +            mem_cgroup_swappiness(memcg) : 0;
> > +}
>
> After we introduced demotion support in Linux kernel.  The anonymous
> pages in the fast memory node could be demoted to the slow memory node
> via the page reclaiming mechanism as in the following commit.  Can you
> consider that too?

Sure. How do I check whether there is still space on the slow node?
Huang, Ying Feb. 24, 2022, 12:59 a.m. UTC | #8
Yu Zhao <yuzhao@google.com> writes:

> On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Hi, Yu,
>>
>> Yu Zhao <yuzhao@google.com> writes:
>>
>> > To avoid confusions, the terms "promotion" and "demotion" will be
>> > applied to the multigenerational LRU, as a new convention; the terms
>> > "activation" and "deactivation" will be applied to the active/inactive
>> > LRU, as usual.
>>
>> In the memory tiering related commits and patchset, for example as follows,
>>
>> commit 668e4147d8850df32ca41e28f52c146025ca45c6
>> Author: Yang Shi <yang.shi@linux.alibaba.com>
>> Date:   Thu Sep 2 14:59:19 2021 -0700
>>
>>     mm/vmscan: add page demotion counter
>>
>> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
>>
>> "demote" and "promote" is used for migrating pages between different
>> types of memory.  Is it better for us to avoid overloading these words
>> too much to avoid the possible confusion?
>
> Given that LRU and migration are usually different contexts, I think
> we'd be fine, unless we want a third pair of terms.

This was true before memory tiering was introduced.  In systems with
multiple types of memory (called memory tiering), the LRU is used to
identify pages to be migrated to the slow memory node.  Please take a
look at can_demote(), which is called in shrink_page_list().

>> > +static int get_swappiness(struct mem_cgroup *memcg)
>> > +{
>> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
>> > +            mem_cgroup_swappiness(memcg) : 0;
>> > +}
>>
>> After we introduced demotion support in Linux kernel.  The anonymous
>> pages in the fast memory node could be demoted to the slow memory node
>> via the page reclaiming mechanism as in the following commit.  Can you
>> consider that too?
>
> Sure. How do I check whether there is still space on the slow node?

You can always check the watermark of the slow node.  But for now we
don't actually check that (see demote_page_list()); instead we wake up
kswapd of the slow node.  The intended behavior is something
like,

  DRAM -> PMEM -> disk

Best Regards,
Huang, Ying
Yu Zhao Feb. 24, 2022, 1:34 a.m. UTC | #9
On Wed, Feb 23, 2022 at 5:59 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Hi, Yu,
> >>
> >> Yu Zhao <yuzhao@google.com> writes:
> >>
> >> > To avoid confusions, the terms "promotion" and "demotion" will be
> >> > applied to the multigenerational LRU, as a new convention; the terms
> >> > "activation" and "deactivation" will be applied to the active/inactive
> >> > LRU, as usual.
> >>
> >> In the memory tiering related commits and patchset, for example as follows,
> >>
> >> commit 668e4147d8850df32ca41e28f52c146025ca45c6
> >> Author: Yang Shi <yang.shi@linux.alibaba.com>
> >> Date:   Thu Sep 2 14:59:19 2021 -0700
> >>
> >>     mm/vmscan: add page demotion counter
> >>
> >> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
> >>
> >> "demote" and "promote" is used for migrating pages between different
> >> types of memory.  Is it better for us to avoid overloading these words
> >> too much to avoid the possible confusion?
> >
> > Given that LRU and migration are usually different contexts, I think
> > we'd be fine, unless we want a third pair of terms.
>
> This is true before memory tiering is introduced.  In systems with
> multiple types memory (called memory tiering), LRU is used to identify
> pages to be migrated to the slow memory node.  Please take a look at
> can_demote(), which is called in shrink_page_list().

This sounds like two clearly separate contexts to me: promotion/demotion
(moving between generations) while pages are on the LRU, versus
promotion/demotion (migration between nodes) after pages are taken off
the LRU.

Note that promotion/demotion are not used in function names. They are
used to describe how MGLRU works, in comparison with the
active/inactive LRU. Memory tiering is not within this context.

> >> > +static int get_swappiness(struct mem_cgroup *memcg)
> >> > +{
> >> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
> >> > +            mem_cgroup_swappiness(memcg) : 0;
> >> > +}
> >>
> >> After we introduced demotion support in Linux kernel.  The anonymous
> >> pages in the fast memory node could be demoted to the slow memory node
> >> via the page reclaiming mechanism as in the following commit.  Can you
> >> consider that too?
> >
> > Sure. How do I check whether there is still space on the slow node?
>
> You can always check the watermark of the slow node.  But now, we
> actually don't check that (as in demote_page_list()), instead we will
> wake up kswapd of the slow node.  The intended behavior is something
> like,
>
>   DRAM -> PMEM -> disk

I'll look into this later -- for now, it's a low priority because
there isn't much demand. I'll bump it up if anybody is interested in
giving it a try. Meanwhile, please feel free to cook up something if
you are interested.
Huang, Ying Feb. 24, 2022, 3:31 a.m. UTC | #10
Yu Zhao <yuzhao@google.com> writes:

> On Wed, Feb 23, 2022 at 5:59 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yu Zhao <yuzhao@google.com> writes:
>>
>> > On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Hi, Yu,
>> >>
>> >> Yu Zhao <yuzhao@google.com> writes:
>> >>
>> >> > To avoid confusions, the terms "promotion" and "demotion" will be
>> >> > applied to the multigenerational LRU, as a new convention; the terms
>> >> > "activation" and "deactivation" will be applied to the active/inactive
>> >> > LRU, as usual.
>> >>
>> >> In the memory tiering related commits and patchset, for example as follows,
>> >>
>> >> commit 668e4147d8850df32ca41e28f52c146025ca45c6
>> >> Author: Yang Shi <yang.shi@linux.alibaba.com>
>> >> Date:   Thu Sep 2 14:59:19 2021 -0700
>> >>
>> >>     mm/vmscan: add page demotion counter
>> >>
>> >> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
>> >>
>> >> "demote" and "promote" is used for migrating pages between different
>> >> types of memory.  Is it better for us to avoid overloading these words
>> >> too much to avoid the possible confusion?
>> >
>> > Given that LRU and migration are usually different contexts, I think
>> > we'd be fine, unless we want a third pair of terms.
>>
>> This is true before memory tiering is introduced.  In systems with
>> multiple types memory (called memory tiering), LRU is used to identify
>> pages to be migrated to the slow memory node.  Please take a look at
>> can_demote(), which is called in shrink_page_list().
>
> This sounds clearly two contexts to me. Promotion/demotion (move
> between generations) while pages are on LRU; or promotion/demotion
> (migration between nodes) after pages are taken off LRU.
>
> Note that promotion/demotion are not used in function names. They are
> used to describe how MGLRU works, in comparison with the
> active/inactive LRU. Memory tiering is not within this context.

Because we have used pgdemote_* in /proc/vmstat and "demotion_enabled" in
/sys/kernel/mm/numa, and will use pgpromote_* in /proc/vmstat, it seems
better to avoid using promote/demote directly for MGLRU in the ABI.  A
possible solution is to use "mglru" and "promote/demote" together (such
as "mglru_promote_*") when it is needed?

>> >> > +static int get_swappiness(struct mem_cgroup *memcg)
>> >> > +{
>> >> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
>> >> > +            mem_cgroup_swappiness(memcg) : 0;
>> >> > +}
>> >>
>> >> After we introduced demotion support in Linux kernel.  The anonymous
>> >> pages in the fast memory node could be demoted to the slow memory node
>> >> via the page reclaiming mechanism as in the following commit.  Can you
>> >> consider that too?
>> >
>> > Sure. How do I check whether there is still space on the slow node?
>>
>> You can always check the watermark of the slow node.  But now, we
>> actually don't check that (as in demote_page_list()), instead we will
>> wake up kswapd of the slow node.  The intended behavior is something
>> like,
>>
>>   DRAM -> PMEM -> disk
>
> I'll look into this later -- for now, it's a low priority because
> there isn't much demand. I'll bump it up if anybody is interested in
> giving it a try. Meanwhile, please feel free to cook up something if
> you are interested.

When we introduce a new feature, we shouldn't break an existing one.
That is, we shouldn't introduce regressions.  I think that is a rule?

If my understanding is correct, MGLRU will not scan the anonymous page
list even if there is a demotion target for the node.  This breaks the
demotion feature in the upstream kernel.  Right?
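
For example, the anon scan gate could take demotion into account, along
the lines of the untested sketch below (can_demote() is the existing
helper from the demotion work; the rest of the plumbing is hand-waved):

  static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
  {
          struct mem_cgroup *memcg = lruvec_memcg(lruvec);
          struct pglist_data *pgdat = lruvec_pgdat(lruvec);

          /* anon is reclaimable if it can be swapped *or* demoted */
          if (!can_demote(pgdat->node_id, sc) &&
              mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
                  return 0;

          return mem_cgroup_swappiness(memcg);
  }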

It's a new feature to check whether there is still space on the slow
node.  We can look at that later.

Best Regards,
Huang, Ying
Yu Zhao Feb. 24, 2022, 4:09 a.m. UTC | #11
On Wed, Feb 23, 2022 at 8:32 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > On Wed, Feb 23, 2022 at 5:59 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yu Zhao <yuzhao@google.com> writes:
> >>
> >> > On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Hi, Yu,
> >> >>
> >> >> Yu Zhao <yuzhao@google.com> writes:
> >> >>
> >> >> > To avoid confusions, the terms "promotion" and "demotion" will be
> >> >> > applied to the multigenerational LRU, as a new convention; the terms
> >> >> > "activation" and "deactivation" will be applied to the active/inactive
> >> >> > LRU, as usual.
> >> >>
> >> >> In the memory tiering related commits and patchset, for example as follows,
> >> >>
> >> >> commit 668e4147d8850df32ca41e28f52c146025ca45c6
> >> >> Author: Yang Shi <yang.shi@linux.alibaba.com>
> >> >> Date:   Thu Sep 2 14:59:19 2021 -0700
> >> >>
> >> >>     mm/vmscan: add page demotion counter
> >> >>
> >> >> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
> >> >>
> >> >> "demote" and "promote" is used for migrating pages between different
> >> >> types of memory.  Is it better for us to avoid overloading these words
> >> >> too much to avoid the possible confusion?
> >> >
> >> > Given that LRU and migration are usually different contexts, I think
> >> > we'd be fine, unless we want a third pair of terms.
> >>
> >> This is true before memory tiering is introduced.  In systems with
> >> multiple types memory (called memory tiering), LRU is used to identify
> >> pages to be migrated to the slow memory node.  Please take a look at
> >> can_demote(), which is called in shrink_page_list().
> >
> > This sounds clearly two contexts to me. Promotion/demotion (move
> > between generations) while pages are on LRU; or promotion/demotion
> > (migration between nodes) after pages are taken off LRU.
> >
> > Note that promotion/demotion are not used in function names. They are
> > used to describe how MGLRU works, in comparison with the
> > active/inactive LRU. Memory tiering is not within this context.
>
> Because we have used pgdemote_* in /proc/vmstat, "demotion_enabled" in
> /sys/kernel/mm/numa, and will use pgpromote_* in /proc/vmstat.  It seems
> better to avoid to use promote/demote directly for MGLRU in ABI.  A
> possible solution is to use "mglru" and "promote/demote" together (such
> as "mglru_promote_*" when it is needed?

*If* it is needed. Currently there are no such plans.

> >> >> > +static int get_swappiness(struct mem_cgroup *memcg)
> >> >> > +{
> >> >> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
> >> >> > +            mem_cgroup_swappiness(memcg) : 0;
> >> >> > +}
> >> >>
> >> >> After we introduced demotion support in Linux kernel.  The anonymous
> >> >> pages in the fast memory node could be demoted to the slow memory node
> >> >> via the page reclaiming mechanism as in the following commit.  Can you
> >> >> consider that too?
> >> >
> >> > Sure. How do I check whether there is still space on the slow node?
> >>
> >> You can always check the watermark of the slow node.  But now, we
> >> actually don't check that (as in demote_page_list()), instead we will
> >> wake up kswapd of the slow node.  The intended behavior is something
> >> like,
> >>
> >>   DRAM -> PMEM -> disk
> >
> > I'll look into this later -- for now, it's a low priority because
> > there isn't much demand. I'll bump it up if anybody is interested in
> > giving it a try. Meanwhile, please feel free to cook up something if
> > you are interested.
>
> When we introduce a new feature, we shouldn't break an existing one.
> That is, not introducing regression.  I think that it is a rule?
>
> If my understanding were correct, MGLRU will ignore to scan anonymous
> page list even if there's demotion target for the node.  This breaks the
> demotion feature in the upstream kernel.  Right?

I'm not saying this shouldn't be fixed. I'm saying it's a low priority
until somebody is interested in using/testing it (or making it work).

Regarding regressions, I'm sure MGLRU *will* regress many workloads.
Its goal is to improve the majority of use cases, i.e., total net
gain. Trying to improve everything is methodically wrong because the
problem space is near infinite but the resource is limited. So we have
to prioritize major use cases over minor ones. The bottom line is
users have a choice not to use MGLRU.

> It's a new feature to check whether there is still space on the slow
> node.  We can look at that later.

SGTM.
Huang, Ying Feb. 24, 2022, 5:27 a.m. UTC | #12
Yu Zhao <yuzhao@google.com> writes:

> On Wed, Feb 23, 2022 at 8:32 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yu Zhao <yuzhao@google.com> writes:
>>
>> > On Wed, Feb 23, 2022 at 5:59 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yu Zhao <yuzhao@google.com> writes:
>> >>
>> >> > On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Hi, Yu,
>> >> >>
>> >> >> Yu Zhao <yuzhao@google.com> writes:
>> >> >>
>> >> >> > To avoid confusions, the terms "promotion" and "demotion" will be
>> >> >> > applied to the multigenerational LRU, as a new convention; the terms
>> >> >> > "activation" and "deactivation" will be applied to the active/inactive
>> >> >> > LRU, as usual.
>> >> >>
>> >> >> In the memory tiering related commits and patchset, for example as follows,
>> >> >>
>> >> >> commit 668e4147d8850df32ca41e28f52c146025ca45c6
>> >> >> Author: Yang Shi <yang.shi@linux.alibaba.com>
>> >> >> Date:   Thu Sep 2 14:59:19 2021 -0700
>> >> >>
>> >> >>     mm/vmscan: add page demotion counter
>> >> >>
>> >> >> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
>> >> >>
>> >> >> "demote" and "promote" is used for migrating pages between different
>> >> >> types of memory.  Is it better for us to avoid overloading these words
>> >> >> too much to avoid the possible confusion?
>> >> >
>> >> > Given that LRU and migration are usually different contexts, I think
>> >> > we'd be fine, unless we want a third pair of terms.
>> >>
>> >> This is true before memory tiering is introduced.  In systems with
>> >> multiple types memory (called memory tiering), LRU is used to identify
>> >> pages to be migrated to the slow memory node.  Please take a look at
>> >> can_demote(), which is called in shrink_page_list().
>> >
>> > This sounds clearly two contexts to me. Promotion/demotion (move
>> > between generations) while pages are on LRU; or promotion/demotion
>> > (migration between nodes) after pages are taken off LRU.
>> >
>> > Note that promotion/demotion are not used in function names. They are
>> > used to describe how MGLRU works, in comparison with the
>> > active/inactive LRU. Memory tiering is not within this context.
>>
>> Because we have used pgdemote_* in /proc/vmstat, "demotion_enabled" in
>> /sys/kernel/mm/numa, and will use pgpromote_* in /proc/vmstat.  It seems
>> better to avoid to use promote/demote directly for MGLRU in ABI.  A
>> possible solution is to use "mglru" and "promote/demote" together (such
>> as "mglru_promote_*" when it is needed?
>
> *If* it is needed. Currently there are no such plans.

OK.

>> >> >> > +static int get_swappiness(struct mem_cgroup *memcg)
>> >> >> > +{
>> >> >> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
>> >> >> > +            mem_cgroup_swappiness(memcg) : 0;
>> >> >> > +}
>> >> >>
>> >> >> After we introduced demotion support in Linux kernel.  The anonymous
>> >> >> pages in the fast memory node could be demoted to the slow memory node
>> >> >> via the page reclaiming mechanism as in the following commit.  Can you
>> >> >> consider that too?
>> >> >
>> >> > Sure. How do I check whether there is still space on the slow node?
>> >>
>> >> You can always check the watermark of the slow node.  But now, we
>> >> actually don't check that (as in demote_page_list()), instead we will
>> >> wake up kswapd of the slow node.  The intended behavior is something
>> >> like,
>> >>
>> >>   DRAM -> PMEM -> disk
>> >
>> > I'll look into this later -- for now, it's a low priority because
>> > there isn't much demand. I'll bump it up if anybody is interested in
>> > giving it a try. Meanwhile, please feel free to cook up something if
>> > you are interested.
>>
>> When we introduce a new feature, we shouldn't break an existing one.
>> That is, not introducing regression.  I think that it is a rule?
>>
>> If my understanding were correct, MGLRU will ignore to scan anonymous
>> page list even if there's demotion target for the node.  This breaks the
>> demotion feature in the upstream kernel.  Right?
>
> I'm not saying this shouldn't be fixed. I'm saying it's a low priority
> until somebody is interested in using/testing it (or making it work).

We are interested in this feature and can help to test it.

> Regarding regressions, I'm sure MGLRU *will* regress many workloads.
> Its goal is to improve the majority of use cases, i.e., a total net
> gain. Trying to improve everything is methodologically wrong because
> the problem space is nearly infinite but the resources are limited. So
> we have to prioritize major use cases over minor ones. The bottom line
> is that users have a choice not to use MGLRU.

This is a functionality regression, not a performance regression.  Without
demotion support, some workloads will go OOM when DRAM is used up (while
PMEM isn't) if PMEM is onlined in the movable zone (as recommended).

>> It's a new feature to check whether there is still space on the slow
>> node.  We can look at that later.
>
> SGTM.

Best Regards,
Huang, Ying
Yu Zhao Feb. 24, 2022, 5:35 a.m. UTC | #13
On Wed, Feb 23, 2022 at 10:27 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > On Wed, Feb 23, 2022 at 8:32 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yu Zhao <yuzhao@google.com> writes:
> >>
> >> > On Wed, Feb 23, 2022 at 5:59 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yu Zhao <yuzhao@google.com> writes:
> >> >>
> >> >> > On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Hi, Yu,
> >> >> >>
> >> >> >> Yu Zhao <yuzhao@google.com> writes:
> >> >> >>
> >> >> >> > To avoid confusions, the terms "promotion" and "demotion" will be
> >> >> >> > applied to the multigenerational LRU, as a new convention; the terms
> >> >> >> > "activation" and "deactivation" will be applied to the active/inactive
> >> >> >> > LRU, as usual.
> >> >> >>
> >> >> >> In the memory tiering related commits and patchset, for example as follows,
> >> >> >>
> >> >> >> commit 668e4147d8850df32ca41e28f52c146025ca45c6
> >> >> >> Author: Yang Shi <yang.shi@linux.alibaba.com>
> >> >> >> Date:   Thu Sep 2 14:59:19 2021 -0700
> >> >> >>
> >> >> >>     mm/vmscan: add page demotion counter
> >> >> >>
> >> >> >> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
> >> >> >>
> >> >> >> "demote" and "promote" is used for migrating pages between different
> >> >> >> types of memory.  Is it better for us to avoid overloading these words
> >> >> >> too much to avoid the possible confusion?
> >> >> >
> >> >> > Given that LRU and migration are usually different contexts, I think
> >> >> > we'd be fine, unless we want a third pair of terms.
> >> >>
> >> >> This was true before memory tiering was introduced.  In systems with
> >> >> multiple types of memory (called memory tiering), the LRU is used to
> >> >> identify pages to be migrated to the slow memory node.  Please take a
> >> >> look at can_demote(), which is called in shrink_page_list().
> >> >
> >> > This clearly sounds like two contexts to me. Promotion/demotion (move
> >> > between generations) while pages are on LRU; or promotion/demotion
> >> > (migration between nodes) after pages are taken off LRU.
> >> >
> >> > Note that promotion/demotion are not used in function names. They are
> >> > used to describe how MGLRU works, in comparison with the
> >> > active/inactive LRU. Memory tiering is not within this context.
> >>
> >> Because we have used pgdemote_* in /proc/vmstat and "demotion_enabled" in
> >> /sys/kernel/mm/numa, and will use pgpromote_* in /proc/vmstat, it seems
> >> better to avoid using promote/demote directly for MGLRU in the ABI.  A
> >> possible solution is to use "mglru" and "promote/demote" together (such
> >> as "mglru_promote_*") when it is needed?
> >
> > *If* it is needed. Currently there are no such plans.
>
> OK.
>
> >> >> >> > +static int get_swappiness(struct mem_cgroup *memcg)
> >> >> >> > +{
> >> >> >> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
> >> >> >> > +            mem_cgroup_swappiness(memcg) : 0;
> >> >> >> > +}
> >> >> >>
> >> >> >> After we introduced demotion support in the Linux kernel, anonymous
> >> >> >> pages in the fast memory node can be demoted to the slow memory node
> >> >> >> via the page reclaim mechanism, as in the following commit.  Can you
> >> >> >> consider that too?
> >> >> >
> >> >> > Sure. How do I check whether there is still space on the slow node?
> >> >>
> >> >> You can always check the watermark of the slow node.  But currently we
> >> >> don't actually check that (see demote_page_list()); instead we wake up
> >> >> kswapd on the slow node.  The intended behavior is something like,
> >> >>
> >> >>   DRAM -> PMEM -> disk
> >> >
> >> > I'll look into this later -- for now, it's a low priority because
> >> > there isn't much demand. I'll bump it up if anybody is interested in
> >> > giving it a try. Meanwhile, please feel free to cook up something if
> >> > you are interested.
> >>
> >> When we introduce a new feature, we shouldn't break an existing one.
> >> That is, we shouldn't introduce regressions.  I think that is a rule?
> >>
> >> If my understanding is correct, MGLRU will skip scanning the anonymous
> >> page list even if there is a demotion target for the node.  This breaks
> >> the demotion feature in the upstream kernel.  Right?
> >
> > I'm not saying this shouldn't be fixed. I'm saying it's a low priority
> > until somebody is interested in using/testing it (or making it work).
>
> We are interested in this feature and can help to test it.

That's great. I'll make sure it works in the next version.
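
The swappiness check quoted above is the crux of the demotion concern: with
no swap space, get_swappiness() returns 0 and anon lists are never scanned,
even on a node whose anon pages could be demoted rather than swapped. As a
rough sketch of one possible direction (not part of this series; the changed
signature and the use of can_demote() here are assumptions for illustration
only), the helper could also consult the demotion path:

	/*
	 * Illustrative sketch, not part of this patch: keep anon scanning
	 * enabled when reclaim can demote to a lower memory tier, so lack of
	 * swap alone doesn't force swappiness to 0.
	 */
	static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
	{
		struct mem_cgroup *memcg = lruvec_memcg(lruvec);
		struct pglist_data *pgdat = lruvec_pgdat(lruvec);

		if (!can_demote(pgdat->node_id, sc) &&
		    mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
			return 0;

		return mem_cgroup_swappiness(memcg);
	}
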
diff mbox series

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 05dd33265740..b4b9886ba277 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -227,6 +227,7 @@  int overcommit_policy_handler(struct ctl_table *, int, void *, size_t *,
 #define PAGE_ALIGNED(addr)	IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
 
 #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
+#define lru_to_folio(head) (list_entry((head)->prev, struct folio, lru))
 
 void setup_initial_init_mm(void *start_code, void *end_code,
 			   void *end_data, void *brk);
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 46f4fde0299f..37c8a0ede4ff 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -109,6 +109,19 @@  static inline int lru_gen_from_seq(unsigned long seq)
 	return seq % MAX_NR_GENS;
 }
 
+static inline int lru_hist_from_seq(unsigned long seq)
+{
+	return seq % NR_HIST_GENS;
+}
+
+static inline int lru_tier_from_refs(int refs)
+{
+	VM_BUG_ON(refs > BIT(LRU_REFS_WIDTH));
+
+	/* see the comment on MAX_NR_TIERS */
+	return order_base_2(refs + 1);
+}
+
 static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
 {
 	unsigned long max_seq = lruvec->lrugen.max_seq;
@@ -237,6 +250,8 @@  static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio,
 		gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
 
 		new_flags &= ~LRU_GEN_MASK;
+		if ((new_flags & LRU_REFS_FLAGS) != LRU_REFS_FLAGS)
+			new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
 		/* for shrink_page_list() */
 		if (reclaiming)
 			new_flags &= ~(BIT(PG_referenced) | BIT(PG_reclaim));
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0f5e8a995781..3870dd9246a2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -335,6 +335,32 @@  struct lruvec;
 #define MIN_NR_GENS		2U
 #define MAX_NR_GENS		((unsigned int)CONFIG_NR_LRU_GENS)
 
+/*
+ * Each generation is divided into multiple tiers. Tiers represent different
+ * ranges of numbers of accesses through file descriptors. A page accessed N
+ * times through file descriptors is in tier order_base_2(N). A page in the
+ * first tier (N=0,1) is marked by PG_referenced unless it was faulted in
+ * through page tables or read ahead. A page in any other tier (N>1) is marked
+ * by PG_referenced and PG_workingset. Additional bits in folio->flags are
+ * required to support more than two tiers.
+ *
+ * In contrast to moving across generations which requires the LRU lock, moving
+ * across tiers only requires operations on folio->flags and therefore has a
+ * negligible cost in the buffered access path. In the eviction path,
+ * comparisons of refaulted/(evicted+promoted) from the first tier and the rest
+ * infer whether pages accessed multiple times through file descriptors are
+ * statistically hot and thus worth promoting.
+ */
+#define MAX_NR_TIERS		((unsigned int)CONFIG_TIERS_PER_GEN)
+#define LRU_REFS_FLAGS		(BIT(PG_referenced) | BIT(PG_workingset))
+
+/* whether to keep historical stats from evicted generations */
+#ifdef CONFIG_LRU_GEN_STATS
+#define NR_HIST_GENS		((unsigned int)CONFIG_NR_LRU_GENS)
+#else
+#define NR_HIST_GENS		1U
+#endif
+
 struct lru_gen_struct {
 	/* the aging increments the youngest generation number */
 	unsigned long max_seq;
@@ -346,6 +372,15 @@  struct lru_gen_struct {
 	struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
 	/* the sizes of the above lists */
 	unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* the exponential moving average of refaulted */
+	unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
+	/* the exponential moving average of evicted+promoted */
+	unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS];
+	/* the first tier doesn't need promotion, hence the minus one */
+	unsigned long promoted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1];
+	/* can be modified without holding the LRU lock */
+	atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
+	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 	/* whether the multigenerational LRU is enabled */
 	bool enabled;
 };
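
To make the tier arithmetic in the comment above concrete, the following
standalone sketch (illustrative only, not part of the patch; order_base_2()
is reimplemented in userspace here) prints which tier a page lands in after
N accesses through file descriptors:

	#include <stdio.h>

	/* userspace stand-in for the kernel's order_base_2(): ceil(log2(n)), 0 for n <= 1 */
	static unsigned int order_base_2(unsigned long n)
	{
		unsigned int order = 0;

		while ((1UL << order) < n)
			order++;
		return order;
	}

	/* a page accessed N times through file descriptors sits in tier order_base_2(N) */
	static unsigned int tier_of(unsigned int nr_accesses)
	{
		return nr_accesses < 2 ? 0 : order_base_2(nr_accesses);
	}

	int main(void)
	{
		/* N=0,1 -> tier 0; N=2 -> tier 1; N=3,4 -> tier 2; N=5..8 -> tier 3 */
		for (unsigned int n = 0; n <= 8; n++)
			printf("N=%u -> tier %u\n", n, tier_of(n));
		return 0;
	}
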
diff --git a/mm/Kconfig b/mm/Kconfig
index 3326ee3903f3..e899623d5df0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -892,6 +892,50 @@  config ANON_VMA_NAME
 	  area from being merged with adjacent virtual memory areas due to the
 	  difference in their name.
 
+# multigenerational LRU {
+config LRU_GEN
+	bool "Multigenerational LRU"
+	depends on MMU
+	# the following options can use up the spare bits in page flags
+	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
+	help
+	  A high performance LRU implementation for memory overcommit. See
+	  Documentation/admin-guide/mm/multigen_lru.rst and
+	  Documentation/vm/multigen_lru.rst for details.
+
+config NR_LRU_GENS
+	int "Max number of generations"
+	depends on LRU_GEN
+	range 4 31
+	default 4
+	help
+	  Do not increase this value unless you plan to use working set
+	  estimation and proactive reclaim to optimize job scheduling in data
+	  centers.
+
+	  This option uses order_base_2(N+1) bits in page flags.
+
+config TIERS_PER_GEN
+	int "Number of tiers per generation"
+	depends on LRU_GEN
+	range 2 4
+	default 4
+	help
+	  Do not decrease this value unless you run out of spare bits in page
+	  flags, i.e., you see the "Not enough bits in page flags" build error.
+
+	  This option uses N-2 bits in page flags.
+
+config LRU_GEN_STATS
+	bool "Full stats for debugging"
+	depends on LRU_GEN
+	help
+	  Do not enable this option unless you plan to look at historical stats
+	  from evicted generations for debugging purposes.
+
+	  This option has a per-memcg and per-node memory overhead.
+# }
+
 source "mm/damon/Kconfig"
 
 endmenu
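
As a usage note (not from the patch), a kernel .config fragment that enables
the feature with the defaults above would look something like:

	CONFIG_LRU_GEN=y
	CONFIG_NR_LRU_GENS=4
	CONFIG_TIERS_PER_GEN=4
	# CONFIG_LRU_GEN_STATS is not set
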
diff --git a/mm/swap.c b/mm/swap.c
index e2ef2acccc74..f5c0bcac8dcd 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -407,6 +407,43 @@  static void __lru_cache_activate_folio(struct folio *folio)
 	local_unlock(&lru_pvecs.lock);
 }
 
+#ifdef CONFIG_LRU_GEN
+static void folio_inc_refs(struct folio *folio)
+{
+	unsigned long refs;
+	unsigned long old_flags, new_flags;
+
+	if (folio_test_unevictable(folio))
+		return;
+
+	/* see the comment on MAX_NR_TIERS */
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+
+		if (!(new_flags & BIT(PG_referenced))) {
+			new_flags |= BIT(PG_referenced);
+			continue;
+		}
+
+		if (!(new_flags & BIT(PG_workingset))) {
+			new_flags |= BIT(PG_workingset);
+			continue;
+		}
+
+		refs = new_flags & LRU_REFS_MASK;
+		refs = min(refs + BIT(LRU_REFS_PGOFF), LRU_REFS_MASK);
+
+		new_flags &= ~LRU_REFS_MASK;
+		new_flags |= refs;
+	} while (new_flags != old_flags &&
+		 cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+}
+#else
+static void folio_inc_refs(struct folio *folio)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
 /*
  * Mark a page as having seen activity.
  *
@@ -419,6 +456,11 @@  static void __lru_cache_activate_folio(struct folio *folio)
  */
 void folio_mark_accessed(struct folio *folio)
 {
+	if (lru_gen_enabled()) {
+		folio_inc_refs(folio);
+		return;
+	}
+
 	if (!folio_test_referenced(folio)) {
 		folio_set_referenced(folio);
 	} else if (folio_test_unevictable(folio)) {
@@ -568,7 +610,7 @@  static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 
 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageActive(page) && !PageUnevictable(page)) {
+	if (!PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
 		int nr_pages = thp_nr_pages(page);
 
 		del_page_from_lru_list(page, lruvec);
@@ -682,7 +724,7 @@  void deactivate_file_page(struct page *page)
  */
 void deactivate_page(struct page *page)
 {
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+	if (PageLRU(page) && !PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
 		struct pagevec *pvec;
 
 		local_lock(&lru_pvecs.lock);
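
The access-counting scheme implemented by folio_inc_refs() above can be
summarized with a small standalone sketch (illustrative only; the fields
below are simplified stand-ins for PG_referenced, PG_workingset and the
LRU_REFS counter in folio->flags, and the cmpxchg loop is omitted):

	#include <stdio.h>

	struct fake_folio {
		int referenced;		/* stand-in for PG_referenced */
		int workingset;		/* stand-in for PG_workingset */
		unsigned int refs;	/* stand-in for the LRU_REFS counter */
	};

	#define REFS_MAX 3		/* saturation point of a 2-bit counter */

	static void inc_refs(struct fake_folio *f)
	{
		if (!f->referenced)		/* 1st access through a file descriptor */
			f->referenced = 1;
		else if (!f->workingset)	/* 2nd access */
			f->workingset = 1;
		else if (f->refs < REFS_MAX)	/* 3rd and later accesses, saturating */
			f->refs++;
	}

	int main(void)
	{
		struct fake_folio f = { 0 };

		for (int n = 1; n <= 6; n++) {
			inc_refs(&f);
			printf("access %d: referenced=%d workingset=%d refs=%u\n",
			       n, f.referenced, f.workingset, f.refs);
		}
		return 0;
	}
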
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d75a5738d1dc..5f0d92838712 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1285,9 +1285,11 @@  static int __remove_mapping(struct address_space *mapping, struct page *page,
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
-		mem_cgroup_swapout(page, swap);
+
+		/* get a shadow entry before mem_cgroup_swapout() clears folio_memcg() */
 		if (reclaimed && !mapping_exiting(mapping))
 			shadow = workingset_eviction(page, target_memcg);
+		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page, swap, shadow);
 		xa_unlock_irq(&mapping->i_pages);
 		put_swap_page(page, swap);
@@ -2721,6 +2723,9 @@  static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long file;
 	struct lruvec *target_lruvec;
 
+	if (lru_gen_enabled())
+		return;
+
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
 
 	/*
@@ -3042,15 +3047,47 @@  static bool can_age_anon_pages(struct pglist_data *pgdat,
 
 #ifdef CONFIG_LRU_GEN
 
+enum {
+	TYPE_ANON,
+	TYPE_FILE,
+};
+
 /******************************************************************************
  *                          shorthand helpers
  ******************************************************************************/
 
+#define DEFINE_MAX_SEQ(lruvec)						\
+	unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq)
+
+#define DEFINE_MIN_SEQ(lruvec)						\
+	unsigned long min_seq[ANON_AND_FILE] = {			\
+		READ_ONCE((lruvec)->lrugen.min_seq[TYPE_ANON]),		\
+		READ_ONCE((lruvec)->lrugen.min_seq[TYPE_FILE]),		\
+	}
+
 #define for_each_gen_type_zone(gen, type, zone)				\
 	for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)			\
 		for ((type) = 0; (type) < ANON_AND_FILE; (type)++)	\
 			for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
 
+static int folio_lru_gen(struct folio *folio)
+{
+	unsigned long flags = READ_ONCE(folio->flags);
+
+	return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+}
+
+static int folio_lru_tier(struct folio *folio)
+{
+	int refs;
+	unsigned long flags = READ_ONCE(folio->flags);
+
+	refs = (flags & LRU_REFS_FLAGS) == LRU_REFS_FLAGS ?
+	       ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + 1 : 0;
+
+	return lru_tier_from_refs(refs);
+}
+
 static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
 {
 	struct pglist_data *pgdat = NODE_DATA(nid);
@@ -3069,6 +3106,728 @@  static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
 	return pgdat ? &pgdat->__lruvec : NULL;
 }
 
+static int get_swappiness(struct mem_cgroup *memcg)
+{
+	return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
+	       mem_cgroup_swappiness(memcg) : 0;
+}
+
+static int get_nr_gens(struct lruvec *lruvec, int type)
+{
+	return lruvec->lrugen.max_seq - lruvec->lrugen.min_seq[type] + 1;
+}
+
+static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
+{
+	/*
+	 * Ideally anon and file min_seq should be in sync. But swapping isn't
+	 * as reliable as dropping clean file pages, e.g., out of swap space. So
+	 * allow file min_seq to advance and leave anon min_seq behind, but not
+	 * the other way around.
+	 */
+	return get_nr_gens(lruvec, TYPE_FILE) >= MIN_NR_GENS &&
+	       get_nr_gens(lruvec, TYPE_FILE) <= get_nr_gens(lruvec, TYPE_ANON) &&
+	       get_nr_gens(lruvec, TYPE_ANON) <= MAX_NR_GENS;
+}
+
+/******************************************************************************
+ *                          refault feedback loop
+ ******************************************************************************/
+
+/*
+ * A feedback loop based on Proportional-Integral-Derivative (PID) controller.
+ *
+ * The P term is refaulted/(evicted+promoted) from a tier in the generation
+ * currently being evicted; the I term is the exponential moving average of the
+ * P term over the generations previously evicted, using the smoothing factor
+ * 1/2; the D term isn't supported.
+ *
+ * The setpoint (SP) is always the first tier of one type; the process variable
+ * (PV) is either any tier of the other type or any other tier of the same
+ * type.
+ *
+ * The error is the difference between the SP and the PV; the correction is
+ * to turn off promotion when SP>PV or to turn on promotion when SP<PV.
+ *
+ * For future optimizations:
+ * 1) The D term may discount the other two terms over time so that long-lived
+ *    generations can resist stale information.
+ */
+struct ctrl_pos {
+	unsigned long refaulted;
+	unsigned long total;
+	int gain;
+};
+
+static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
+			  struct ctrl_pos *pos)
+{
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	int hist = lru_hist_from_seq(lrugen->min_seq[type]);
+
+	pos->refaulted = lrugen->avg_refaulted[type][tier] +
+			 atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+	pos->total = lrugen->avg_total[type][tier] +
+		     atomic_long_read(&lrugen->evicted[hist][type][tier]);
+	if (tier)
+		pos->total += lrugen->promoted[hist][type][tier - 1];
+	pos->gain = gain;
+}
+
+static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)
+{
+	int hist, tier;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	bool clear = carryover ? NR_HIST_GENS == 1 : NR_HIST_GENS > 1;
+	unsigned long seq = carryover ? lrugen->min_seq[type] : lrugen->max_seq + 1;
+
+	lockdep_assert_held(&lruvec->lru_lock);
+
+	if (!carryover && !clear)
+		return;
+
+	hist = lru_hist_from_seq(seq);
+
+	for (tier = 0; tier < MAX_NR_TIERS; tier++) {
+		if (carryover) {
+			unsigned long sum;
+
+			sum = lrugen->avg_refaulted[type][tier] +
+			      atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+			WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2);
+
+			sum = lrugen->avg_total[type][tier] +
+			      atomic_long_read(&lrugen->evicted[hist][type][tier]);
+			if (tier)
+				sum += lrugen->promoted[hist][type][tier - 1];
+			WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2);
+		}
+
+		if (clear) {
+			atomic_long_set(&lrugen->refaulted[hist][type][tier], 0);
+			atomic_long_set(&lrugen->evicted[hist][type][tier], 0);
+			if (tier)
+				WRITE_ONCE(lrugen->promoted[hist][type][tier - 1], 0);
+		}
+	}
+}
+
+static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
+{
+	/*
+	 * Return true if the PV has a limited number of refaults or a lower
+	 * refaulted/total than the SP.
+	 */
+	return pv->refaulted < MIN_LRU_BATCH ||
+	       pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <=
+	       (sp->refaulted + 1) * pv->total * pv->gain;
+}
+
+/******************************************************************************
+ *                          the aging
+ ******************************************************************************/
+
+static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+	unsigned long old_flags, new_flags;
+	int type = folio_is_file_lru(folio);
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+		VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
+
+		new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+		new_gen = (old_gen + 1) % MAX_NR_GENS;
+
+		new_flags &= ~LRU_GEN_MASK;
+		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
+		new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
+		/* for folio_end_writeback() */
+		if (reclaiming)
+			new_flags |= BIT(PG_reclaim);
+	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+	lru_gen_balance_size(lruvec, folio, old_gen, new_gen);
+
+	return new_gen;
+}
+
+static void inc_min_seq(struct lruvec *lruvec)
+{
+	int type;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
+			continue;
+
+		reset_ctrl_pos(lruvec, type, true);
+		WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
+	}
+}
+
+static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
+{
+	int gen, type, zone;
+	bool success = false;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	DEFINE_MIN_SEQ(lruvec);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		while (lrugen->max_seq >= min_seq[type] + MIN_NR_GENS) {
+			gen = lru_gen_from_seq(min_seq[type]);
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+				if (!list_empty(&lrugen->lists[gen][type][zone]))
+					goto next;
+			}
+
+			min_seq[type]++;
+		}
+next:
+		;
+	}
+
+	/* see the comment in seq_is_valid() */
+	if (can_swap) {
+		min_seq[TYPE_ANON] = min(min_seq[TYPE_ANON], min_seq[TYPE_FILE]);
+		min_seq[TYPE_FILE] = max(min_seq[TYPE_ANON], lrugen->min_seq[TYPE_FILE]);
+	}
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		if (min_seq[type] == lrugen->min_seq[type])
+			continue;
+
+		reset_ctrl_pos(lruvec, type, true);
+		WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
+		success = true;
+	}
+
+	return success;
+}
+
+static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
+{
+	int prev, next;
+	int type, zone;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	if (max_seq != lrugen->max_seq)
+		goto unlock;
+
+	inc_min_seq(lruvec);
+
+	/* update the active/inactive LRU sizes for compatibility */
+	prev = lru_gen_from_seq(lrugen->max_seq - 1);
+	next = lru_gen_from_seq(lrugen->max_seq + 1);
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+			enum lru_list lru = type * LRU_INACTIVE_FILE;
+			long delta = lrugen->nr_pages[prev][type][zone] -
+				     lrugen->nr_pages[next][type][zone];
+
+			if (!delta)
+				continue;
+
+			lru_gen_update_size(lruvec, lru, zone, delta);
+			lru_gen_update_size(lruvec, lru + LRU_ACTIVE, zone, -delta);
+		}
+	}
+
+	for (type = 0; type < ANON_AND_FILE; type++)
+		reset_ctrl_pos(lruvec, type, false);
+
+	WRITE_ONCE(lrugen->timestamps[next], jiffies);
+	/* make sure preceding modifications appear */
+	smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
+unlock:
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
+			     unsigned long *min_seq, bool can_swap, bool *need_aging)
+{
+	int gen, type, zone;
+	long total = 0;
+	long young = 0;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		unsigned long seq;
+
+		for (seq = min_seq[type]; seq <= max_seq; seq++) {
+			long size = 0;
+
+			gen = lru_gen_from_seq(seq);
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++)
+				size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
+
+			total += size;
+			if (seq == max_seq)
+				young += size;
+		}
+	}
+
+	/* try to spread pages out across MIN_NR_GENS+1 generations */
+	if (max_seq < min_seq[TYPE_FILE] + MIN_NR_GENS)
+		*need_aging = true;
+	else if (max_seq > min_seq[TYPE_FILE] + MIN_NR_GENS)
+		*need_aging = false;
+	else
+		*need_aging = young * MIN_NR_GENS > total;
+
+	return total > 0 ? total : 0;
+}
+
+static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+	bool need_aging;
+	long nr_to_scan;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	int swappiness = get_swappiness(memcg);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	mem_cgroup_calculate_protection(NULL, memcg);
+
+	if (mem_cgroup_below_min(memcg))
+		return;
+
+	nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, swappiness, &need_aging);
+	if (!nr_to_scan)
+		return;
+
+	nr_to_scan >>= sc->priority;
+
+	if (!mem_cgroup_online(memcg))
+		nr_to_scan++;
+
+	if (nr_to_scan && need_aging && (!mem_cgroup_below_low(memcg) || sc->memcg_low_reclaim))
+		inc_max_seq(lruvec, max_seq);
+}
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+	struct mem_cgroup *memcg;
+
+	VM_BUG_ON(!current_is_kswapd());
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+		age_lruvec(lruvec, sc);
+
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+}
+
+/******************************************************************************
+ *                          the eviction
+ ******************************************************************************/
+
+static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
+{
+	bool success;
+	int gen = folio_lru_gen(folio);
+	int type = folio_is_file_lru(folio);
+	int zone = folio_zonenum(folio);
+	int tier = folio_lru_tier(folio);
+	int delta = folio_nr_pages(folio);
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON_FOLIO(gen >= MAX_NR_GENS, folio);
+
+	if (!folio_evictable(folio)) {
+		success = lru_gen_del_folio(lruvec, folio, true);
+		VM_BUG_ON_FOLIO(!success, folio);
+		folio_set_unevictable(folio);
+		lruvec_add_folio(lruvec, folio);
+		__count_vm_events(UNEVICTABLE_PGCULLED, delta);
+		return true;
+	}
+
+	if (type && folio_test_anon(folio) && folio_test_dirty(folio)) {
+		success = lru_gen_del_folio(lruvec, folio, true);
+		VM_BUG_ON_FOLIO(!success, folio);
+		folio_set_swapbacked(folio);
+		lruvec_add_folio_tail(lruvec, folio);
+		return true;
+	}
+
+	if (tier > tier_idx) {
+		int hist = lru_hist_from_seq(lrugen->min_seq[type]);
+
+		gen = folio_inc_gen(lruvec, folio, false);
+		list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
+
+		WRITE_ONCE(lrugen->promoted[hist][type][tier - 1],
+			   lrugen->promoted[hist][type][tier - 1] + delta);
+		__mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+		return true;
+	}
+
+	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
+	    (type && folio_test_dirty(folio))) {
+		gen = folio_inc_gen(lruvec, folio, true);
+		list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
+		return true;
+	}
+
+	return false;
+}
+
+static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc)
+{
+	bool success;
+
+	if (!sc->may_unmap && folio_mapped(folio))
+		return false;
+
+	if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) &&
+	    (folio_test_dirty(folio) ||
+	     (folio_test_anon(folio) && !folio_test_swapcache(folio))))
+		return false;
+
+	if (!folio_try_get(folio))
+		return false;
+
+	if (!folio_test_clear_lru(folio)) {
+		folio_put(folio);
+		return false;
+	}
+
+	success = lru_gen_del_folio(lruvec, folio, true);
+	VM_BUG_ON_FOLIO(!success, folio);
+
+	return true;
+}
+
+static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
+		       int type, int tier, struct list_head *list)
+{
+	int gen, zone;
+	enum vm_event_item item;
+	int sorted = 0;
+	int scanned = 0;
+	int isolated = 0;
+	int remaining = MAX_LRU_BATCH;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	VM_BUG_ON(!list_empty(list));
+
+	if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
+		return 0;
+
+	gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	for (zone = sc->reclaim_idx; zone >= 0; zone--) {
+		LIST_HEAD(moved);
+		int skipped = 0;
+		struct list_head *head = &lrugen->lists[gen][type][zone];
+
+		while (!list_empty(head)) {
+			struct folio *folio = lru_to_folio(head);
+			int delta = folio_nr_pages(folio);
+
+			VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+			VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
+			VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
+			VM_BUG_ON_FOLIO(folio_zonenum(folio) != zone, folio);
+
+			scanned += delta;
+
+			if (sort_folio(lruvec, folio, tier))
+				sorted += delta;
+			else if (isolate_folio(lruvec, folio, sc)) {
+				list_add(&folio->lru, list);
+				isolated += delta;
+			} else {
+				list_move(&folio->lru, &moved);
+				skipped += delta;
+			}
+
+			if (!--remaining || max(isolated, skipped) >= MIN_LRU_BATCH)
+				break;
+		}
+
+		if (skipped) {
+			list_splice(&moved, head);
+			__count_zid_vm_events(PGSCAN_SKIP, zone, skipped);
+		}
+
+		if (!remaining || isolated >= MIN_LRU_BATCH)
+			break;
+	}
+
+	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
+	if (!cgroup_reclaim(sc)) {
+		__count_vm_events(item, isolated);
+		__count_vm_events(PGREFILL, sorted);
+	}
+	__count_memcg_events(memcg, item, isolated);
+	__count_memcg_events(memcg, PGREFILL, sorted);
+	__count_vm_events(PGSCAN_ANON + type, isolated);
+
+	/*
+	 * There might not be eligible pages due to reclaim_idx, may_unmap and
+	 * may_writepage. Check the remaining to prevent livelock if there is no
+	 * progress.
+	 */
+	return isolated || !remaining ? scanned : 0;
+}
+
+static int get_tier_idx(struct lruvec *lruvec, int type)
+{
+	int tier;
+	struct ctrl_pos sp, pv;
+
+	/*
+	 * To leave a margin for fluctuations, use a larger gain factor (1:2).
+	 * This value is chosen because any other tier would have at least twice
+	 * as many refaults as the first tier.
+	 */
+	read_ctrl_pos(lruvec, type, 0, 1, &sp);
+	for (tier = 1; tier < MAX_NR_TIERS; tier++) {
+		read_ctrl_pos(lruvec, type, tier, 2, &pv);
+		if (!positive_ctrl_err(&sp, &pv))
+			break;
+	}
+
+	return tier - 1;
+}
+
+static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx)
+{
+	int type, tier;
+	struct ctrl_pos sp, pv;
+	int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness };
+
+	/*
+	 * Compare the first tier of anon with that of file to determine which
+	 * type to scan. Also need to compare other tiers of the selected type
+	 * with the first tier of the other type to determine the last tier (of
+	 * the selected type) to evict.
+	 */
+	read_ctrl_pos(lruvec, TYPE_ANON, 0, gain[TYPE_ANON], &sp);
+	read_ctrl_pos(lruvec, TYPE_FILE, 0, gain[TYPE_FILE], &pv);
+	type = positive_ctrl_err(&sp, &pv);
+
+	read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp);
+	for (tier = 1; tier < MAX_NR_TIERS; tier++) {
+		read_ctrl_pos(lruvec, type, tier, gain[type], &pv);
+		if (!positive_ctrl_err(&sp, &pv))
+			break;
+	}
+
+	*tier_idx = tier - 1;
+
+	return type;
+}
+
+static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
+			  int *type_scanned, struct list_head *list)
+{
+	int i;
+	int type;
+	int scanned;
+	int tier = -1;
+	DEFINE_MIN_SEQ(lruvec);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	/*
+	 * Try to make the obvious choice first. When anon and file are both
+	 * available from the same generation, interpret swappiness 1 as file
+	 * first and 200 as anon first.
+	 */
+	if (!swappiness)
+		type = TYPE_FILE;
+	else if (min_seq[TYPE_ANON] < min_seq[TYPE_FILE])
+		type = TYPE_ANON;
+	else if (swappiness == 1)
+		type = TYPE_FILE;
+	else if (swappiness == 200)
+		type = TYPE_ANON;
+	else
+		type = get_type_to_scan(lruvec, swappiness, &tier);
+
+	for (i = !swappiness; i < ANON_AND_FILE; i++) {
+		if (tier < 0)
+			tier = get_tier_idx(lruvec, type);
+
+		scanned = scan_folios(lruvec, sc, type, tier, list);
+		if (scanned)
+			break;
+
+		type = !type;
+		tier = -1;
+	}
+
+	*type_scanned = type;
+
+	return scanned;
+}
+
+static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
+{
+	int type;
+	int scanned;
+	int reclaimed;
+	LIST_HEAD(list);
+	struct folio *folio;
+	enum vm_event_item item;
+	struct reclaim_stat stat;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	scanned = isolate_folios(lruvec, sc, swappiness, &type, &list);
+
+	if (try_to_inc_min_seq(lruvec, swappiness))
+		scanned++;
+
+	if (get_nr_gens(lruvec, TYPE_FILE) == MIN_NR_GENS)
+		scanned = 0;
+
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	if (list_empty(&list))
+		return scanned;
+
+	reclaimed = shrink_page_list(&list, pgdat, sc, &stat, false);
+
+	/*
+	 * To avoid livelock, don't add rejected pages back to the same lists
+	 * they were isolated from.
+	 */
+	list_for_each_entry(folio, &list, lru) {
+		if ((folio_is_file_lru(folio) || folio_test_swapcache(folio)) &&
+		    (!folio_test_reclaim(folio) ||
+		     !(folio_test_dirty(folio) || folio_test_writeback(folio))))
+			folio_set_active(folio);
+
+		folio_clear_referenced(folio);
+		folio_clear_workingset(folio);
+	}
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	move_pages_to_lru(lruvec, &list);
+
+	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
+	if (!cgroup_reclaim(sc))
+		__count_vm_events(item, reclaimed);
+	__count_memcg_events(memcg, item, reclaimed);
+	__count_vm_events(PGSTEAL_ANON + type, reclaimed);
+
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	mem_cgroup_uncharge_list(&list);
+	free_unref_page_list(&list);
+
+	sc->nr_reclaimed += reclaimed;
+
+	return scanned;
+}
+
+static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool can_swap)
+{
+	bool need_aging;
+	long nr_to_scan;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	if (mem_cgroup_below_min(memcg) ||
+	    (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim))
+		return 0;
+
+	nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, can_swap, &need_aging);
+	if (!nr_to_scan)
+		return 0;
+
+	/* reset the priority if the target has been met */
+	nr_to_scan >>= sc->nr_reclaimed < sc->nr_to_reclaim ? sc->priority : DEF_PRIORITY;
+
+	if (!mem_cgroup_online(memcg))
+		nr_to_scan++;
+
+	if (!nr_to_scan)
+		return 0;
+
+	if (!need_aging)
+		return nr_to_scan;
+
+	/* leave the work to lru_gen_age_node() */
+	if (current_is_kswapd())
+		return 0;
+
+	/* try other memcgs before going to the aging path */
+	if (!cgroup_reclaim(sc) && !sc->force_deactivate) {
+		sc->skipped_deactivate = true;
+		return 0;
+	}
+
+	inc_max_seq(lruvec, max_seq);
+
+	return nr_to_scan;
+}
+
+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct blk_plug plug;
+	long scanned = 0;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	lru_add_drain();
+
+	blk_start_plug(&plug);
+
+	while (true) {
+		int delta;
+		int swappiness;
+		long nr_to_scan;
+
+		if (sc->may_swap)
+			swappiness = get_swappiness(memcg);
+		else if (!cgroup_reclaim(sc) && get_swappiness(memcg))
+			swappiness = 1;
+		else
+			swappiness = 0;
+
+		nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
+		if (!nr_to_scan)
+			break;
+
+		delta = evict_folios(lruvec, sc, swappiness);
+		if (!delta)
+			break;
+
+		scanned += delta;
+		if (scanned >= nr_to_scan)
+			break;
+
+		cond_resched();
+	}
+
+	blk_finish_plug(&plug);
+}
+
 /******************************************************************************
  *                          initialization
  ******************************************************************************/
@@ -3123,6 +3882,16 @@  static int __init init_lru_gen(void)
 };
 late_initcall(init_lru_gen);
 
+#else
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+}
+
+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+}
+
 #endif /* CONFIG_LRU_GEN */
 
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
@@ -3136,6 +3905,11 @@  static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	struct blk_plug plug;
 	bool scan_adjusted;
 
+	if (lru_gen_enabled()) {
+		lru_gen_shrink_lruvec(lruvec, sc);
+		return;
+	}
+
 	get_scan_count(lruvec, sc, nr);
 
 	/* Record the original scan target for proportional adjustments later */
@@ -3640,6 +4414,9 @@  static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
 	struct lruvec *target_lruvec;
 	unsigned long refaults;
 
+	if (lru_gen_enabled())
+		return;
+
 	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
 	target_lruvec->refaults[0] = refaults;
@@ -4010,6 +4787,11 @@  static void age_active_anon(struct pglist_data *pgdat,
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
+	if (lru_gen_enabled()) {
+		lru_gen_age_node(pgdat, sc);
+		return;
+	}
+
 	if (!can_age_anon_pages(pgdat, sc))
 		return;
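
To see the refault feedback loop defined earlier in the mm/vmscan.c changes
in action, the following standalone sketch (illustrative only, not part of
the patch; the example numbers and the MIN_LRU_BATCH value are stand-ins)
runs the same comparison as positive_ctrl_err(). A tier whose gain-weighted
refault ratio exceeds that of the first tier is the kind of tier
sort_folio() would promote to the next generation:

	#include <stdio.h>
	#include <stdbool.h>

	#define MIN_LRU_BATCH 64	/* stand-in for the kernel constant */

	struct ctrl_pos {
		unsigned long refaulted;
		unsigned long total;
		int gain;
	};

	/* same comparison as positive_ctrl_err(): true when the PV refaults no
	 * more than the gain-weighted baseline (SP), i.e., isn't worth protecting */
	static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
	{
		return pv->refaulted < MIN_LRU_BATCH ||
		       pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <=
		       (sp->refaulted + 1) * pv->total * pv->gain;
	}

	int main(void)
	{
		/* SP: the first tier with gain 1; PV: a higher tier with gain 2 */
		struct ctrl_pos sp = { .refaulted = 100, .total = 1000, .gain = 1 };
		struct ctrl_pos hot = { .refaulted = 900, .total = 1000, .gain = 2 };
		struct ctrl_pos cold = { .refaulted = 120, .total = 1000, .gain = 2 };

		printf("hot tier:  %s\n", positive_ctrl_err(&sp, &hot) ?
		       "evict with the first tier" : "protect (promote to min_seq+1)");
		printf("cold tier: %s\n", positive_ctrl_err(&sp, &cold) ?
		       "evict with the first tier" : "protect (promote to min_seq+1)");
		return 0;
	}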
 
diff --git a/mm/workingset.c b/mm/workingset.c
index 8c03afe1d67c..443343a3f3e3 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -187,7 +187,6 @@  static unsigned int bucket_order __read_mostly;
 static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
 			 bool workingset)
 {
-	eviction >>= bucket_order;
 	eviction &= EVICTION_MASK;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
@@ -212,10 +211,116 @@  static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 
 	*memcgidp = memcgid;
 	*pgdat = NODE_DATA(nid);
-	*evictionp = entry << bucket_order;
+	*evictionp = entry;
 	*workingsetp = workingset;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+static int folio_lru_refs(struct folio *folio)
+{
+	unsigned long flags = READ_ONCE(folio->flags);
+
+	BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
+
+	/* see the comment on MAX_NR_TIERS */
+	return flags & BIT(PG_workingset) ? (flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF : 0;
+}
+
+static void *lru_gen_eviction(struct folio *folio)
+{
+	int hist, tier;
+	unsigned long token;
+	unsigned long min_seq;
+	struct lruvec *lruvec;
+	struct lru_gen_struct *lrugen;
+	int type = folio_is_file_lru(folio);
+	int refs = folio_lru_refs(folio);
+	int delta = folio_nr_pages(folio);
+	bool workingset = folio_test_workingset(folio);
+	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct pglist_data *pgdat = folio_pgdat(folio);
+
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	lrugen = &lruvec->lrugen;
+	min_seq = READ_ONCE(lrugen->min_seq[type]);
+	token = (min_seq << LRU_REFS_WIDTH) | refs;
+
+	hist = lru_hist_from_seq(min_seq);
+	tier = lru_tier_from_refs(refs + workingset);
+	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
+
+	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
+}
+
+static void lru_gen_refault(struct folio *folio, void *shadow)
+{
+	int hist, tier, refs;
+	int memcg_id;
+	bool workingset;
+	unsigned long token;
+	unsigned long min_seq;
+	struct lruvec *lruvec;
+	struct lru_gen_struct *lrugen;
+	struct mem_cgroup *memcg;
+	struct pglist_data *pgdat;
+	int type = folio_is_file_lru(folio);
+	int delta = folio_nr_pages(folio);
+
+	unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset);
+
+	refs = token & (BIT(LRU_REFS_WIDTH) - 1);
+	if (refs && !workingset)
+		return;
+
+	if (folio_pgdat(folio) != pgdat)
+		return;
+
+	rcu_read_lock();
+	memcg = folio_memcg_rcu(folio);
+	if (mem_cgroup_id(memcg) != memcg_id)
+		goto unlock;
+
+	token >>= LRU_REFS_WIDTH;
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	lrugen = &lruvec->lrugen;
+	min_seq = READ_ONCE(lrugen->min_seq[type]);
+	if (token != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH)))
+		goto unlock;
+
+	hist = lru_hist_from_seq(min_seq);
+	tier = lru_tier_from_refs(refs + workingset);
+	atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
+	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
+
+	/*
+	 * Count the following two cases as stalls:
+	 * 1) For pages accessed through page tables, hotter pages pushed out
+	 *    hot pages which refaulted immediately.
+	 * 2) For pages accessed through file descriptors, numbers of accesses
+	 *    might have been beyond the limit.
+	 */
+	if (lru_gen_in_fault() || refs + workingset == BIT(LRU_REFS_WIDTH)) {
+		folio_set_workingset(folio);
+		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
+	}
+unlock:
+	rcu_read_unlock();
+}
+
+#else
+
+static void *lru_gen_eviction(struct folio *folio)
+{
+	return NULL;
+}
+
+static void lru_gen_refault(struct folio *folio, void *shadow)
+{
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 /**
  * workingset_age_nonresident - age non-resident entries as LRU ages
  * @lruvec: the lruvec that was aged
@@ -264,10 +369,14 @@  void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
+	if (lru_gen_enabled())
+		return lru_gen_eviction(page_folio(page));
+
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
 	eviction = atomic_long_read(&lruvec->nonresident_age);
+	eviction >>= bucket_order;
 	workingset_age_nonresident(lruvec, thp_nr_pages(page));
 	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
@@ -297,7 +406,13 @@  void workingset_refault(struct folio *folio, void *shadow)
 	int memcgid;
 	long nr;
 
+	if (lru_gen_enabled()) {
+		lru_gen_refault(folio, shadow);
+		return;
+	}
+
 	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
+	eviction <<= bucket_order;
 
 	rcu_read_lock();
 	/*