
[v9,06/14] mm: multi-gen LRU: minimal implementation

Message ID 20220309021230.721028-7-yuzhao@google.com
Series Multi-Gen LRU Framework

Commit Message

Yu Zhao March 9, 2022, 2:12 a.m. UTC
To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multi-gen LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.

The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; the demotion of cold pages happens
consequently when it increments max_seq. The aging has the complexity
O(nr_hot_pages), since it is only interested in hot pages. Promotion
in the aging path does not require any LRU list operations, only the
updates of the gen counter and lrugen->nr_pages[]; demotion, unless it
is the result of incrementing max_seq, requires LRU list operations,
e.g., lru_deactivate_fn().
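
For illustration only (not part of the patch, and with made-up names),
the bookkeeping above can be modeled like this: a generation is simply
seq % MAX_NR_GENS, and promotion updates a per-page generation number
plus two counters, with no list operations at all.

  /* Toy model, not kernel code: generations form a ring of MAX_NR_GENS
   * slots indexed by seq % MAX_NR_GENS. */
  #define MAX_NR_GENS 4

  struct toy_page { int gen; };

  struct toy_lruvec {
          unsigned long min_seq, max_seq;
          long nr_pages[MAX_NR_GENS];
  };

  static int toy_gen_from_seq(unsigned long seq)
  {
          return seq % MAX_NR_GENS;
  }

  /* Promotion in the aging path: O(1), only the gen counter and the
   * per-generation page counts change; the page stays on its list. */
  static void toy_promote(struct toy_lruvec *v, struct toy_page *p)
  {
          int new_gen = toy_gen_from_seq(v->max_seq);

          v->nr_pages[p->gen]--;
          v->nr_pages[new_gen]++;
          p->gen = new_gen;
  }

  /* The demotion of the remaining (cold) pages is implicit: incrementing
   * max_seq makes every page that was not promoted one generation older. */
  static void toy_inc_max_seq(struct toy_lruvec *v)
  {
          v->max_seq++;
  }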

The eviction consumes old generations. Given an lruvec, it increments
min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types
are available from the same generation.
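
As an illustration of the idea only (the patch's actual feedback loop is
more involved and also works per tier), the type selection can be
sketched as picking the type whose recently evicted pages refault less;
toy_ctrl and toy_pick_type below are made-up names.

  struct toy_ctrl {
          unsigned long refaulted;  /* refaults since the last reset */
          unsigned long evicted;    /* evictions since the last reset */
  };

  /* Returns 1 to evict file, 0 to evict anon, when both are available.
   * Cross-multiply to avoid division; +1 guards the all-zero case. */
  static int toy_pick_type(const struct toy_ctrl *anon,
                           const struct toy_ctrl *file)
  {
          return file->refaulted * (anon->evicted + 1) <=
                 anon->refaulted * (file->evicted + 1);
  }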

Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses through file descriptors. A
page accessed N times through file descriptors is in tier
order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in folio->flags. In contrast to moving across generations, which
requires the LRU lock, moving across tiers only involves operations on
folio->flags. The feedback loop also monitors refaults over all tiers
and decides when to protect pages in which tiers (N>1), using the
first tier (N=0,1) as a baseline. The first tier contains single-use
unmapped clean pages, which are most likely the best choices. The
eviction moves a page to the next generation, i.e., min_seq+1, if the
feedback loop decides so. This approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
   inferring whether pages accessed multiple times through file
   descriptors are statistically hot and thus worth protecting in the
   eviction path.
2. It takes pages accessed through page tables into account and avoids
   overprotecting pages accessed multiple times through file
   descriptors. (Pages accessed through page tables are in the first
   tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
   twice through file descriptors, when under heavy buffered I/O
   workloads.
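
As a concrete illustration of the N-to-tier mapping above (toy code with
an assumed MAX_NR_TIERS of 4, not the patch's implementation):

  #define MAX_NR_TIERS 4  /* assumption for this sketch */

  /* order_base_2(n): ceil(log2(n)), with n = 0 or 1 mapping to 0 */
  static int toy_order_base_2(unsigned int n)
  {
          int t = 0;

          while ((1u << t) < n)
                  t++;
          return t;
  }

  /* A page accessed n times through file descriptors is in this tier. */
  static int toy_tier(unsigned int n)
  {
          int tier = toy_order_base_2(n);

          return tier < MAX_NR_TIERS ? tier : MAX_NR_TIERS - 1;
  }

  /* n = 0,1 -> tier 0 (the baseline); n = 2 -> tier 1; n = 3,4 -> tier 2;
   * n >= 5 -> tier 3 (capped in this sketch). */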

Server benchmark results:
  Single workload:
    fio (buffered I/O): +[47, 49]%
                IOPS         BW
      5.17-rc2: 2242k        8759MiB/s
      patch1-5: 3321k        12.7GiB/s

  Single workload:
    memcached (anon): +[101, 105]%
                Ops/sec      KB/sec
      5.17-rc2: 476771.79    18544.31
      patch1-5: 972526.07    37826.95

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the
    results.

    patch drivers/block/brd.c <<EOF
    99,100c99,100
    < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    < 	page = alloc_page(gfp_flags);
    ---
    > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > 	page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    EOF

    cat >>/etc/memcached.conf <<EOF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    EOF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.17-rc2
      38.05%  page_vma_mapped_walk
      20.86%  lzo1x_1_do_compress (real work)
       6.16%  do_raw_spin_lock
       4.61%  _raw_spin_unlock_irq
       2.20%  vma_interval_tree_iter_next
       2.19%  vma_interval_tree_subtree_search
       2.15%  page_referenced_one
       1.93%  anon_vma_interval_tree_iter_first
       1.65%  ptep_clear_flush
       1.00%  __zram_bvec_write

    patch1-5
      39.73%  lzo1x_1_do_compress (real work)
      14.96%  page_vma_mapped_walk
       6.97%  _raw_spin_unlock_irq
       3.07%  do_raw_spin_lock
       2.53%  anon_vma_interval_tree_iter_first
       2.04%  ptep_clear_flush
       1.82%  __zram_bvec_write
       1.76%  __anon_vma_interval_tree_subtree_search
       1.57%  memmove
       1.45%  free_unref_page_list

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    Chrome OS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 include/linux/mm.h        |   1 +
 include/linux/mm_inline.h |  24 ++
 include/linux/mmzone.h    |  42 ++
 kernel/bounds.c           |   2 +-
 mm/Kconfig                |   9 +
 mm/swap.c                 |  42 ++
 mm/vmscan.c               | 786 +++++++++++++++++++++++++++++++++++++-
 mm/workingset.c           | 119 +++++-
 8 files changed, 1021 insertions(+), 4 deletions(-)

Comments

Huang, Ying March 16, 2022, 5:55 a.m. UTC | #1
Hi, Yu,

Yu Zhao <yuzhao@google.com> writes:

[snip]

>  
> +static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
> +{
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
> +	if (!can_demote(pgdat->node_id, sc) &&
> +	    mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
> +		return 0;
> +
> +	return mem_cgroup_swappiness(memcg);
> +}
> +

We have tested v9 for the memory tiering system; the demotion works now
even without swap devices configured.  Thanks!

And we found that the demotion (page reclaiming on DRAM nodes) speed is
lower than with the original implementation.  The workload itself is just a
memory accessing micro-benchmark with Gauss distribution.  It is run on
a system with DRAM and PMEM.  Initially, quite some hot pages are placed
in PMEM and quite some cold pages are placed in DRAM.  Then the page
placement optimizing mechanism based on NUMA balancing will try to
promote some hot pages from PMEM node to DRAM node.  If the DRAM node is
nearly full (reaches the high watermark), kswapd on the DRAM node will be
woken up to demote (reclaim) some cold DRAM pages to PMEM.  Because quite
some pages on DRAM are very cold (not accessed for at least several
seconds), the benchmark performance will be better if the demotion speed
is faster.

Some data comes from /proc/vmstat and perf-profile is as follows.

From /proc/vmstat, it seems that far fewer pages are scanned and demoted
with MGLRU enabled.  The pgdemote_kswapd / pgscan_kswapd ratio is 5.22
times higher with MGLRU enabled than with MGLRU disabled.  I think this
shows the value of direct page table scanning.
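(For reference, using the numbers below: 32701609 / 83582770 is about
0.39 with MGLRU enabled, 153796237 / 2055504891 is about 0.075 with it
disabled, and 0.39 / 0.075 is about 5.2.)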

From the perf profile, the CPU cycles for kswapd are the same, but fewer
pages are demoted (reclaimed) with MGLRU.  And it appears that the total
page table scanning time of MGLRU is longer if we compare walk_page_range
(1.97%, MGLRU enabled) and page_referenced (0.54%, MGLRU disabled)?
Is this because we only demote (reclaim) from DRAM nodes, not from PMEM
nodes, and the bloom filter doesn't work well enough?
One thing that may not be friendly to the bloom filter is that some
virtual pages may change their resident nodes because of
demotion/promotion.

Can you teach me how to interpret these data for MGLRU?  Or can you
point me to other/better data for MGLRU?

MGLRU disabled via: echo -n 0 > /sys/kernel/mm/lru_gen/enabled
--------------------------------------------------------------

/proc/vmstat:

pgactivate 1767172340
pgdeactivate 1740111896
pglazyfree 0
pgfault 583875828
pgmajfault 0
pglazyfreed 0
pgrefill 1740111896
pgreuse 22626572
pgsteal_kswapd 153796237
pgsteal_direct 1999
pgdemote_kswapd 153796237
pgdemote_direct 1999
pgscan_kswapd 2055504891
pgscan_direct 1999
pgscan_direct_throttle 0
pgscan_anon 2055356614
pgscan_file 150276
pgsteal_anon 153798203
pgsteal_file 33
zone_reclaim_failed 0
pginodesteal 0
slabs_scanned 82761
kswapd_inodesteal 0
kswapd_low_wmark_hit_quickly 2960
kswapd_high_wmark_hit_quickly 17732
pageoutrun 21583
pgrotated 0
drop_pagecache 0
drop_slab 0
oom_kill 0
numa_pte_updates 515994024
numa_huge_pte_updates 154
numa_hint_faults 498301236
numa_hint_faults_local 121109067
numa_pages_migrated 152650705
pgmigrate_success 307213704
pgmigrate_fail 39
thp_migration_success 93
thp_migration_fail 0
thp_migration_split 0

perf-profile:

kswapd.kthread.ret_from_fork: 2.86
balance_pgdat.kswapd.kthread.ret_from_fork: 2.86
shrink_node.balance_pgdat.kswapd.kthread.ret_from_fork: 2.85
shrink_lruvec.shrink_node.balance_pgdat.kswapd.kthread: 2.76
shrink_inactive_list.shrink_lruvec.shrink_node.balance_pgdat.kswapd: 1.9
shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node.balance_pgdat: 1.52
shrink_active_list.shrink_lruvec.shrink_node.balance_pgdat.kswapd: 0.85
migrate_pages.shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node: 0.79
page_referenced.shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node: 0.54


MGLRU enabled via: echo -n 7 > /sys/kernel/mm/lru_gen/enabled
-------------------------------------------------------------

/proc/vmstat:

pgactivate 47212585
pgdeactivate 0
pglazyfree 0
pgfault 580056521
pgmajfault 0
pglazyfreed 0
pgrefill 6911868880
pgreuse 25108929
pgsteal_kswapd 32701609
pgsteal_direct 0
pgdemote_kswapd 32701609
pgdemote_direct 0
pgscan_kswapd 83582770
pgscan_direct 0
pgscan_direct_throttle 0
pgscan_anon 83549777
pgscan_file 32993
pgsteal_anon 32701576
pgsteal_file 33
zone_reclaim_failed 0
pginodesteal 0
slabs_scanned 84829
kswapd_inodesteal 0
kswapd_low_wmark_hit_quickly 313
kswapd_high_wmark_hit_quickly 5262
pageoutrun 5895
pgrotated 0
drop_pagecache 0
drop_slab 0
oom_kill 0
numa_pte_updates 512084786
numa_huge_pte_updates 198
numa_hint_faults 494583387
numa_hint_faults_local 129411334
numa_pages_migrated 34165992
pgmigrate_success 67833977
pgmigrate_fail 7
thp_migration_success 135
thp_migration_fail 0
thp_migration_split 0

perf-profile:

kswapd.kthread.ret_from_fork: 2.86
balance_pgdat.kswapd.kthread.ret_from_fork: 2.86
lru_gen_age_node.balance_pgdat.kswapd.kthread.ret_from_fork: 1.97
walk_page_range.try_to_inc_max_seq.lru_gen_age_node.balance_pgdat.kswapd: 1.97
shrink_node.balance_pgdat.kswapd.kthread.ret_from_fork: 0.89
evict_folios.lru_gen_shrink_lruvec.shrink_lruvec.shrink_node.balance_pgdat: 0.89
scan_folios.evict_folios.lru_gen_shrink_lruvec.shrink_lruvec.shrink_node: 0.66

Best Regards,
Huang, Ying

[snip]
Yu Zhao March 16, 2022, 7:54 a.m. UTC | #2
On Tue, Mar 15, 2022 at 11:55 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Yu,
>
> Yu Zhao <yuzhao@google.com> writes:
>
> [snip]
>
> >
> > +static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
> > +{
> > +     struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > +     struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > +
> > +     if (!can_demote(pgdat->node_id, sc) &&
> > +         mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
> > +             return 0;
> > +
> > +     return mem_cgroup_swappiness(memcg);
> > +}
> > +
>
> We have tested v9 for the memory tiering system; the demotion works now
> even without swap devices configured.  Thanks!

Admittedly I didn't test it :) So thanks for testing -- I'm glad to
hear it didn't fall apart.

> And we found that the demotion (page reclaiming on DRAM nodes) speed is
> lower than with the original implementation.

This sounds like an improvement to me, assuming the initial hot/cold
memory placements were similar for both the baseline and MGLRU.

Correct me if I'm wrong: since demotion is driven by promotion, lower
demotion speed means hot and cold pages were sorted out to DRAM and
AEP at a faster speed, hence an improvement.

# promotion path:
numa_hint_faults    498301236
numa_pages_migrated 152650705

numa_hint_faults    494583387
numa_pages_migrated 34165992

# demotion path:
pgsteal_anon 153798203
pgsteal_file 33

pgsteal_anon 32701576
pgsteal_file 33

The hint faults are similar but MGLRU has much fewer migrated -- my
guess is it demoted much fewer hot/warm pages and therefore led to
less work on the promotion path.
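(Roughly, 152650705 / 498301236, or about 31%, of hint faults led to a
migration without MGLRU, versus 34165992 / 494583387, or about 7%, with
MGLRU.)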

>  The workload itself is just a
> memory accessing micro-benchmark with Gauss distribution.  It is run on
> a system with DRAM and PMEM.  Initially, quite some hot pages are placed
> in PMEM and quite some cold pages are placed in DRAM.  Then the page
> placement optimizing mechanism based on NUMA balancing will try to
> promote some hot pages from PMEM node to DRAM node.

My understanding seems to be correct?

>  If the DRAM node is
> nearly full (reaches the high watermark), kswapd on the DRAM node will be
> woken up to demote (reclaim) some cold DRAM pages to PMEM.  Because quite
> some pages on DRAM are very cold (not accessed for at least several
> seconds), the benchmark performance will be better if the demotion speed
> is faster.

I'm confused. It seems to me demotion speed is irrelevant. The time to
reach the equilibrium is what we want to measure.

> Some data comes from /proc/vmstat and perf-profile is as follows.
>
> From /proc/vmstat, it seems that far fewer pages are scanned and demoted
> with MGLRU enabled.  The pgdemote_kswapd / pgscan_kswapd ratio is 5.22
> times higher with MGLRU enabled than with MGLRU disabled.  I think this
> shows the value of direct page table scanning.

Can't disagree :)

> From the perf profile, the CPU cycles for kswapd are the same, but fewer
> pages are demoted (reclaimed) with MGLRU.  And it appears that the total
> page table scanning time of MGLRU is longer if we compare walk_page_range
> (1.97%, MGLRU enabled) and page_referenced (0.54%, MGLRU disabled)?

It's possible if the address space is very large and sparse. But once
MGLRU warms up, it should detect it and fall back to
page_referenced().

> Is this because we only demote (reclaim) from DRAM nodes, not from PMEM
> nodes, and the bloom filter doesn't work well enough?

The bloom filters are per lruvec. So this should affect them.

> One thing that may not be friendly to the bloom filter is that some
> virtual pages may change their resident nodes because of
> demotion/promotion.

Yes, it's possible.

> Can you teach me how to interpret these data for MGLRU?  Or can you
> point me to other/better data for MGLRU?

You are the expert :)

My current understanding is that this is an improvement. IOW, with
MGLRU, DRAM (hot) <-> AEP (cold) reached equilibrium a lot faster.


> [snip]
Barry Song March 19, 2022, 3:01 a.m. UTC | #3
> +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +       unsigned long old_flags, new_flags;
> +       int type = folio_is_file_lru(folio);
> +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +       int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> +
> +       do {
> +               new_flags = old_flags = READ_ONCE(folio->flags);
> +               VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
> +
> +               new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> +               new_gen = (old_gen + 1) % MAX_NR_GENS;

new_gen is assigned twice, i assume you mean
               old_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
               new_gen = (old_gen + 1) % MAX_NR_GENS;

or do you always mean new_gen =  lru_gen_from_seq(min_seq) + 1?

> +
> +               new_flags &= ~LRU_GEN_MASK;
> +               new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
> +               new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
> +               /* for folio_end_writeback() */
> +               if (reclaiming)
> +                       new_flags |= BIT(PG_reclaim);
> +       } while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
> +
> +       lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> +
> +       return new_gen;
> +}
> +

Thanks
Barry
Yu Zhao March 19, 2022, 3:11 a.m. UTC | #4
On Fri, Mar 18, 2022 at 9:01 PM Barry Song <21cnbao@gmail.com> wrote:
>
> > +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > +{
> > +       unsigned long old_flags, new_flags;
> > +       int type = folio_is_file_lru(folio);
> > +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > +       int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> > +
> > +       do {
> > +               new_flags = old_flags = READ_ONCE(folio->flags);
> > +               VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
> > +
> > +               new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> > +               new_gen = (old_gen + 1) % MAX_NR_GENS;
>
> new_gen is assigned twice, i assume you mean
>                old_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
>                new_gen = (old_gen + 1) % MAX_NR_GENS;
>
> or do you always mean new_gen =  lru_gen_from_seq(min_seq) + 1?

Thanks a lot for your attention to details!

The first line should be in the next patch, but I overlooked it during
the last refactoring:

  new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+ /* folio_update_gen() has promoted this page? */
+ if (new_gen >= 0 && new_gen != old_gen)
+ return new_gen;
+
  new_gen = (old_gen + 1) % MAX_NR_GENS;
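
For clarity, the loop body with that fix folded in would read roughly as
follows (a sketch of the intended result, not the exact next patch):

        do {
                new_flags = old_flags = READ_ONCE(folio->flags);
                VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);

                new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
                /* folio_update_gen() has promoted this page? */
                if (new_gen >= 0 && new_gen != old_gen)
                        return new_gen;

                new_gen = (old_gen + 1) % MAX_NR_GENS;

                new_flags &= ~LRU_GEN_MASK;
                new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
                new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
                /* for folio_end_writeback() */
                if (reclaiming)
                        new_flags |= BIT(PG_reclaim);
        } while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);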
Barry Song March 19, 2022, 10:14 a.m. UTC | #5
> +static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
> +{
> +       int prev, next;
> +       int type, zone;
> +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +       spin_lock_irq(&lruvec->lru_lock);
> +
> +       VM_BUG_ON(!seq_is_valid(lruvec));
> +
> +       if (max_seq != lrugen->max_seq)
> +               goto unlock;
> +
> +       inc_min_seq(lruvec);
> +
> +       /* update the active/inactive LRU sizes for compatibility */
> +       prev = lru_gen_from_seq(lrugen->max_seq - 1);
> +       next = lru_gen_from_seq(lrugen->max_seq + 1);
> +
> +       for (type = 0; type < ANON_AND_FILE; type++) {
> +               for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> +                       enum lru_list lru = type * LRU_INACTIVE_FILE;
> +                       long delta = lrugen->nr_pages[prev][type][zone] -
> +                                    lrugen->nr_pages[next][type][zone];

this is confusing to me. does lrugen->nr_pages[next][type][zone] have a
chance to be non-zero even before max_seq is increased? some pages
can be in the next generation before the generation is born?

isn't it a bug if(lrugen->nr_pages[next][type][zone] > 0)? shouldn't it be?

delta = lrugen->nr_pages[prev][type][zone];

> +
> +                       if (!delta)
> +                               continue;
> +
> +                       __update_lru_size(lruvec, lru, zone, delta);
> +                       __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, -delta);
> +               }
> +       }
> +
> +       for (type = 0; type < ANON_AND_FILE; type++)
> +               reset_ctrl_pos(lruvec, type, false);
> +
> +       /* make sure preceding modifications appear */
> +       smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
> +unlock:
> +       spin_unlock_irq(&lruvec->lru_lock);
> +}

Thanks
Barry
Barry Song March 19, 2022, 11:15 a.m. UTC | #6
> +                            unsigned long *min_seq, bool can_swap, bool *need_aging)
> +{
> +       int gen, type, zone;
> +       long old = 0;
> +       long young = 0;
> +       long total = 0;
> +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +       for (type = !can_swap; type < ANON_AND_FILE; type++) {
> +               unsigned long seq;
> +
> +               for (seq = min_seq[type]; seq <= max_seq; seq++) {
> +                       long size = 0;
> +
> +                       gen = lru_gen_from_seq(seq);
> +
> +                       for (zone = 0; zone < MAX_NR_ZONES; zone++)
> +                               size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
> +
> +                       total += size;
> +                       if (seq == max_seq)
> +                               young += size;
> +                       if (seq + MIN_NR_GENS == max_seq)
> +                               old += size;
> +               }
> +       }
> +
> +       /* try to spread pages out across MIN_NR_GENS+1 generations */
> +       if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS > max_seq)
> +               *need_aging = true;
> +       else if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS < max_seq)
> +               *need_aging = false;
> +       else if (young * MIN_NR_GENS > total)
> +               *need_aging = true;

Could we have some doc here? Given MIN_NR_GENS=2 and MAX_NR_GENS=4,
it seems you mean if we have three generations and the youngest pages are more
than 1/2 of the total pages, we need aging?


> +       else if (old * (MIN_NR_GENS + 2) < total)
> +               *need_aging = true;

it seems you mean if the oldest pages are less than 1/4 of the total pages,
we need aging? Can we have comments to explain why here?

your commit message only says " The aging produces young generations.
Given an lruvec, it increments max_seq when max_seq-min_seq+1
approaches MIN_NR_GENS." it can't explain what the code is doing
here.


> +       else
> +               *need_aging = false;
> +
> +       return total > 0 ? total : 0;
> +}

Thanks
Barry
Aneesh Kumar K.V March 21, 2022, 12:51 p.m. UTC | #7
+
> +static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
> +			     unsigned long *min_seq, bool can_swap, bool *need_aging)
> +{
> +	int gen, type, zone;
> +	long old = 0;
> +	long young = 0;
> +	long total = 0;
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +	for (type = !can_swap; type < ANON_AND_FILE; type++) {
> +		unsigned long seq;
> +
> +		for (seq = min_seq[type]; seq <= max_seq; seq++) {
> +			long size = 0;
> +
> +			gen = lru_gen_from_seq(seq);
> +
> +			for (zone = 0; zone < MAX_NR_ZONES; zone++)
> +				size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
> +
> +			total += size;
> +			if (seq == max_seq)
> +				young += size;
> +			if (seq + MIN_NR_GENS == max_seq)
> +				old += size;
> +		}
> +	}
> +
> +	/* try to spread pages out across MIN_NR_GENS+1 generations */
> +	if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS > max_seq)
> +		*need_aging = true;
> +	else if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS < max_seq)
> +		*need_aging = false;

Can you explain/document the reason for considering the conditions
below for ageing?

> +	else if (young * MIN_NR_GENS > total)
> +		*need_aging = true;

Are we trying to consider the case of more than half the total pages
being young as needing ageing? If so, should MIN_NR_GENS be 2 instead of
using that #define? Or

> +	else if (old * (MIN_NR_GENS + 2) < total)
> +		*need_aging = true;

What is the significance of '+ 2' ? 

> +	else
> +		*need_aging = false;
> +
> +	return total > 0 ? total : 0;
> +}
> +
Aneesh Kumar K.V March 21, 2022, 1:01 p.m. UTC | #8
Yu Zhao <yuzhao@google.com> writes:

[snip]

In shrink_active_list we give preferential treatment to VM_EXEC pages.
Do we do a similar thing with MGLRU? If not, why is that not needed?

	if (page_referenced(page, 0, sc->target_mem_cgroup,
			    &vm_flags)) {
		/*
		 * Identify referenced, file-backed active pages and
		 * give them one more trip around the active list. So
		 * that executable code get better chances to stay in
		 * memory under moderate memory pressure.  Anon pages
		 * are not likely to be evicted by use-once streaming
		 * IO, plus JVM can create lots of anon VM_EXEC pages,
		 * so we ignore them here.
		 */
		if ((vm_flags & VM_EXEC) && page_is_file_lru(page)) {
			nr_rotated += thp_nr_pages(page);
			list_add(&page->lru, &l_active);
			continue;
		}
	}
Yu Zhao March 21, 2022, 11:51 p.m. UTC | #9
On Sat, Mar 19, 2022 at 4:14 AM Barry Song <21cnbao@gmail.com> wrote:
>
> > +static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
> > +{
> > +       int prev, next;
> > +       int type, zone;
> > +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > +
> > +       spin_lock_irq(&lruvec->lru_lock);
> > +
> > +       VM_BUG_ON(!seq_is_valid(lruvec));
> > +
> > +       if (max_seq != lrugen->max_seq)
> > +               goto unlock;
> > +
> > +       inc_min_seq(lruvec);
> > +
> > +       /* update the active/inactive LRU sizes for compatibility */
> > +       prev = lru_gen_from_seq(lrugen->max_seq - 1);
> > +       next = lru_gen_from_seq(lrugen->max_seq + 1);
> > +
> > +       for (type = 0; type < ANON_AND_FILE; type++) {
> > +               for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> > +                       enum lru_list lru = type * LRU_INACTIVE_FILE;
> > +                       long delta = lrugen->nr_pages[prev][type][zone] -
> > +                                    lrugen->nr_pages[next][type][zone];
>
> this is confusing to me. does lrugen->nr_pages[next][type][zone] have a
> chance to be non-zero even before max_seq is increased? some pages
> can be in the next generation before the generation is born?

Yes.

> isn't it a bug if(lrugen->nr_pages[next][type][zone] > 0)? shouldn't it be?
>
> delta = lrugen->nr_pages[prev][type][zone];

No. The gen counter in page flags can be updated locklessly, i.e.,
without holding lru_lock. Later a batched update of nr_pages[] will
account for the
change made. If the gen counter is updated to a stale max_seq, and
this stale max_seq is less than min_seq, then this page will be in a
generation yet to be born. Extremely unlikely, but still possible.

This is not a bug because pages might be misplaced but they won't be
lost. IOW, nr_pages[] is always balanced across all *possible*
generations. For the same reason, reset_batch_size() and
drain_evictable() use for_each_gen_type_zone() to go through all
possible generations rather than only those between [max_seq, min_seq].

I'll add a comment here. Sounds good?
Yu Zhao March 22, 2022, 12:30 a.m. UTC | #10
On Sat, Mar 19, 2022 at 5:15 AM Barry Song <21cnbao@gmail.com> wrote:
>
> > +                            unsigned long *min_seq, bool can_swap, bool *need_aging)
> > +{
> > +       int gen, type, zone;
> > +       long old = 0;
> > +       long young = 0;
> > +       long total = 0;
> > +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > +
> > +       for (type = !can_swap; type < ANON_AND_FILE; type++) {
> > +               unsigned long seq;
> > +
> > +               for (seq = min_seq[type]; seq <= max_seq; seq++) {
> > +                       long size = 0;
> > +
> > +                       gen = lru_gen_from_seq(seq);
> > +
> > +                       for (zone = 0; zone < MAX_NR_ZONES; zone++)
> > +                               size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
> > +
> > +                       total += size;
> > +                       if (seq == max_seq)
> > +                               young += size;
> > +                       if (seq + MIN_NR_GENS == max_seq)
> > +                               old += size;
> > +               }
> > +       }
> > +
> > +       /* try to spread pages out across MIN_NR_GENS+1 generations */
> > +       if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS > max_seq)
> > +               *need_aging = true;
> > +       else if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS < max_seq)
> > +               *need_aging = false;
> > +       else if (young * MIN_NR_GENS > total)
> > +               *need_aging = true;
>
> Could we have some doc here?

Will do.

> Given MIN_NR_GENS=2 and MAX_NR_GENS=4,
> it seems you mean if we have three generations and the youngest pages are more
> than 1/2 of the total pages, we need aging?

Yes.

> > +       else if (old * (MIN_NR_GENS + 2) < total)
> > +               *need_aging = true;
>
> it seems you mean if the oldest pages are less than 1/4 of the total pages,
> we need aging?

Yes.

> Can we have comments to explain why here?
>
> your commit message only says " The aging produces young generations.
> Given an lruvec, it increments max_seq when max_seq-min_seq+1
> approaches MIN_NR_GENS." it can't explain what the code is doing
> here.

Fair enough. Approaching MIN_NR_GENS=2 means getting close to it. From
the consumer's POV, if it *reaches* 2, the eviction will have to
stall, because the two youngest generations are not yet fully aged,
i.e., the second chance policy similar to the active/inactive lru.
From the producer's POV, the aging tries to be lazy to reduce the
overhead. So ideally, we want 3 generations, which gives a reasonable
range [2, 4], hence the first two if's.

In addition, we want pages to spread out evenly over these 3
generations, meaning an average 1/3 of total pages for each
generation, which gives another reasonable range [1/2, 1/4]. Since the
eviction reduces the number of old pages, we only need to check
against the lower bound, i.e., 1/4. On the other hand, page (re)faults
increase the number of young pages, so in this case, we need to check
against the upper bound.
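
For example, with MIN_NR_GENS=2 and 1200 total pages spread over 3
generations, the aging kicks in if the youngest generation holds more
than 600 pages (young * 2 > total) or the oldest holds fewer than 300
(old * 4 < total).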

I'll include these details in the next spin.
Yu Zhao March 22, 2022, 4:02 a.m. UTC | #11
On Mon, Mar 21, 2022 at 6:52 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> [snip]

Will improve the comment according to my previous reply here [1].

[1] https://lore.kernel.org/linux-mm/CAOUHufYmUPZY0gCC+wYk6Vr1L8KEx+tJeEAhjpBfUnLJsAHq5A@mail.gmail.com/
Yu Zhao March 22, 2022, 4:39 a.m. UTC | #12
On Mon, Mar 21, 2022 at 7:01 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > [snip]
>
> In shrink_active_list we give preferential treatment to VM_EXEC pages.
> Do we do a similar thing with MGLRU? If not, why is that not needed?

No, because MGLRU has a different set of assumptions than the
active/inactive LRU does [1]. It provides mmapped pages with equal
opportunities, and the tradeoff was discussed here [2].

Note that even with this preferential treatment of executable pages,
plus other heuristics added since then, executable pages are still
underprotected for at least desktop workloads [3]. And I can confirm
the problem reported is genuine -- we recently accidentally removed
our private patch that had worked around the problem for the last 12 years,
and observed immediate consequences on a small portion of devices not
using MGLRU [4].

[1] https://lore.kernel.org/linux-mm/20220309021230.721028-15-yuzhao@google.com/
[2] https://lore.kernel.org/linux-mm/20220208081902.3550911-5-yuzhao@google.com/
[3] https://lore.kernel.org/linux-mm/2dc51fc8-f14e-17ed-a8c6-0ec70423bf54@valdikss.org.ru/
[4] https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/3429559
Aneesh Kumar K.V March 22, 2022, 5:26 a.m. UTC | #13
Yu Zhao <yuzhao@google.com> writes:

 +
> +static void inc_min_seq(struct lruvec *lruvec)
> +{
> +	int type;
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +	VM_BUG_ON(!seq_is_valid(lruvec));
> +
> +	for (type = 0; type < ANON_AND_FILE; type++) {
> +		if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
> +			continue;
> +
> +		reset_ctrl_pos(lruvec, type, true);
> +		WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
> +	}
> +}
> +
> +static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
> +{
> +	int gen, type, zone;
> +	bool success = false;
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +	DEFINE_MIN_SEQ(lruvec);
> +
> +	VM_BUG_ON(!seq_is_valid(lruvec));
> +
> +	for (type = !can_swap; type < ANON_AND_FILE; type++) {
> +		while (min_seq[type] + MIN_NR_GENS <= lrugen->max_seq) {
> +			gen = lru_gen_from_seq(min_seq[type]);
> +
> +			for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> +				if (!list_empty(&lrugen->lists[gen][type][zone]))
> +					goto next;
> +			}
> +
> +			min_seq[type]++;
> +		}
> +next:
> +		;
> +	}
> +
> +	/* see the comment on lru_gen_struct */
> +	if (can_swap) {
> +		min_seq[LRU_GEN_ANON] = min(min_seq[LRU_GEN_ANON], min_seq[LRU_GEN_FILE]);
> +		min_seq[LRU_GEN_FILE] = max(min_seq[LRU_GEN_ANON], lrugen->min_seq[LRU_GEN_FILE]);
> +	}
> +
> +	for (type = !can_swap; type < ANON_AND_FILE; type++) {
> +		if (min_seq[type] == lrugen->min_seq[type])
> +			continue;
> +
> +		reset_ctrl_pos(lruvec, type, true);
> +		WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
> +		success = true;
> +	}
> +
> +	return success;
> +}
> +
> +static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
> +{
> +	int prev, next;
> +	int type, zone;
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +	spin_lock_irq(&lruvec->lru_lock);
> +
> +	VM_BUG_ON(!seq_is_valid(lruvec));
> +
> +	if (max_seq != lrugen->max_seq)
> +		goto unlock;
> +
> +	inc_min_seq(lruvec);

Can this min_seq update result in pages considered oldest becoming young?
I.e., if we had seq values of 0 - 3 and we need ageing, the new min_seq
and max_seq values will now become 1 - 4. What happens to pages with
generation value 0, which was the oldest generation earlier and is the
youngest now?


> +
> +	/* update the active/inactive LRU sizes for compatibility */
> +	prev = lru_gen_from_seq(lrugen->max_seq - 1);
> +	next = lru_gen_from_seq(lrugen->max_seq + 1);
> +
> +	for (type = 0; type < ANON_AND_FILE; type++) {
> +		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> +			enum lru_list lru = type * LRU_INACTIVE_FILE;
> +			long delta = lrugen->nr_pages[prev][type][zone] -
> +				     lrugen->nr_pages[next][type][zone];
> +
> +			if (!delta)
> +				continue;
> +
> +			__update_lru_size(lruvec, lru, zone, delta);
> +			__update_lru_size(lruvec, lru + LRU_ACTIVE, zone, -delta);
> +		}
> +	}
> +
> +	for (type = 0; type < ANON_AND_FILE; type++)
> +		reset_ctrl_pos(lruvec, type, false);
> +
> +	/* make sure preceding modifications appear */
> +	smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
> +unlock:
> +	spin_unlock_irq(&lruvec->lru_lock);
> +}
> +

....

 +
> +static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
> +{
> +	int type;
> +	int scanned;
> +	int reclaimed;
> +	LIST_HEAD(list);
> +	struct folio *folio;
> +	enum vm_event_item item;
> +	struct reclaim_stat stat;
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
> +	spin_lock_irq(&lruvec->lru_lock);
> +
> +	scanned = isolate_folios(lruvec, sc, swappiness, &type, &list);
> +
> +	if (try_to_inc_min_seq(lruvec, swappiness))
> +		scanned++;

we are doing this before we shrink the page list. Any reason to do this before?

> +
> +	if (get_nr_gens(lruvec, LRU_GEN_FILE) == MIN_NR_GENS)
> +		scanned = 0;
Yu Zhao March 22, 2022, 5:55 a.m. UTC | #14
On Mon, Mar 21, 2022 at 11:27 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > +static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
> > +{
> > +     int prev, next;
> > +     int type, zone;
> > +     struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > +
> > +     spin_lock_irq(&lruvec->lru_lock);
> > +
> > +     VM_BUG_ON(!seq_is_valid(lruvec));
> > +
> > +     if (max_seq != lrugen->max_seq)
> > +             goto unlock;
> > +
> > +     inc_min_seq(lruvec);
>
> Can this min_seq update result in pages considered oldest becoming young?
> I.e., if we had seq values of 0 - 3 and we need ageing, the new min_seq
> and max_seq values will now become 1 - 4. What happens to pages with
> generation value 0, which was the oldest generation earlier and is the
> youngest now?

If anon pages are not reclaimable, e.g., no swapfile, they won't be
scanned at all. So their coldness/hotness don't matter -- they don't
need to be on lrugen->lists[] at all.

If there is a swapfile but it's full, then yes, the inversion will
happen. This can be handled by moving pages from the oldest generation
to the tail of the second oldest generation, which maintains the LRU
order.

In fact, both were handled in the previous versions [1] [2]. They were
removed in v6 for simplicity.

[1] https://lore.kernel.org/linux-mm/20211111041510.402534-5-yuzhao@google.com/
[2] https://lore.kernel.org/linux-mm/20211111041510.402534-7-yuzhao@google.com/

> > +static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
> > +{
> > +     int type;
> > +     int scanned;
> > +     int reclaimed;
> > +     LIST_HEAD(list);
> > +     struct folio *folio;
> > +     enum vm_event_item item;
> > +     struct reclaim_stat stat;
> > +     struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > +     struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > +
> > +     spin_lock_irq(&lruvec->lru_lock);
> > +
> > +     scanned = isolate_folios(lruvec, sc, swappiness, &type, &list);
> > +
> > +     if (try_to_inc_min_seq(lruvec, swappiness))
> > +             scanned++;
>
> we are doing this before we shrink the page list. Any reason to do this before?

We have isolated pages from lrugen->lists[], and we might have
exhausted all pages in the oldest generations, i.e.,
lrugen->lists[min_seq] is now empty. Incrementing min_seq after
shrink_page_list() is not wrong. However, it's better we do it ASAP so
that concurrent reclaimers are less likely to see a stale min_seq and
come here under the false impression that they'd make some progress.
(Instead, they will go to the aging path and inc_max_seq() first
before coming here.)
Barry Song March 23, 2022, 7:47 a.m. UTC | #15
On Sat, Mar 19, 2022 at 4:11 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Fri, Mar 18, 2022 at 9:01 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > > +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > > +{
> > > +       unsigned long old_flags, new_flags;
> > > +       int type = folio_is_file_lru(folio);
> > > +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > > +       int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> > > +
> > > +       do {
> > > +               new_flags = old_flags = READ_ONCE(folio->flags);
> > > +               VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
> > > +
> > > +               new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> > > +               new_gen = (old_gen + 1) % MAX_NR_GENS;
> >
> > new_gen is assigned twice, I assume you mean
> >                old_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> >                new_gen = (old_gen + 1) % MAX_NR_GENS;
> >
> > or do you always mean new_gen = lru_gen_from_seq(min_seq) + 1?
>
> Thanks a lot for your attention to details!
>
> The first line should be in the next patch but I overlooked it during the
> last refactoring:

Thanks for the clarification. So an unmapped file-backed page which is
accessed only through system calls will always be in either min_seq or
min_seq + 1? It has no chance to be in max_seq the way a faulted-in
mapped file page does?

>
>   new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> + /* folio_update_gen() has promoted this page? */
> + if (new_gen >= 0 && new_gen != old_gen)
> + return new_gen;
> +
>   new_gen = (old_gen + 1) % MAX_NR_GENS;

Thanks
Barry
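
For reference, a sketch of how the cmpxchg loop in folio_inc_gen() reads with
the quoted fix folded in, assembled from the patch below and the snippet
above (an illustration of the intent, not the final merged code):

	do {
		new_flags = old_flags = READ_ONCE(folio->flags);
		VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);

		new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
		/* folio_update_gen() has promoted this page? */
		if (new_gen >= 0 && new_gen != old_gen)
			return new_gen;

		new_gen = (old_gen + 1) % MAX_NR_GENS;

		new_flags &= ~LRU_GEN_MASK;
		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
		new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
		/* for folio_end_writeback() */
		if (reclaiming)
			new_flags |= BIT(PG_reclaim);
	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
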
Yu Zhao March 24, 2022, 6:24 a.m. UTC | #16
On Wed, Mar 23, 2022 at 1:47 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, Mar 19, 2022 at 4:11 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Fri, Mar 18, 2022 at 9:01 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > > +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > > > +{
> > > > +       unsigned long old_flags, new_flags;
> > > > +       int type = folio_is_file_lru(folio);
> > > > +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > > > +       int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> > > > +
> > > > +       do {
> > > > +               new_flags = old_flags = READ_ONCE(folio->flags);
> > > > +               VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
> > > > +
> > > > +               new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> > > > +               new_gen = (old_gen + 1) % MAX_NR_GENS;
> > >
> > > new_gen is assigned twice, I assume you mean
> > >                old_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> > >                new_gen = (old_gen + 1) % MAX_NR_GENS;
> > >
> > > or do you always mean new_gen = lru_gen_from_seq(min_seq) + 1?
> >
> > Thanks a lot for your attention to details!
> >
> > The first line should be in the next patch but I overlooked it during the
> > last refactoring:
>
> Thanks for the clarification. So an unmapped file-backed page which is
> accessed only through system calls will always be in either min_seq or
> min_seq + 1? It has no chance to be in max_seq the way a faulted-in
> mapped file page does?

That's right. The rationale is documented here under the `Assumptions`
section [1]. This is also related to Aneesh's question about why MGLRU
doesn't need additional heuristics for VM_EXEC pages [2]. Unmapped
file pages weaken the protection of executable pages under heavy
buffered IO workloads like Java NIO.

[1] https://lore.kernel.org/linux-mm/20220309021230.721028-15-yuzhao@google.com/
[2] https://lore.kernel.org/linux-mm/CAOUHufYfpiGdLSdffvzDqaD5oYFG99oDJ2xgQd2Ph77OFR5NAA@mail.gmail.com/
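
To make the tier placement such pages rely on concrete, here is a worked
illustration (the layout is added for clarity; the numbers follow
folio_inc_refs(), folio_lru_tier() and the order_base_2(N) rule in the patch
below, with LRU_REFS_WIDTH == 2, i.e., MAX_NR_TIERS == 4):

	/*
	 * N accesses via fd   PG_referenced  PG_workingset  LRU_REFS  tier
	 * 0 or 1              0 or 1         0              0         0
	 * 2                   1              1              0         1
	 * 3 or 4              1              1              1 or 2    2
	 * 5 or more           1              1              3 (max)   3
	 */

Incrementing the refs by itself never moves the page to a younger generation;
at eviction time the feedback loop may protect tiers 1-3 by bumping the page
to min_seq+1 (see sort_folio() below), using tier 0 as the baseline. That is
why such pages stay in the two oldest generations.
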
Barry Song March 24, 2022, 8:13 a.m. UTC | #17
On Thu, Mar 24, 2022 at 7:24 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Wed, Mar 23, 2022 at 1:47 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, Mar 19, 2022 at 4:11 PM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > On Fri, Mar 18, 2022 at 9:01 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > > +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > > > > +{
> > > > > +       unsigned long old_flags, new_flags;
> > > > > +       int type = folio_is_file_lru(folio);
> > > > > +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > > > > +       int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> > > > > +
> > > > > +       do {
> > > > > +               new_flags = old_flags = READ_ONCE(folio->flags);
> > > > > +               VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
> > > > > +
> > > > > +               new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> > > > > +               new_gen = (old_gen + 1) % MAX_NR_GENS;
> > > >
> > > > new_gen is assigned twice, I assume you mean
> > > >                old_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> > > >                new_gen = (old_gen + 1) % MAX_NR_GENS;
> > > >
> > > > or do you always mean new_gen = lru_gen_from_seq(min_seq) + 1?
> > >
> > > Thanks a lot for your attention to details!
> > >
> > > The first line should be in the next patch but I overlooked it during the
> > > last refactoring:
> >
> > Thanks for the clarification. So an unmapped file-backed page which is
> > accessed only through system calls will always be in either min_seq or
> > min_seq + 1? It has no chance to be in max_seq the way a faulted-in
> > mapped file page does?
>
> That's right. The rationale is documented here under the `Assumptions`
> section [1]. This is also related to Aneesh's question about why MGLRU
> doesn't need additional heuristics for VM_EXEC pages [2]. Unmapped
> file pages weaken the protection of executable pages under heavy
> buffered IO workloads like Java NIO.

OK. This is probably right.
I will also run a test that maltreats unmapped pages in the vanilla LRU;
the PoC code is as follows (not tested yet):

Subject: [PATCH 1/1] mm: vmscan: maltreat unmapped file-backed pages

[This patch has not been tested yet.]

A lesson we learned from MGLRU is that mapped file-backed pages are
much more important than unmapped ones. So this patch doesn't move
unmapped pages to the active list on their second access; instead, it
keeps them on the inactive list and abuses PG_workingset to tell the
reclaim path that such a page is a relatively hot file-backed page and
should be kept on the inactive list rather than reclaimed.

---
 mm/swap.c   | 34 ++++++++++++++++++++++------------
 mm/vmscan.c |  6 ++++--
 2 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index e65e7520bebf..cb0c6e704f2e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -470,18 +470,28 @@ void folio_mark_accessed(struct folio *folio)
 	 * evictable page accessed has no effect.
 	 */
 	} else if (!folio_test_active(folio)) {
-		/*
-		 * If the page is on the LRU, queue it for activation via
-		 * lru_pvecs.activate_page. Otherwise, assume the page is on a
-		 * pagevec, mark it active and it'll be moved to the active
-		 * LRU on the next drain.
-		 */
-		if (folio_test_lru(folio))
-			folio_activate(folio);
-		else
-			__lru_cache_activate_folio(folio);
-		folio_clear_referenced(folio);
-		workingset_activation(folio);
+		if (folio_mapped(folio)) {
+			/*
+			 * If the mapped page is on the LRU, queue it for activation via
+			 * lru_pvecs.activate_page. Otherwise, assume the page is on a
+			 * pagevec, mark it active and it'll be moved to the active
+			 * LRU on the next drain.
+			 */
+			if (folio_test_lru(folio))
+				folio_activate(folio);
+			else
+				__lru_cache_activate_folio(folio);
+			folio_clear_referenced(folio);
+			workingset_activation(folio);
+		} else {
+			/*
+			 * We maltreat unmapped file-backed pages and abuse the
+			 * PG_workingset flag to let the eviction know this page is a
+			 * relatively hot file page, so the eviction can move it back
+			 * to the head of the inactive list.
+			 */
+			folio_set_workingset(folio);
+		}
 	}
 	if (folio_test_idle(folio))
 		folio_clear_idle(folio);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d6f3c9812f97..56a66eb4a3f7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1393,12 +1393,14 @@ enum page_references {
 static enum page_references page_check_references(struct page *page,
						  struct scan_control *sc)
 {
-	int referenced_ptes, referenced_page;
+	int referenced_ptes, referenced_page, workingset;
 	unsigned long vm_flags;
 
 	referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
					  &vm_flags);
 	referenced_page = TestClearPageReferenced(page);
+	workingset = page_is_file_lru(page) && !page_mapped(page) &&
+		     TestClearPageWorkingset(page);
 
 	/*
 	 * Mlock lost the isolation race with us.  Let try_to_unmap()
@@ -1438,7 +1440,7 @@ static enum page_references page_check_references(struct page *page,
 
 	/* Reclaim if clean, defer dirty pages to writeback */
 	if (referenced_page && !PageSwapBacked(page))
-		return PAGEREF_RECLAIM_CLEAN;
+		return workingset ? PAGEREF_KEEP : PAGEREF_RECLAIM_CLEAN;
 
 	return PAGEREF_RECLAIM;
 }

>
> [1] https://lore.kernel.org/linux-mm/20220309021230.721028-15-yuzhao@google.com/
> [2] https://lore.kernel.org/linux-mm/CAOUHufYfpiGdLSdffvzDqaD5oYFG99oDJ2xgQd2Ph77OFR5NAA@mail.gmail.com/

Thanks
Barry

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c1162659d824..1e3e6dd90c0f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -227,6 +227,7 @@  int overcommit_policy_handler(struct ctl_table *, int, void *, size_t *,
 #define PAGE_ALIGNED(addr)	IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
 
 #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
+#define lru_to_folio(head) (list_entry((head)->prev, struct folio, lru))
 
 void setup_initial_init_mm(void *start_code, void *end_code,
 			   void *end_data, void *brk);
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index e3594171b421..15a04a9b5560 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -119,6 +119,19 @@  static inline int lru_gen_from_seq(unsigned long seq)
 	return seq % MAX_NR_GENS;
 }
 
+static inline int lru_hist_from_seq(unsigned long seq)
+{
+	return seq % NR_HIST_GENS;
+}
+
+static inline int lru_tier_from_refs(int refs)
+{
+	VM_BUG_ON(refs > BIT(LRU_REFS_WIDTH));
+
+	/* see the comment on MAX_NR_TIERS */
+	return order_base_2(refs + 1);
+}
+
 static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
 {
 	unsigned long max_seq = lruvec->lrugen.max_seq;
@@ -164,6 +177,15 @@  static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *foli
 		__update_lru_size(lruvec, lru, zone, -delta);
 		return;
 	}
+
+	/* promotion */
+	if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) {
+		__update_lru_size(lruvec, lru, zone, -delta);
+		__update_lru_size(lruvec, lru + LRU_ACTIVE, zone, delta);
+	}
+
+	/* demotion requires isolation, e.g., lru_deactivate_fn() */
+	VM_BUG_ON(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen));
 }
 
 static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
@@ -229,6 +251,8 @@  static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio,
 		gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
 
 		new_flags &= ~LRU_GEN_MASK;
+		if ((new_flags & LRU_REFS_FLAGS) != LRU_REFS_FLAGS)
+			new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
 		/* for shrink_page_list() */
 		if (reclaiming)
 			new_flags &= ~(BIT(PG_referenced) | BIT(PG_reclaim));
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a88e27d85693..307c5c24c7ac 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -333,6 +333,29 @@  enum lruvec_flags {
 #define MIN_NR_GENS		2U
 #define MAX_NR_GENS		4U
 
+/*
+ * Each generation is divided into multiple tiers. Tiers represent different
+ * ranges of numbers of accesses through file descriptors. A page accessed N
+ * times through file descriptors is in tier order_base_2(N). A page in the
+ * first tier (N=0,1) is marked by PG_referenced unless it was faulted in
+ * through page tables or read ahead. A page in any other tier (N>1) is marked
+ * by PG_referenced and PG_workingset. Two additional bits in folio->flags are
+ * required to support four tiers.
+ *
+ * In contrast to moving across generations which requires the LRU lock, moving
+ * across tiers only requires operations on folio->flags and therefore has a
+ * negligible cost in the buffered access path. In the eviction path,
+ * comparisons of refaulted/(evicted+protected) from the first tier and the
+ * rest infer whether pages accessed multiple times through file descriptors
+ * are statistically hot and thus worth protecting.
+ *
+ * MAX_NR_TIERS is set to 4 so that the multi-gen LRU can support twice the
+ * number of categories of the active/inactive LRU when tracking accesses
+ * through file descriptors.
+ */
+#define MAX_NR_TIERS		4U
+#define LRU_REFS_FLAGS		(BIT(PG_referenced) | BIT(PG_workingset))
+
 #ifndef __GENERATING_BOUNDS_H
 
 struct lruvec;
@@ -347,6 +370,16 @@  enum {
 	LRU_GEN_FILE,
 };
 
+#define MIN_LRU_BATCH		BITS_PER_LONG
+#define MAX_LRU_BATCH		(MIN_LRU_BATCH * 128)
+
+/* whether to keep historical stats from evicted generations */
+#ifdef CONFIG_LRU_GEN_STATS
+#define NR_HIST_GENS		MAX_NR_GENS
+#else
+#define NR_HIST_GENS		1U
+#endif
+
 /*
  * The youngest generation number is stored in max_seq for both anon and file
  * types as they are aged on an equal footing. The oldest generation numbers are
@@ -366,6 +399,15 @@  struct lru_gen_struct {
 	struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
 	/* the sizes of the above lists */
 	unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* the exponential moving average of refaulted */
+	unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
+	/* the exponential moving average of evicted+protected */
+	unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS];
+	/* the first tier doesn't need protection, hence the minus one */
+	unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1];
+	/* can be modified without holding the LRU lock */
+	atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
+	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 };
 
 void lru_gen_init_lruvec(struct lruvec *lruvec);
diff --git a/kernel/bounds.c b/kernel/bounds.c
index e08fb89f87f4..10dd9e6b03e5 100644
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -24,7 +24,7 @@  int main(void)
 	DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
 #ifdef CONFIG_LRU_GEN
 	DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1));
-	DEFINE(LRU_REFS_WIDTH, 0);
+	DEFINE(LRU_REFS_WIDTH, MAX_NR_TIERS - 2);
 #else
 	DEFINE(LRU_GEN_WIDTH, 0);
 	DEFINE(LRU_REFS_WIDTH, 0);
diff --git a/mm/Kconfig b/mm/Kconfig
index 747ab1690bcf..804c2bca8205 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -900,6 +900,15 @@  config LRU_GEN
 	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
 	help
 	  A high performance LRU implementation for memory overcommit.
+
+config LRU_GEN_STATS
+	bool "Full stats for debugging"
+	depends on LRU_GEN
+	help
+	  Do not enable this option unless you plan to look at historical stats
+	  from evicted generations for debugging purpose.
+
+	  This option has a per-memcg and per-node memory overhead.
 # }
 
 source "mm/damon/Kconfig"
diff --git a/mm/swap.c b/mm/swap.c
index e5f2ab3dab4a..f5c0bcac8dcd 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -407,6 +407,43 @@  static void __lru_cache_activate_folio(struct folio *folio)
 	local_unlock(&lru_pvecs.lock);
 }
 
+#ifdef CONFIG_LRU_GEN
+static void folio_inc_refs(struct folio *folio)
+{
+	unsigned long refs;
+	unsigned long old_flags, new_flags;
+
+	if (folio_test_unevictable(folio))
+		return;
+
+	/* see the comment on MAX_NR_TIERS */
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+
+		if (!(new_flags & BIT(PG_referenced))) {
+			new_flags |= BIT(PG_referenced);
+			continue;
+		}
+
+		if (!(new_flags & BIT(PG_workingset))) {
+			new_flags |= BIT(PG_workingset);
+			continue;
+		}
+
+		refs = new_flags & LRU_REFS_MASK;
+		refs = min(refs + BIT(LRU_REFS_PGOFF), LRU_REFS_MASK);
+
+		new_flags &= ~LRU_REFS_MASK;
+		new_flags |= refs;
+	} while (new_flags != old_flags &&
+		 cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+}
+#else
+static void folio_inc_refs(struct folio *folio)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
 /*
  * Mark a page as having seen activity.
  *
@@ -419,6 +456,11 @@  static void __lru_cache_activate_folio(struct folio *folio)
  */
 void folio_mark_accessed(struct folio *folio)
 {
+	if (lru_gen_enabled()) {
+		folio_inc_refs(folio);
+		return;
+	}
+
 	if (!folio_test_referenced(folio)) {
 		folio_set_referenced(folio);
 	} else if (folio_test_unevictable(folio)) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 65eb668abf2d..91a827ff665d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1287,9 +1287,11 @@  static int __remove_mapping(struct address_space *mapping, struct page *page,
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
-		mem_cgroup_swapout(page, swap);
+
+		/* get a shadow entry before mem_cgroup_swapout() clears folio_memcg() */
 		if (reclaimed && !mapping_exiting(mapping))
 			shadow = workingset_eviction(page, target_memcg);
+		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page, swap, shadow);
 		xa_unlock_irq(&mapping->i_pages);
 		put_swap_page(page, swap);
@@ -2723,6 +2725,9 @@  static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long file;
 	struct lruvec *target_lruvec;
 
+	if (lru_gen_enabled())
+		return;
+
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
 
 	/*
@@ -3048,11 +3053,38 @@  static bool can_age_anon_pages(struct pglist_data *pgdat,
  *                          shorthand helpers
  ******************************************************************************/
 
+#define DEFINE_MAX_SEQ(lruvec)						\
+	unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq)
+
+#define DEFINE_MIN_SEQ(lruvec)						\
+	unsigned long min_seq[ANON_AND_FILE] = {			\
+		READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_ANON]),	\
+		READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]),	\
+	}
+
 #define for_each_gen_type_zone(gen, type, zone)				\
 	for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)			\
 		for ((type) = 0; (type) < ANON_AND_FILE; (type)++)	\
 			for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
 
+static int folio_lru_gen(struct folio *folio)
+{
+	unsigned long flags = READ_ONCE(folio->flags);
+
+	return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+}
+
+static int folio_lru_tier(struct folio *folio)
+{
+	int refs;
+	unsigned long flags = READ_ONCE(folio->flags);
+
+	refs = (flags & LRU_REFS_FLAGS) == LRU_REFS_FLAGS ?
+	       ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + 1 : 0;
+
+	return lru_tier_from_refs(refs);
+}
+
 static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
 {
 	struct pglist_data *pgdat = NODE_DATA(nid);
@@ -3071,6 +3103,735 @@  static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
 	return pgdat ? &pgdat->__lruvec : NULL;
 }
 
+static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+	if (!can_demote(pgdat->node_id, sc) &&
+	    mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
+		return 0;
+
+	return mem_cgroup_swappiness(memcg);
+}
+
+static int get_nr_gens(struct lruvec *lruvec, int type)
+{
+	return lruvec->lrugen.max_seq - lruvec->lrugen.min_seq[type] + 1;
+}
+
+static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
+{
+	/* see the comment on lru_gen_struct */
+	return get_nr_gens(lruvec, LRU_GEN_FILE) >= MIN_NR_GENS &&
+	       get_nr_gens(lruvec, LRU_GEN_FILE) <= get_nr_gens(lruvec, LRU_GEN_ANON) &&
+	       get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS;
+}
+
+/******************************************************************************
+ *                          refault feedback loop
+ ******************************************************************************/
+
+/*
+ * A feedback loop based on Proportional-Integral-Derivative (PID) controller.
+ *
+ * The P term is refaulted/(evicted+protected) from a tier in the generation
+ * currently being evicted; the I term is the exponential moving average of the
+ * P term over the generations previously evicted, using the smoothing factor
+ * 1/2; the D term isn't supported.
+ *
+ * The setpoint (SP) is always the first tier of one type; the process variable
+ * (PV) is either any tier of the other type or any other tier of the same
+ * type.
+ *
+ * The error is the difference between the SP and the PV; the correction is
+ * to turn off protection when SP>PV or to turn on protection when SP<PV.
+ *
+ * For future optimizations:
+ * 1. The D term may discount the other two terms over time so that long-lived
+ *    generations can resist stale information.
+ */
+struct ctrl_pos {
+	unsigned long refaulted;
+	unsigned long total;
+	int gain;
+};
+
+static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
+			  struct ctrl_pos *pos)
+{
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	int hist = lru_hist_from_seq(lrugen->min_seq[type]);
+
+	pos->refaulted = lrugen->avg_refaulted[type][tier] +
+			 atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+	pos->total = lrugen->avg_total[type][tier] +
+		     atomic_long_read(&lrugen->evicted[hist][type][tier]);
+	if (tier)
+		pos->total += lrugen->protected[hist][type][tier - 1];
+	pos->gain = gain;
+}
+
+static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)
+{
+	int hist, tier;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	bool clear = carryover ? NR_HIST_GENS == 1 : NR_HIST_GENS > 1;
+	unsigned long seq = carryover ? lrugen->min_seq[type] : lrugen->max_seq + 1;
+
+	lockdep_assert_held(&lruvec->lru_lock);
+
+	if (!carryover && !clear)
+		return;
+
+	hist = lru_hist_from_seq(seq);
+
+	for (tier = 0; tier < MAX_NR_TIERS; tier++) {
+		if (carryover) {
+			unsigned long sum;
+
+			sum = lrugen->avg_refaulted[type][tier] +
+			      atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+			WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2);
+
+			sum = lrugen->avg_total[type][tier] +
+			      atomic_long_read(&lrugen->evicted[hist][type][tier]);
+			if (tier)
+				sum += lrugen->protected[hist][type][tier - 1];
+			WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2);
+		}
+
+		if (clear) {
+			atomic_long_set(&lrugen->refaulted[hist][type][tier], 0);
+			atomic_long_set(&lrugen->evicted[hist][type][tier], 0);
+			if (tier)
+				WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 0);
+		}
+	}
+}
+
+static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
+{
+	/*
+	 * Return true if the PV has a limited number of refaults or a lower
+	 * refaulted/total than the SP.
+	 */
+	return pv->refaulted < MIN_LRU_BATCH ||
+	       pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <=
+	       (sp->refaulted + 1) * pv->total * pv->gain;
+}
+
+/******************************************************************************
+ *                          the aging
+ ******************************************************************************/
+
+static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+	unsigned long old_flags, new_flags;
+	int type = folio_is_file_lru(folio);
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+		VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
+
+		new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+		new_gen = (old_gen + 1) % MAX_NR_GENS;
+
+		new_flags &= ~LRU_GEN_MASK;
+		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
+		new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
+		/* for folio_end_writeback() */
+		if (reclaiming)
+			new_flags |= BIT(PG_reclaim);
+	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+	lru_gen_update_size(lruvec, folio, old_gen, new_gen);
+
+	return new_gen;
+}
+
+static void inc_min_seq(struct lruvec *lruvec)
+{
+	int type;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
+			continue;
+
+		reset_ctrl_pos(lruvec, type, true);
+		WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
+	}
+}
+
+static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
+{
+	int gen, type, zone;
+	bool success = false;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	DEFINE_MIN_SEQ(lruvec);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		while (min_seq[type] + MIN_NR_GENS <= lrugen->max_seq) {
+			gen = lru_gen_from_seq(min_seq[type]);
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+				if (!list_empty(&lrugen->lists[gen][type][zone]))
+					goto next;
+			}
+
+			min_seq[type]++;
+		}
+next:
+		;
+	}
+
+	/* see the comment on lru_gen_struct */
+	if (can_swap) {
+		min_seq[LRU_GEN_ANON] = min(min_seq[LRU_GEN_ANON], min_seq[LRU_GEN_FILE]);
+		min_seq[LRU_GEN_FILE] = max(min_seq[LRU_GEN_ANON], lrugen->min_seq[LRU_GEN_FILE]);
+	}
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		if (min_seq[type] == lrugen->min_seq[type])
+			continue;
+
+		reset_ctrl_pos(lruvec, type, true);
+		WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
+		success = true;
+	}
+
+	return success;
+}
+
+static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
+{
+	int prev, next;
+	int type, zone;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	if (max_seq != lrugen->max_seq)
+		goto unlock;
+
+	inc_min_seq(lruvec);
+
+	/* update the active/inactive LRU sizes for compatibility */
+	prev = lru_gen_from_seq(lrugen->max_seq - 1);
+	next = lru_gen_from_seq(lrugen->max_seq + 1);
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+			enum lru_list lru = type * LRU_INACTIVE_FILE;
+			long delta = lrugen->nr_pages[prev][type][zone] -
+				     lrugen->nr_pages[next][type][zone];
+
+			if (!delta)
+				continue;
+
+			__update_lru_size(lruvec, lru, zone, delta);
+			__update_lru_size(lruvec, lru + LRU_ACTIVE, zone, -delta);
+		}
+	}
+
+	for (type = 0; type < ANON_AND_FILE; type++)
+		reset_ctrl_pos(lruvec, type, false);
+
+	/* make sure preceding modifications appear */
+	smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
+unlock:
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
+			     unsigned long *min_seq, bool can_swap, bool *need_aging)
+{
+	int gen, type, zone;
+	long old = 0;
+	long young = 0;
+	long total = 0;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		unsigned long seq;
+
+		for (seq = min_seq[type]; seq <= max_seq; seq++) {
+			long size = 0;
+
+			gen = lru_gen_from_seq(seq);
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++)
+				size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
+
+			total += size;
+			if (seq == max_seq)
+				young += size;
+			if (seq + MIN_NR_GENS == max_seq)
+				old += size;
+		}
+	}
+
+	/* try to spread pages out across MIN_NR_GENS+1 generations */
+	if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS > max_seq)
+		*need_aging = true;
+	else if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS < max_seq)
+		*need_aging = false;
+	else if (young * MIN_NR_GENS > total)
+		*need_aging = true;
+	else if (old * (MIN_NR_GENS + 2) < total)
+		*need_aging = true;
+	else
+		*need_aging = false;
+
+	return total > 0 ? total : 0;
+}
+
+static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+	bool need_aging;
+	long nr_to_scan;
+	int swappiness = get_swappiness(lruvec, sc);
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	mem_cgroup_calculate_protection(NULL, memcg);
+
+	if (mem_cgroup_below_min(memcg))
+		return;
+
+	nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, swappiness, &need_aging);
+	if (!nr_to_scan)
+		return;
+
+	nr_to_scan >>= sc->priority;
+
+	if (!mem_cgroup_online(memcg))
+		nr_to_scan++;
+
+	if (nr_to_scan && need_aging && (!mem_cgroup_below_low(memcg) || sc->memcg_low_reclaim))
+		inc_max_seq(lruvec, max_seq);
+}
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+	struct mem_cgroup *memcg;
+
+	VM_BUG_ON(!current_is_kswapd());
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+		age_lruvec(lruvec, sc);
+
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+}
+
+/******************************************************************************
+ *                          the eviction
+ ******************************************************************************/
+
+static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
+{
+	bool success;
+	int gen = folio_lru_gen(folio);
+	int type = folio_is_file_lru(folio);
+	int zone = folio_zonenum(folio);
+	int tier = folio_lru_tier(folio);
+	int delta = folio_nr_pages(folio);
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON_FOLIO(gen >= MAX_NR_GENS, folio);
+
+	if (!folio_evictable(folio)) {
+		success = lru_gen_del_folio(lruvec, folio, true);
+		VM_BUG_ON_FOLIO(!success, folio);
+		folio_set_unevictable(folio);
+		lruvec_add_folio(lruvec, folio);
+		__count_vm_events(UNEVICTABLE_PGCULLED, delta);
+		return true;
+	}
+
+	if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
+		success = lru_gen_del_folio(lruvec, folio, true);
+		VM_BUG_ON_FOLIO(!success, folio);
+		folio_set_swapbacked(folio);
+		lruvec_add_folio_tail(lruvec, folio);
+		return true;
+	}
+
+	if (tier > tier_idx) {
+		int hist = lru_hist_from_seq(lrugen->min_seq[type]);
+
+		gen = folio_inc_gen(lruvec, folio, false);
+		list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
+
+		WRITE_ONCE(lrugen->protected[hist][type][tier - 1],
+			   lrugen->protected[hist][type][tier - 1] + delta);
+		__mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+		return true;
+	}
+
+	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
+	    (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
+		gen = folio_inc_gen(lruvec, folio, true);
+		list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
+		return true;
+	}
+
+	return false;
+}
+
+static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc)
+{
+	bool success;
+
+	if (!sc->may_unmap && folio_mapped(folio))
+		return false;
+
+	if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) &&
+	    (folio_test_dirty(folio) ||
+	     (folio_test_anon(folio) && !folio_test_swapcache(folio))))
+		return false;
+
+	if (!folio_try_get(folio))
+		return false;
+
+	if (!folio_test_clear_lru(folio)) {
+		folio_put(folio);
+		return false;
+	}
+
+	success = lru_gen_del_folio(lruvec, folio, true);
+	VM_BUG_ON_FOLIO(!success, folio);
+
+	return true;
+}
+
+static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
+		       int type, int tier, struct list_head *list)
+{
+	int gen, zone;
+	enum vm_event_item item;
+	int sorted = 0;
+	int scanned = 0;
+	int isolated = 0;
+	int remaining = MAX_LRU_BATCH;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	VM_BUG_ON(!list_empty(list));
+
+	if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
+		return 0;
+
+	gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	for (zone = sc->reclaim_idx; zone >= 0; zone--) {
+		LIST_HEAD(moved);
+		int skipped = 0;
+		struct list_head *head = &lrugen->lists[gen][type][zone];
+
+		while (!list_empty(head)) {
+			struct folio *folio = lru_to_folio(head);
+			int delta = folio_nr_pages(folio);
+
+			VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+			VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
+			VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
+			VM_BUG_ON_FOLIO(folio_zonenum(folio) != zone, folio);
+
+			scanned += delta;
+
+			if (sort_folio(lruvec, folio, tier))
+				sorted += delta;
+			else if (isolate_folio(lruvec, folio, sc)) {
+				list_add(&folio->lru, list);
+				isolated += delta;
+			} else {
+				list_move(&folio->lru, &moved);
+				skipped += delta;
+			}
+
+			if (!--remaining || max(isolated, skipped) >= MIN_LRU_BATCH)
+				break;
+		}
+
+		if (skipped) {
+			list_splice(&moved, head);
+			__count_zid_vm_events(PGSCAN_SKIP, zone, skipped);
+		}
+
+		if (!remaining || isolated >= MIN_LRU_BATCH)
+			break;
+	}
+
+	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
+	if (!cgroup_reclaim(sc)) {
+		__count_vm_events(item, isolated);
+		__count_vm_events(PGREFILL, sorted);
+	}
+	__count_memcg_events(memcg, item, isolated);
+	__count_memcg_events(memcg, PGREFILL, sorted);
+	__count_vm_events(PGSCAN_ANON + type, isolated);
+
+	/*
+	 * There might not be eligible pages due to reclaim_idx, may_unmap and
+	 * may_writepage. Check the remaining to prevent livelock if there is no
+	 * progress.
+	 */
+	return isolated || !remaining ? scanned : 0;
+}
+
+static int get_tier_idx(struct lruvec *lruvec, int type)
+{
+	int tier;
+	struct ctrl_pos sp, pv;
+
+	/*
+	 * To leave a margin for fluctuations, use a larger gain factor (1:2).
+	 * This value is chosen because any other tier would have at least twice
+	 * as many refaults as the first tier.
+	 */
+	read_ctrl_pos(lruvec, type, 0, 1, &sp);
+	for (tier = 1; tier < MAX_NR_TIERS; tier++) {
+		read_ctrl_pos(lruvec, type, tier, 2, &pv);
+		if (!positive_ctrl_err(&sp, &pv))
+			break;
+	}
+
+	return tier - 1;
+}
+
+static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx)
+{
+	int type, tier;
+	struct ctrl_pos sp, pv;
+	int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness };
+
+	/*
+	 * Compare the first tier of anon with that of file to determine which
+	 * type to scan. Also need to compare other tiers of the selected type
+	 * with the first tier of the other type to determine the last tier (of
+	 * the selected type) to evict.
+	 */
+	read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp);
+	read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv);
+	type = positive_ctrl_err(&sp, &pv);
+
+	read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp);
+	for (tier = 1; tier < MAX_NR_TIERS; tier++) {
+		read_ctrl_pos(lruvec, type, tier, gain[type], &pv);
+		if (!positive_ctrl_err(&sp, &pv))
+			break;
+	}
+
+	*tier_idx = tier - 1;
+
+	return type;
+}
+
+static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
+			  int *type_scanned, struct list_head *list)
+{
+	int i;
+	int type;
+	int scanned;
+	int tier = -1;
+	DEFINE_MIN_SEQ(lruvec);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	/*
+	 * Try to make the obvious choice first. When anon and file are both
+	 * available from the same generation, interpret swappiness 1 as file
+	 * first and 200 as anon first.
+	 */
+	if (!swappiness)
+		type = LRU_GEN_FILE;
+	else if (min_seq[LRU_GEN_ANON] < min_seq[LRU_GEN_FILE])
+		type = LRU_GEN_ANON;
+	else if (swappiness == 1)
+		type = LRU_GEN_FILE;
+	else if (swappiness == 200)
+		type = LRU_GEN_ANON;
+	else
+		type = get_type_to_scan(lruvec, swappiness, &tier);
+
+	for (i = !swappiness; i < ANON_AND_FILE; i++) {
+		if (tier < 0)
+			tier = get_tier_idx(lruvec, type);
+
+		scanned = scan_folios(lruvec, sc, type, tier, list);
+		if (scanned)
+			break;
+
+		type = !type;
+		tier = -1;
+	}
+
+	*type_scanned = type;
+
+	return scanned;
+}
+
+static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
+{
+	int type;
+	int scanned;
+	int reclaimed;
+	LIST_HEAD(list);
+	struct folio *folio;
+	enum vm_event_item item;
+	struct reclaim_stat stat;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	scanned = isolate_folios(lruvec, sc, swappiness, &type, &list);
+
+	if (try_to_inc_min_seq(lruvec, swappiness))
+		scanned++;
+
+	if (get_nr_gens(lruvec, LRU_GEN_FILE) == MIN_NR_GENS)
+		scanned = 0;
+
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	if (list_empty(&list))
+		return scanned;
+
+	reclaimed = shrink_page_list(&list, pgdat, sc, &stat, false);
+
+	/*
+	 * To avoid livelock, don't add rejected pages back to the same lists
+	 * they were isolated from. See lru_gen_add_folio().
+	 */
+	list_for_each_entry(folio, &list, lru) {
+		if (folio_test_reclaim(folio) &&
+		    (folio_test_dirty(folio) || folio_test_writeback(folio)))
+			folio_clear_active(folio);
+		else if (folio_is_file_lru(folio) || folio_test_swapcache(folio))
+			folio_set_active(folio);
+
+		folio_clear_referenced(folio);
+		folio_clear_workingset(folio);
+	}
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	move_pages_to_lru(lruvec, &list);
+
+	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
+	if (!cgroup_reclaim(sc))
+		__count_vm_events(item, reclaimed);
+	__count_memcg_events(memcg, item, reclaimed);
+	__count_vm_events(PGSTEAL_ANON + type, reclaimed);
+
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	mem_cgroup_uncharge_list(&list);
+	free_unref_page_list(&list);
+
+	sc->nr_reclaimed += reclaimed;
+
+	return scanned;
+}
+
+static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool can_swap)
+{
+	bool need_aging;
+	long nr_to_scan;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	if (mem_cgroup_below_min(memcg) ||
+	    (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim))
+		return 0;
+
+	nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, can_swap, &need_aging);
+	if (!nr_to_scan)
+		return 0;
+
+	/* reset the priority if the target has been met */
+	nr_to_scan >>= sc->nr_reclaimed < sc->nr_to_reclaim ? sc->priority : DEF_PRIORITY;
+
+	if (!mem_cgroup_online(memcg))
+		nr_to_scan++;
+
+	if (!nr_to_scan)
+		return 0;
+
+	if (!need_aging)
+		return nr_to_scan;
+
+	/* leave the work to lru_gen_age_node() */
+	if (current_is_kswapd())
+		return 0;
+
+	/* try other memcgs before going to the aging path */
+	if (!cgroup_reclaim(sc) && !sc->force_deactivate) {
+		sc->skipped_deactivate = true;
+		return 0;
+	}
+
+	inc_max_seq(lruvec, max_seq);
+
+	return nr_to_scan;
+}
+
+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct blk_plug plug;
+	long scanned = 0;
+
+	lru_add_drain();
+
+	blk_start_plug(&plug);
+
+	while (true) {
+		int delta;
+		int swappiness;
+		long nr_to_scan;
+
+		if (sc->may_swap)
+			swappiness = get_swappiness(lruvec, sc);
+		else if (!cgroup_reclaim(sc) && get_swappiness(lruvec, sc))
+			swappiness = 1;
+		else
+			swappiness = 0;
+
+		nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
+		if (!nr_to_scan)
+			break;
+
+		delta = evict_folios(lruvec, sc, swappiness);
+		if (!delta)
+			break;
+
+		scanned += delta;
+		if (scanned >= nr_to_scan)
+			break;
+
+		cond_resched();
+	}
+
+	blk_finish_plug(&plug);
+}
+
 /******************************************************************************
  *                          initialization
  ******************************************************************************/
@@ -3113,6 +3874,16 @@  static int __init init_lru_gen(void)
 };
 late_initcall(init_lru_gen);
 
+#else
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+}
+
+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+}
+
 #endif /* CONFIG_LRU_GEN */
 
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
@@ -3126,6 +3897,11 @@  static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	struct blk_plug plug;
 	bool scan_adjusted;
 
+	if (lru_gen_enabled()) {
+		lru_gen_shrink_lruvec(lruvec, sc);
+		return;
+	}
+
 	get_scan_count(lruvec, sc, nr);
 
 	/* Record the original scan target for proportional adjustments later */
@@ -3630,6 +4406,9 @@  static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
 	struct lruvec *target_lruvec;
 	unsigned long refaults;
 
+	if (lru_gen_enabled())
+		return;
+
 	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
 	target_lruvec->refaults[0] = refaults;
@@ -4000,6 +4779,11 @@  static void age_active_anon(struct pglist_data *pgdat,
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
+	if (lru_gen_enabled()) {
+		lru_gen_age_node(pgdat, sc);
+		return;
+	}
+
 	if (!can_age_anon_pages(pgdat, sc))
 		return;
 
diff --git a/mm/workingset.c b/mm/workingset.c
index 8c03afe1d67c..93ee00c7e4d1 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -187,7 +187,6 @@  static unsigned int bucket_order __read_mostly;
 static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
 			 bool workingset)
 {
-	eviction >>= bucket_order;
 	eviction &= EVICTION_MASK;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
@@ -212,10 +211,116 @@  static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 
 	*memcgidp = memcgid;
 	*pgdat = NODE_DATA(nid);
-	*evictionp = entry << bucket_order;
+	*evictionp = entry;
 	*workingsetp = workingset;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+static int folio_lru_refs(struct folio *folio)
+{
+	unsigned long flags = READ_ONCE(folio->flags);
+
+	BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
+
+	/* see the comment on MAX_NR_TIERS */
+	return flags & BIT(PG_workingset) ? (flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF : 0;
+}
+
+static void *lru_gen_eviction(struct folio *folio)
+{
+	int hist, tier;
+	unsigned long token;
+	unsigned long min_seq;
+	struct lruvec *lruvec;
+	struct lru_gen_struct *lrugen;
+	int type = folio_is_file_lru(folio);
+	int refs = folio_lru_refs(folio);
+	int delta = folio_nr_pages(folio);
+	bool workingset = folio_test_workingset(folio);
+	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct pglist_data *pgdat = folio_pgdat(folio);
+
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	lrugen = &lruvec->lrugen;
+	min_seq = READ_ONCE(lrugen->min_seq[type]);
+	token = (min_seq << LRU_REFS_WIDTH) | refs;
+
+	hist = lru_hist_from_seq(min_seq);
+	tier = lru_tier_from_refs(refs + workingset);
+	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
+
+	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
+}
+
+static void lru_gen_refault(struct folio *folio, void *shadow)
+{
+	int hist, tier, refs;
+	int memcg_id;
+	bool workingset;
+	unsigned long token;
+	unsigned long min_seq;
+	struct lruvec *lruvec;
+	struct lru_gen_struct *lrugen;
+	struct mem_cgroup *memcg;
+	struct pglist_data *pgdat;
+	int type = folio_is_file_lru(folio);
+	int delta = folio_nr_pages(folio);
+
+	unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset);
+
+	refs = token & (BIT(LRU_REFS_WIDTH) - 1);
+	if (refs && !workingset)
+		return;
+
+	if (folio_pgdat(folio) != pgdat)
+		return;
+
+	rcu_read_lock();
+	memcg = folio_memcg_rcu(folio);
+	if (mem_cgroup_id(memcg) != memcg_id)
+		goto unlock;
+
+	token >>= LRU_REFS_WIDTH;
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	lrugen = &lruvec->lrugen;
+	min_seq = READ_ONCE(lrugen->min_seq[type]);
+	if (token != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH)))
+		goto unlock;
+
+	hist = lru_hist_from_seq(min_seq);
+	tier = lru_tier_from_refs(refs + workingset);
+	atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
+	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
+
+	/*
+	 * Count the following two cases as stalls:
+	 * 1. For pages accessed through page tables, hotter pages pushed out
+	 *    hot pages which refaulted immediately.
+	 * 2. For pages accessed through file descriptors, numbers of accesses
+	 *    might have been beyond the limit.
+	 */
+	if (lru_gen_in_fault() || refs + workingset == BIT(LRU_REFS_WIDTH)) {
+		folio_set_workingset(folio);
+		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
+	}
+unlock:
+	rcu_read_unlock();
+}
+
+#else
+
+static void *lru_gen_eviction(struct folio *folio)
+{
+	return NULL;
+}
+
+static void lru_gen_refault(struct folio *folio, void *shadow)
+{
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 /**
  * workingset_age_nonresident - age non-resident entries as LRU ages
  * @lruvec: the lruvec that was aged
@@ -264,10 +369,14 @@  void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
+	if (lru_gen_enabled())
+		return lru_gen_eviction(page_folio(page));
+
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
 	eviction = atomic_long_read(&lruvec->nonresident_age);
+	eviction >>= bucket_order;
 	workingset_age_nonresident(lruvec, thp_nr_pages(page));
 	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
@@ -297,7 +406,13 @@  void workingset_refault(struct folio *folio, void *shadow)
 	int memcgid;
 	long nr;
 
+	if (lru_gen_enabled()) {
+		lru_gen_refault(folio, shadow);
+		return;
+	}
+
 	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
+	eviction <<= bucket_order;
 
 	rcu_read_lock();
 	/*