
[v7,05/12] mm: multigenerational LRU: minimal implementation

Message ID: 20220208081902.3550911-6-yuzhao@google.com
Series: Multigenerational LRU Framework

Commit Message

Yu Zhao Feb. 8, 2022, 8:18 a.m. UTC
To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multigenerational LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.

The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; the demotion of cold pages happens
consequently when it increments max_seq. Since the aging is only
interested in hot pages, its complexity is O(nr_hot_pages). Promotion
in the aging path doesn't require any LRU list operations, only the
updates of the gen counter and lrugen->nr_pages[]; demotion, unless it
is the result of the increment of max_seq, requires LRU list
operations, e.g., lru_deactivate_fn().
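
For illustration, the aging trigger can be modeled in plain C as below.
This is a standalone sketch, not the kernel code; the helper name is
made up here, and MIN_NR_GENS is assumed to be 2 as defined earlier in
the series:

  #define MIN_NR_GENS 2UL

  /* run the aging once the number of generations shrinks to the minimum */
  static int need_aging(unsigned long max_seq, unsigned long min_seq)
  {
          return max_seq - min_seq + 1 <= MIN_NR_GENS;
  }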

The eviction consumes old generations. Given an lruvec, it increments
min_seq when the list indexed by min_seq%MAX_NR_GENS becomes empty. A
feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both are
available from the same generation.
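
For illustration, the type selection can be sketched in plain C as
below. This is not the kernel code: the helper name is made up, and the
running averages and gains of the PID-like feedback loop are collapsed
into a simple refaulted/evicted comparison:

  /* 0 == anon, 1 == file; returns the type to evict from */
  static int select_type(const unsigned long min_seq[2],
                         const unsigned long refaulted[2],
                         const unsigned long evicted[2])
  {
          /* evict from the older type first */
          if (min_seq[0] != min_seq[1])
                  return min_seq[1] < min_seq[0];

          /* otherwise pick the type with fewer refaults per evicted page */
          return refaulted[1] * (evicted[0] + 1) <=
                 refaulted[0] * (evicted[1] + 1);
  }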

Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses through file descriptors. A
page accessed N times through file descriptors is in tier
order_base_2(N). Tiers don't have dedicated lrugen->lists[], only bits
in folio->flags. In contrast to moving across generations, which
requires the LRU lock, moving between tiers only involves operations on
folio->flags. The feedback loop also monitors refaults over all tiers
and decides from which tiers (N>1) pages should be promoted, using the
first tier (N=0,1) as the baseline. The first tier contains single-use
unmapped clean pages, which are most likely the best eviction
candidates. The eviction promotes a page to the next generation, i.e.,
min_seq+1 rather than max_seq, if the feedback loop decides so. This
approach has the following advantages:
1) It removes the cost of activation in the buffered access path by
   inferring whether pages accessed multiple times through file
   descriptors are statistically hot and thus worth promoting in the
   eviction path.
2) It takes pages accessed through page tables into account and avoids
   overprotecting pages accessed multiple times through file
   descriptors. (Pages accessed through page tables are in the first
   tier since N=0.)
3) More tiers provide better protection for pages accessed more than
   twice through file descriptors, when under heavy buffered I/O
   workloads.
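
Below is a small standalone model of the tier arithmetic above (the
helpers are open-coded/made up for this sketch; in the kernel the
access count lives in spare bits of folio->flags and the number of
tiers comes from CONFIG_TIERS_PER_GEN):

  #include <stdio.h>

  #define MAX_NR_TIERS 4 /* CONFIG_TIERS_PER_GEN default */

  /* open-coded order_base_2(): smallest order with (1 << order) >= n */
  static int order_base_2(unsigned int n)
  {
          int order = 0;

          while ((1u << order) < n)
                  order++;
          return order;
  }

  /* a page accessed N times through file descriptors is in tier order_base_2(N) */
  static int tier_from_refs(unsigned int refs)
  {
          int tier = order_base_2(refs);

          /* saturate at the last tier (an assumption of this sketch) */
          return tier < MAX_NR_TIERS ? tier : MAX_NR_TIERS - 1;
  }

  int main(void)
  {
          for (unsigned int refs = 0; refs <= 8; refs++)
                  printf("N=%u -> tier %d\n", refs, tier_from_refs(refs));
          return 0;
  }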

Server benchmark results:
  Single workload:
    fio (buffered I/O): +[47, 49]%
                IOPS         BW
      5.17-rc2: 2242k        8759MiB/s
      patch1-5: 3321k        12.7GiB/s

  Single workload:
    memcached (anon): +[101, 105]%
                Ops/sec      KB/sec
      5.17-rc2: 476771.79    18544.31
      patch1-5: 972526.07    37826.95

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was used as a ram disk only to reduce the variance in the
    results.

    patch drivers/block/brd.c <<EOF
    99,100c99,100
    < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    < 	page = alloc_page(gfp_flags);
    ---
    > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > 	page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    EOF

    cat >>/etc/memcached.conf <<EOF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    EOF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.17-rc2
      38.05%  page_vma_mapped_walk
      20.86%  lzo1x_1_do_compress (real work)
       6.16%  do_raw_spin_lock
       4.61%  _raw_spin_unlock_irq
       2.20%  vma_interval_tree_iter_next
       2.19%  vma_interval_tree_subtree_search
       2.15%  page_referenced_one
       1.93%  anon_vma_interval_tree_iter_first
       1.65%  ptep_clear_flush
       1.00%  __zram_bvec_write

    patch1-5
      39.73%  lzo1x_1_do_compress (real work)
      14.96%  page_vma_mapped_walk
       6.97%  _raw_spin_unlock_irq
       3.07%  do_raw_spin_lock
       2.53%  anon_vma_interval_tree_iter_first
       2.04%  ptep_clear_flush
       1.82%  __zram_bvec_write
       1.76%  __anon_vma_interval_tree_subtree_search
       1.57%  memmove
       1.45%  free_unref_page_list

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    Chrome OS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
---
 include/linux/mm.h        |   1 +
 include/linux/mm_inline.h |  15 +
 include/linux/mmzone.h    |  35 ++
 mm/Kconfig                |  44 +++
 mm/swap.c                 |  46 ++-
 mm/vmscan.c               | 784 +++++++++++++++++++++++++++++++++++++-
 mm/workingset.c           | 119 +++++-
 7 files changed, 1039 insertions(+), 5 deletions(-)

Comments

Yu Zhao Feb. 8, 2022, 8:33 a.m. UTC | #1
On Tue, Feb 08, 2022 at 01:18:55AM -0700, Yu Zhao wrote:

<snipped>

> diff --git a/mm/Kconfig b/mm/Kconfig
> index 3326ee3903f3..e899623d5df0 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -892,6 +892,50 @@ config ANON_VMA_NAME
>  	  area from being merged with adjacent virtual memory areas due to the
>  	  difference in their name.
>  
> +# multigenerational LRU {
> +config LRU_GEN
> +	bool "Multigenerational LRU"
> +	depends on MMU
> +	# the following options can use up the spare bits in page flags
> +	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
> +	help
> +	  A high performance LRU implementation for memory overcommit. See
> +	  Documentation/admin-guide/mm/multigen_lru.rst and
> +	  Documentation/vm/multigen_lru.rst for details.
> +
> +config NR_LRU_GENS
> +	int "Max number of generations"
> +	depends on LRU_GEN
> +	range 4 31
> +	default 4
> +	help
> +	  Do not increase this value unless you plan to use working set
> +	  estimation and proactive reclaim to optimize job scheduling in data
> +	  centers.
> +
> +	  This option uses order_base_2(N+1) bits in page flags.
> +
> +config TIERS_PER_GEN
> +	int "Number of tiers per generation"
> +	depends on LRU_GEN
> +	range 2 4
> +	default 4
> +	help
> +	  Do not decrease this value unless you run out of spare bits in page
> +	  flags, i.e., you see the "Not enough bits in page flags" build error.
> +
> +	  This option uses N-2 bits in page flags.

Moved Kconfig to this patch as suggested by:
https://lore.kernel.org/linux-mm/Yd6uHYtjGfgqjDpw@dhcp22.suse.cz/

Added two new macros as requested here:
https://lore.kernel.org/linux-mm/87czkyzhfe.fsf@linux.ibm.com/

<snipped>

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d75a5738d1dc..5f0d92838712 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1285,9 +1285,11 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>  
>  	if (PageSwapCache(page)) {
>  		swp_entry_t swap = { .val = page_private(page) };
> -		mem_cgroup_swapout(page, swap);
> +
> +		/* get a shadow entry before mem_cgroup_swapout() clears folio_memcg() */
>  		if (reclaimed && !mapping_exiting(mapping))
>  			shadow = workingset_eviction(page, target_memcg);
> +		mem_cgroup_swapout(page, swap);
>  		__delete_from_swap_cache(page, swap, shadow);
>  		xa_unlock_irq(&mapping->i_pages);
>  		put_swap_page(page, swap);
> @@ -2721,6 +2723,9 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
>  	unsigned long file;
>  	struct lruvec *target_lruvec;
>  
> +	if (lru_gen_enabled())
> +		return;
> +
>  	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
>  
>  	/*
> @@ -3042,15 +3047,47 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
>  
>  #ifdef CONFIG_LRU_GEN
>  
> +enum {
> +	TYPE_ANON,
> +	TYPE_FILE,
> +};

Added two new macros as requested here:
https://lore.kernel.org/linux-mm/87czkyzhfe.fsf@linux.ibm.com/

<snipped>

> +static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> +{
> +	bool need_aging;
> +	long nr_to_scan;
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +	int swappiness = get_swappiness(memcg);
> +	DEFINE_MAX_SEQ(lruvec);
> +	DEFINE_MIN_SEQ(lruvec);
> +
> +	mem_cgroup_calculate_protection(NULL, memcg);
> +
> +	if (mem_cgroup_below_min(memcg))
> +		return;

Added mem_cgroup_calculate_protection() for readability as requested here:
https://lore.kernel.org/linux-mm/Ydf9RXPch5ddg%2FWC@dhcp22.suse.cz/

<snipped>
Johannes Weiner Feb. 8, 2022, 4:50 p.m. UTC | #2
Hi Yu,

Thanks for restructuring this from the last version. It's easier to
learn the new model when you start out with the bare bones, then let
optimizations and self-contained features follow later.

On Tue, Feb 08, 2022 at 01:18:55AM -0700, Yu Zhao wrote:
> To avoid confusions, the terms "promotion" and "demotion" will be
> applied to the multigenerational LRU, as a new convention; the terms
> "activation" and "deactivation" will be applied to the active/inactive
> LRU, as usual.
> 
> The aging produces young generations. Given an lruvec, it increments
> max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
> promotes hot pages to the youngest generation when it finds them
> accessed through page tables; the demotion of cold pages happens
> consequently when it increments max_seq. Since the aging is only
> interested in hot pages, its complexity is O(nr_hot_pages). Promotion
> in the aging path doesn't require any LRU list operations, only the
> updates of the gen counter and lrugen->nr_pages[]; demotion, unless
> as the result of the increment of max_seq, requires LRU list
> operations, e.g., lru_deactivate_fn().

I'm having trouble with this changelog. It opens with a footnote and
summarizes certain aspects of the implementation whose importance to
the reader aren't entirely clear at this time.

It would be better to start with a high-level overview of the problem
and how this algorithm solves it. How the reclaim algorithm needs to
find the page that is most suitable for eviction and to signal when
it's time to give up and OOM. Then explain how grouping pages into
multiple generations accomplishes that - in particular compared to the
current two use-once/use-many lists.

Explain the problem of MMU vs syscall references, and how tiering
addresses this.

Explain the significance of refaults and how the algorithm responds to
them. Not in terms of which running averages are updated, but in terms
of user-visible behavior ("will start swapping (more)" etc.)

Express *intent*, how it's supposed to behave wrt workloads and memory
pressure. The code itself will explain the how, its complexity etc.

Most reviewers will understand the fundamental challenges of page
reclaim. The difficulty is matching individual aspects of the problem
space to your individual components and design choices you have made.

Let us in on that thinking, please ;)

> @@ -892,6 +892,50 @@ config ANON_VMA_NAME
>  	  area from being merged with adjacent virtual memory areas due to the
>  	  difference in their name.
>  
> +# multigenerational LRU {
> +config LRU_GEN
> +	bool "Multigenerational LRU"
> +	depends on MMU
> +	# the following options can use up the spare bits in page flags
> +	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
> +	help
> +	  A high performance LRU implementation for memory overcommit. See
> +	  Documentation/admin-guide/mm/multigen_lru.rst and
> +	  Documentation/vm/multigen_lru.rst for details.

These files don't exist at this time, please introduce them before or
when referencing them. If they document things introduced later in the
patchset, please start with a minimal version of the file and update
it as you extend the algorithm and add optimizations etc.

It's really important to only reference previous patches, not later
ones. This allows reviewers to read the patches linearly.  Having to
search for missing pieces in patches you haven't looked at yet is bad.

> +config NR_LRU_GENS
> +	int "Max number of generations"
> +	depends on LRU_GEN
> +	range 4 31
> +	default 4
> +	help
> +	  Do not increase this value unless you plan to use working set
> +	  estimation and proactive reclaim to optimize job scheduling in data
> +	  centers.
> +
> +	  This option uses order_base_2(N+1) bits in page flags.
> +
> +config TIERS_PER_GEN
> +	int "Number of tiers per generation"
> +	depends on LRU_GEN
> +	range 2 4
> +	default 4
> +	help
> +	  Do not decrease this value unless you run out of spare bits in page
> +	  flags, i.e., you see the "Not enough bits in page flags" build error.
> +
> +	  This option uses N-2 bits in page flags.

Linus had pointed out that we shouldn't ask these questions of the
user. How do you pick numbers here? I'm familiar with workingset
estimation and proactive reclaim usecases but I wouldn't know.

Even if we removed the config option and hardcoded the number, this is
a question for kernel developers: What does "4" mean? How would
behavior differ if it were 3 or 5 instead? Presumably there is some
sort of behavior gradient. "As you increase the number of
generations/tiers, the user-visible behavior of the kernel will..."
This should really be documented.

I'd also reiterate Mel's point: Distribution kernels need to support
the full spectrum of applications and production environments. Unless
using non-defaults it's an extremely niche usecase (like compiling out
BUG() calls) compile-time options are not the right choice. If we do
need a tunable, it could make more sense to have a compile time upper
limit (to determine page flag space) combined with a runtime knob?

Thanks!
Yu Zhao Feb. 10, 2022, 2:53 a.m. UTC | #3
On Tue, Feb 08, 2022 at 11:50:09AM -0500, Johannes Weiner wrote:

<snipped>

> On Tue, Feb 08, 2022 at 01:18:55AM -0700, Yu Zhao wrote:
> > To avoid confusions, the terms "promotion" and "demotion" will be
> > applied to the multigenerational LRU, as a new convention; the terms
> > "activation" and "deactivation" will be applied to the active/inactive
> > LRU, as usual.
> > 
> > The aging produces young generations. Given an lruvec, it increments
> > max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
> > promotes hot pages to the youngest generation when it finds them
> > accessed through page tables; the demotion of cold pages happens
> > consequently when it increments max_seq. Since the aging is only
> > interested in hot pages, its complexity is O(nr_hot_pages). Promotion
> > in the aging path doesn't require any LRU list operations, only the
> > updates of the gen counter and lrugen->nr_pages[]; demotion, unless
> > as the result of the increment of max_seq, requires LRU list
> > operations, e.g., lru_deactivate_fn().
> 
> I'm having trouble with this changelog. It opens with a footnote and
> summarizes certain aspects of the implementation whose importance to
> the reader aren't entirely clear at this time.
> 
> It would be better to start with a high-level overview of the problem
> and how this algorithm solves it. How the reclaim algorithm needs to
> find the page that is most suitable for eviction and to signal when
> it's time to give up and OOM. Then explain how grouping pages into
> multiple generations accomplishes that - in particular compared to the
> current two use-once/use-many lists.

Hi Johannes,

Thanks for reviewing!

I suspect the information you are looking for might have been in the
patchset but is scattered in a few places. Could you please glance at
the following pieces and let me know
  1. whether they cover some of the points you asked for
  2. and if so, whether there is a better order/place to present them?

The previous patch gives a quick overview of the architecture:
https://lore.kernel.org/linux-mm/20220208081902.3550911-5-yuzhao@google.com/

  Evictable pages are divided into multiple generations for each lruvec.
  The youngest generation number is stored in lrugen->max_seq for both
  anon and file types as they're aged on an equal footing. The oldest
  generation numbers are stored in lrugen->min_seq[] separately for anon
  and file types as clean file pages can be evicted regardless of swap
  constraints. These three variables are monotonically increasing.
  
  Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
  in order to fit into the gen counter in folio->flags. Each truncated
  generation number is an index to lrugen->lists[]. The sliding window
  technique is used to track at least MIN_NR_GENS and at most
  MAX_NR_GENS generations. The gen counter stores (seq%MAX_NR_GENS)+1
  while a page is on one of lrugen->lists[]. Otherwise it stores 0.
  
  There are two conceptually independent processes (as in the
  manufacturing process): "the aging", which produces young generations,
  and "the eviction", which consumes old generations. They form a
  closed-loop system, i.e., "the page reclaim". Both processes can be
  invoked from userspace for the purposes of working set estimation and
  proactive reclaim. These features are required to optimize job
  scheduling (bin packing) in data centers. The variable size of the
  sliding window is designed for such use cases...
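
(As an aside, the counter arithmetic above boils down to something like
the plain C below -- just a model using the default of 4 generations;
lru_gen_from_seq() exists in the series, the second helper is made up
here:)

  #define MAX_NR_GENS 4UL

  /* index into lrugen->lists[] */
  static unsigned long lru_gen_from_seq(unsigned long seq)
  {
          return seq % MAX_NR_GENS;
  }

  /* gen counter in folio->flags; 0 is reserved for "not on any list" */
  static unsigned long gen_counter_from_seq(unsigned long seq)
  {
          return lru_gen_from_seq(seq) + 1;
  }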

And the design doc contains a bit more details, and I'd be happy to
present it earlier, if you think doing so would help.
https://lore.kernel.org/linux-mm/20220208081902.3550911-13-yuzhao@google.com/

> Explain the problem of MMU vs syscall references, and how tiering
> addresses this.

The previous patch also touched on this point:
https://lore.kernel.org/linux-mm/20220208081902.3550911-5-yuzhao@google.com/

  The protection of hot pages and the selection of cold pages are based
  on page access channels and patterns. There are two access channels:
  one through page tables and the other through file descriptors. The
  protection of the former channel is by design stronger because:
  1) The uncertainty in determining the access patterns of the former
     channel is higher due to the approximation of the accessed bit.
  2) The cost of evicting the former channel is higher due to the TLB
     flushes required and the likelihood of encountering the dirty bit.
  3) The penalty of underprotecting the former channel is higher because
     applications usually don't prepare themselves for major page faults
     like they do for blocked I/O. E.g., GUI applications commonly use
     dedicated I/O threads to avoid blocking the rendering threads.
  There are also two access patterns: one with temporal locality and the
  other without. For the reasons listed above, the former channel is
  assumed to follow the former pattern unless VM_SEQ_READ or
  VM_RAND_READ is present, and the latter channel is assumed to follow
  the latter pattern unless outlying refaults have been observed.

> Explain the significance of refaults and how the algorithm responds to
> them. Not in terms of which running averages are updated, but in terms
> of user-visible behavior ("will start swapping (more)" etc.)

And this patch touched on how tiers would help:
  1) It removes the cost of activation in the buffered access path by
     inferring whether pages accessed multiple times through file
     descriptors are statistically hot and thus worth promoting in the
     eviction path.
  2) It takes pages accessed through page tables into account and avoids
     overprotecting pages accessed multiple times through file
     descriptors. (Pages accessed through page tables are in the first
     tier since N=0.)
  3) More tiers provide better protection for pages accessed more than
     twice through file descriptors, when under heavy buffered I/O
     workloads.

And the design doc:
https://lore.kernel.org/linux-mm/20220208081902.3550911-13-yuzhao@google.com/

  To select a type and a tier to evict from, it first compares min_seq[]
  to select the older type. If they are equal, it selects the type whose
  first tier has a lower refault percentage. The first tier contains
  single-use unmapped clean pages, which are the best bet.

> Express *intent*, how it's supposed to behave wrt workloads and memory
> pressure. The code itself will explain the how, its complexity etc.

Hmm... I'm not so sure about this part. It seems to me this is
equivalent to describing how it works.

> Most reviewers will understand the fundamental challenges of page
> reclaim. The difficulty is matching individual aspects of the problem
> space to your individual components and design choices you have made.
> 
> Let us in on that thinking, please ;)

Agreed. I'm sure I haven't covered everything. So I'm trying to figure
out what's important but missing/insufficient.

> > @@ -892,6 +892,50 @@ config ANON_VMA_NAME
> >  	  area from being merged with adjacent virtual memory areas due to the
> >  	  difference in their name.
> >  
> > +# multigenerational LRU {
> > +config LRU_GEN
> > +	bool "Multigenerational LRU"
> > +	depends on MMU
> > +	# the following options can use up the spare bits in page flags
> > +	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
> > +	help
> > +	  A high performance LRU implementation for memory overcommit. See
> > +	  Documentation/admin-guide/mm/multigen_lru.rst and
> > +	  Documentation/vm/multigen_lru.rst for details.
> 
> These files don't exist at this time, please introduce them before or
> when referencing them. If they document things introduced later in the
> patchset, please start with a minimal version of the file and update
> it as you extend the algorithm and add optimizations etc.
> 
> It's really important to only reference previous patches, not later
> ones. This allows reviewers to read the patches linearly.  Having to
> search for missing pieces in patches you haven't looked at yet is bad.

Okay, will remove this bit from this patch.

> > +config NR_LRU_GENS
> > +	int "Max number of generations"
> > +	depends on LRU_GEN
> > +	range 4 31
> > +	default 4
> > +	help
> > +	  Do not increase this value unless you plan to use working set
> > +	  estimation and proactive reclaim to optimize job scheduling in data
> > +	  centers.
> > +
> > +	  This option uses order_base_2(N+1) bits in page flags.
> > +
> > +config TIERS_PER_GEN
> > +	int "Number of tiers per generation"
> > +	depends on LRU_GEN
> > +	range 2 4
> > +	default 4
> > +	help
> > +	  Do not decrease this value unless you run out of spare bits in page
> > +	  flags, i.e., you see the "Not enough bits in page flags" build error.
> > +
> > +	  This option uses N-2 bits in page flags.
> 
> Linus had pointed out that we shouldn't ask these questions of the
> user. How do you pick numbers here? I'm familiar with workingset
> estimation and proactive reclaim usecases but I wouldn't know.
> 
> Even if we removed the config option and hardcoded the number, this is
> a question for kernel developers: What does "4" mean? How would
> behavior differ if it were 3 or 5 instead? Presumably there is some
> sort of behavior gradient. "As you increase the number of
> generations/tiers, the user-visible behavior of the kernel will..."
> This should really be documented.
> 
> I'd also reiterate Mel's point: Distribution kernels need to support
> the full spectrum of applications and production environments. Unless
> using non-defaults it's an extremely niche usecase (like compiling out
> BUG() calls) compile-time options are not the right choice. If we do
> need a tunable, it could make more sense to have a compile time upper
> limit (to determine page flag space) combined with a runtime knob?

I agree, and I think only time can answer all these questions :)

This effort is not in its final stage but at its very beginning. More
experiments and wider adoption are required to see how it's going to
evolve or where it leads. For now, there is just no way to tell whether
those values make sense for the majority or whether we need runtime knobs.

These are valid concerns, but TBH, I think they are minor ones because
most users need not worry about them -- this patchset has been used
in several downstream kernels and I haven't heard any complaints about
those options/values:
https://lore.kernel.org/linux-mm/20220208081902.3550911-1-yuzhao@google.com/

1. Android ARCVM
2. Arch Linux Zen
3. Chrome OS
4. Liquorix
5. post-factum
6. XanMod

Then why do we need these options? Because there are always exceptions,
as stated in the descriptions of those options. Sometimes we just can't
decide everything for users -- the answers lie in their use cases. The
bottom line is, if this starts bothering people or gets in somebody's
way, I'd be glad to revisit. Fair enough?

Thanks!
Hillf Danton Feb. 13, 2022, 10:04 a.m. UTC | #4
Hello Yu

On Tue,  8 Feb 2022 01:18:55 -0700 Yu Zhao wrote:
> +
> +/******************************************************************************
> + *                          the aging
> + ******************************************************************************/
> +
> +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +	unsigned long old_flags, new_flags;
> +	int type = folio_is_file_lru(folio);
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> +
> +	do {
> +		new_flags = old_flags = READ_ONCE(folio->flags);
> +		VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
> +
> +		new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;

Is the chance zero for deadloop if new_gen != old_gen?

> +		new_gen = (old_gen + 1) % MAX_NR_GENS;
> +
> +		new_flags &= ~LRU_GEN_MASK;
> +		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
> +		new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
> +		/* for folio_end_writeback() */

		/* for folio_end_writeback() and sort_folio() */ in terms of
reclaiming?

> +		if (reclaiming)
> +			new_flags |= BIT(PG_reclaim);
> +	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
> +
> +	lru_gen_balance_size(lruvec, folio, old_gen, new_gen);
> +
> +	return new_gen;
> +}

...

> +/******************************************************************************
> + *                          the eviction
> + ******************************************************************************/
> +
> +static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
> +{

Nit, the 80-column-char format is preferred.

> +	bool success;
> +	int gen = folio_lru_gen(folio);
> +	int type = folio_is_file_lru(folio);
> +	int zone = folio_zonenum(folio);
> +	int tier = folio_lru_tier(folio);
> +	int delta = folio_nr_pages(folio);
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +	VM_BUG_ON_FOLIO(gen >= MAX_NR_GENS, folio);
> +
> +	if (!folio_evictable(folio)) {
> +		success = lru_gen_del_folio(lruvec, folio, true);
> +		VM_BUG_ON_FOLIO(!success, folio);
> +		folio_set_unevictable(folio);
> +		lruvec_add_folio(lruvec, folio);
> +		__count_vm_events(UNEVICTABLE_PGCULLED, delta);
> +		return true;
> +	}
> +
> +	if (type && folio_test_anon(folio) && folio_test_dirty(folio)) {
> +		success = lru_gen_del_folio(lruvec, folio, true);
> +		VM_BUG_ON_FOLIO(!success, folio);
> +		folio_set_swapbacked(folio);
> +		lruvec_add_folio_tail(lruvec, folio);
> +		return true;
> +	}
> +
> +	if (tier > tier_idx) {
> +		int hist = lru_hist_from_seq(lrugen->min_seq[type]);
> +
> +		gen = folio_inc_gen(lruvec, folio, false);
> +		list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
> +
> +		WRITE_ONCE(lrugen->promoted[hist][type][tier - 1],
> +			   lrugen->promoted[hist][type][tier - 1] + delta);
> +		__mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
> +		return true;
> +	}
> +
> +	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> +	    (type && folio_test_dirty(folio))) {
> +		gen = folio_inc_gen(lruvec, folio, true);
> +		list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
> +		return true;

Make the cold dirty page cache younger instead of writing it out in the
background reclaimer context, and the question arising is whether laundering
is deferred until the flusher threads are woken up in the following patches.

> +	}
> +
> +	return false;
> +}

Hillf
Yu Zhao Feb. 17, 2022, 12:13 a.m. UTC | #5
On Sun, Feb 13, 2022 at 06:04:17PM +0800, Hillf Danton wrote:

Hi Hillf,

> On Tue,  8 Feb 2022 01:18:55 -0700 Yu Zhao wrote:
> > +
> > +/******************************************************************************
> > + *                          the aging
> > + ******************************************************************************/
> > +
> > +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > +{
> > +	unsigned long old_flags, new_flags;
> > +	int type = folio_is_file_lru(folio);
> > +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > +	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> > +
> > +	do {
> > +		new_flags = old_flags = READ_ONCE(folio->flags);
> > +		VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
> > +
> > +		new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> 
> Is the chance zero for deadloop if new_gen != old_gen?

No, because the counter is only cleared during isolation, and here
it's protected against isolation (under the LRU lock, which is asserted
in the lru_gen_balance_size() -> lru_gen_update_size() path).

> > +		new_gen = (old_gen + 1) % MAX_NR_GENS;
> > +
> > +		new_flags &= ~LRU_GEN_MASK;
> > +		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
> > +		new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
> > +		/* for folio_end_writeback() */
> 
> 		/* for folio_end_writeback() and sort_folio() */ in terms of
> reclaiming?

Right.

> > +		if (reclaiming)
> > +			new_flags |= BIT(PG_reclaim);
> > +	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
> > +
> > +	lru_gen_balance_size(lruvec, folio, old_gen, new_gen);
> > +
> > +	return new_gen;
> > +}
> 
> ...
> 
> > +/******************************************************************************
> > + *                          the eviction
> > + ******************************************************************************/
> > +
> > +static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
> > +{
> 
> Nit, the 80-column-char format is prefered.

Will do.

> > +	bool success;
> > +	int gen = folio_lru_gen(folio);
> > +	int type = folio_is_file_lru(folio);
> > +	int zone = folio_zonenum(folio);
> > +	int tier = folio_lru_tier(folio);
> > +	int delta = folio_nr_pages(folio);
> > +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > +
> > +	VM_BUG_ON_FOLIO(gen >= MAX_NR_GENS, folio);
> > +
> > +	if (!folio_evictable(folio)) {
> > +		success = lru_gen_del_folio(lruvec, folio, true);
> > +		VM_BUG_ON_FOLIO(!success, folio);
> > +		folio_set_unevictable(folio);
> > +		lruvec_add_folio(lruvec, folio);
> > +		__count_vm_events(UNEVICTABLE_PGCULLED, delta);
> > +		return true;
> > +	}
> > +
> > +	if (type && folio_test_anon(folio) && folio_test_dirty(folio)) {
> > +		success = lru_gen_del_folio(lruvec, folio, true);
> > +		VM_BUG_ON_FOLIO(!success, folio);
> > +		folio_set_swapbacked(folio);
> > +		lruvec_add_folio_tail(lruvec, folio);
> > +		return true;
> > +	}
> > +
> > +	if (tier > tier_idx) {
> > +		int hist = lru_hist_from_seq(lrugen->min_seq[type]);
> > +
> > +		gen = folio_inc_gen(lruvec, folio, false);
> > +		list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
> > +
> > +		WRITE_ONCE(lrugen->promoted[hist][type][tier - 1],
> > +			   lrugen->promoted[hist][type][tier - 1] + delta);
> > +		__mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
> > +		return true;
> > +	}
> > +
> > +	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > +	    (type && folio_test_dirty(folio))) {
> > +		gen = folio_inc_gen(lruvec, folio, true);
> > +		list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
> > +		return true;
> 
> Make the cold dirty page cache younger instead of writeout in the backgroungd
> reclaimer context, and the question rising is if laundry is defered until the
> flusher threads are waken up in the following patches.

This is a good point. In contrast to the active/inactive LRU, MGLRU
doesn't write out dirty file pages (kswapd or direct reclaimers) --
this is writeback's job and it should be better at doing this. In
fact, commit 21b4ee7029 ("xfs: drop ->writepage completely") has
disabled dirty file page writeouts in the reclaim path completely.

Reclaim indirectly wakes up writeback after clean file pages drop
below a threshold (dirty ratio). However, dirty pages might be
undercounted on a system that uses a large number of mmapped file pages.
MGLRU optimizes this by calling folio_mark_dirty() on pages mapped
by dirty PTEs when scanning page tables. (Why not since it's already
looking at the accessed bit.)

The commit above explained this design choice from the performance
aspect. From the implementation aspect, it also creates a boundary
between reclaim and writeback. This simplifies things, e.g., the
PageWriteback() check in shrink_page_list is no longer relevant for
MGLRU, nor is the top half of the PageDirty() check.
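
Roughly (a sketch only, not the code in this series; the page table
walker and the exact dirty/swap-backed checks are omitted):

  /* while scanning a PTE range for the accessed bit */
  if (pte_dirty(pte) && !folio_test_dirty(folio))
          folio_mark_dirty(folio);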
Huang, Ying Feb. 23, 2022, 8:27 a.m. UTC | #6
Hi, Yu,

Yu Zhao <yuzhao@google.com> writes:

> To avoid confusions, the terms "promotion" and "demotion" will be
> applied to the multigenerational LRU, as a new convention; the terms
> "activation" and "deactivation" will be applied to the active/inactive
> LRU, as usual.

In the memory tiering related commits and patchset, for example as follows,

commit 668e4147d8850df32ca41e28f52c146025ca45c6
Author: Yang Shi <yang.shi@linux.alibaba.com>
Date:   Thu Sep 2 14:59:19 2021 -0700

    mm/vmscan: add page demotion counter

https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/

"demote" and "promote" are used for migrating pages between different
types of memory.  Is it better for us to avoid overloading these words
too much, to avoid possible confusion?

> +static int get_swappiness(struct mem_cgroup *memcg)
> +{
> +	return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
> +	       mem_cgroup_swappiness(memcg) : 0;
> +}

After we introduced demotion support in the Linux kernel, anonymous
pages in the fast memory node can be demoted to the slow memory node
via the page reclaiming mechanism, as in the following commit.  Can you
consider that too?

commit a2a36488a61cefe3129295c6e75b3987b9d7fd13
Author: Keith Busch <kbusch@kernel.org>
Date:   Thu Sep 2 14:59:26 2021 -0700

    mm/vmscan: Consider anonymous pages without swap
    
    Reclaim anonymous pages if a migration path is available now that demotion
    provides a non-swap recourse for reclaiming anon pages.
    
    Note that this check is subtly different from the can_age_anon_pages()
    checks.  This mechanism checks whether a specific page in a specific
    context can actually be reclaimed, given current swap space and cgroup
    limits.
    
    can_age_anon_pages() is a much simpler and more preliminary check which
    just says whether there is a possibility of future reclaim.


Best Regards,
Huang, Ying
Yu Zhao Feb. 23, 2022, 9:36 a.m. UTC | #7
On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Yu,
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > To avoid confusions, the terms "promotion" and "demotion" will be
> > applied to the multigenerational LRU, as a new convention; the terms
> > "activation" and "deactivation" will be applied to the active/inactive
> > LRU, as usual.
>
> In the memory tiering related commits and patchset, for example as follows,
>
> commit 668e4147d8850df32ca41e28f52c146025ca45c6
> Author: Yang Shi <yang.shi@linux.alibaba.com>
> Date:   Thu Sep 2 14:59:19 2021 -0700
>
>     mm/vmscan: add page demotion counter
>
> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
>
> "demote" and "promote" is used for migrating pages between different
> types of memory.  Is it better for us to avoid overloading these words
> too much to avoid the possible confusion?

Given that LRU and migration are usually different contexts, I think
we'd be fine, unless we want a third pair of terms.

> > +static int get_swappiness(struct mem_cgroup *memcg)
> > +{
> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
> > +            mem_cgroup_swappiness(memcg) : 0;
> > +}
>
> After we introduced demotion support in Linux kernel.  The anonymous
> pages in the fast memory node could be demoted to the slow memory node
> via the page reclaiming mechanism as in the following commit.  Can you
> consider that too?

Sure. How do I check whether there is still space on the slow node?
Huang, Ying Feb. 24, 2022, 12:59 a.m. UTC | #8
Yu Zhao <yuzhao@google.com> writes:

> On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Hi, Yu,
>>
>> Yu Zhao <yuzhao@google.com> writes:
>>
>> > To avoid confusions, the terms "promotion" and "demotion" will be
>> > applied to the multigenerational LRU, as a new convention; the terms
>> > "activation" and "deactivation" will be applied to the active/inactive
>> > LRU, as usual.
>>
>> In the memory tiering related commits and patchset, for example as follows,
>>
>> commit 668e4147d8850df32ca41e28f52c146025ca45c6
>> Author: Yang Shi <yang.shi@linux.alibaba.com>
>> Date:   Thu Sep 2 14:59:19 2021 -0700
>>
>>     mm/vmscan: add page demotion counter
>>
>> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
>>
>> "demote" and "promote" is used for migrating pages between different
>> types of memory.  Is it better for us to avoid overloading these words
>> too much to avoid the possible confusion?
>
> Given that LRU and migration are usually different contexts, I think
> we'd be fine, unless we want a third pair of terms.

This was true before memory tiering was introduced.  In systems with
multiple types of memory (called memory tiering), the LRU is used to
identify pages to be migrated to the slow memory node.  Please take a
look at can_demote(), which is called in shrink_page_list().

>> > +static int get_swappiness(struct mem_cgroup *memcg)
>> > +{
>> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
>> > +            mem_cgroup_swappiness(memcg) : 0;
>> > +}
>>
>> After we introduced demotion support in Linux kernel.  The anonymous
>> pages in the fast memory node could be demoted to the slow memory node
>> via the page reclaiming mechanism as in the following commit.  Can you
>> consider that too?
>
> Sure. How do I check whether there is still space on the slow node?

You can always check the watermark of the slow node.  But for now we
don't actually check that (see demote_page_list()); instead we wake up
kswapd of the slow node.  The intended behavior is something
like,

  DRAM -> PMEM -> disk

Best Regards,
Huang, Ying
Yu Zhao Feb. 24, 2022, 1:34 a.m. UTC | #9
On Wed, Feb 23, 2022 at 5:59 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Hi, Yu,
> >>
> >> Yu Zhao <yuzhao@google.com> writes:
> >>
> >> > To avoid confusions, the terms "promotion" and "demotion" will be
> >> > applied to the multigenerational LRU, as a new convention; the terms
> >> > "activation" and "deactivation" will be applied to the active/inactive
> >> > LRU, as usual.
> >>
> >> In the memory tiering related commits and patchset, for example as follows,
> >>
> >> commit 668e4147d8850df32ca41e28f52c146025ca45c6
> >> Author: Yang Shi <yang.shi@linux.alibaba.com>
> >> Date:   Thu Sep 2 14:59:19 2021 -0700
> >>
> >>     mm/vmscan: add page demotion counter
> >>
> >> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
> >>
> >> "demote" and "promote" is used for migrating pages between different
> >> types of memory.  Is it better for us to avoid overloading these words
> >> too much to avoid the possible confusion?
> >
> > Given that LRU and migration are usually different contexts, I think
> > we'd be fine, unless we want a third pair of terms.
>
> This is true before memory tiering is introduced.  In systems with
> multiple types memory (called memory tiering), LRU is used to identify
> pages to be migrated to the slow memory node.  Please take a look at
> can_demote(), which is called in shrink_page_list().

This sounds like two clearly separate contexts to me: promotion/demotion
(moving between generations) while pages are on the LRU, versus
promotion/demotion (migration between nodes) after pages are taken off
the LRU.

Note that promotion/demotion are not used in function names. They are
used to describe how MGLRU works, in comparison with the
active/inactive LRU. Memory tiering is not within this context.

> >> > +static int get_swappiness(struct mem_cgroup *memcg)
> >> > +{
> >> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
> >> > +            mem_cgroup_swappiness(memcg) : 0;
> >> > +}
> >>
> >> After we introduced demotion support in Linux kernel.  The anonymous
> >> pages in the fast memory node could be demoted to the slow memory node
> >> via the page reclaiming mechanism as in the following commit.  Can you
> >> consider that too?
> >
> > Sure. How do I check whether there is still space on the slow node?
>
> You can always check the watermark of the slow node.  But now, we
> actually don't check that (as in demote_page_list()), instead we will
> wake up kswapd of the slow node.  The intended behavior is something
> like,
>
>   DRAM -> PMEM -> disk

I'll look into this later -- for now, it's a low priority because
there isn't much demand. I'll bump it up if anybody is interested in
giving it a try. Meanwhile, please feel free to cook up something if
you are interested.
Huang, Ying Feb. 24, 2022, 3:31 a.m. UTC | #10
Yu Zhao <yuzhao@google.com> writes:

> On Wed, Feb 23, 2022 at 5:59 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yu Zhao <yuzhao@google.com> writes:
>>
>> > On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Hi, Yu,
>> >>
>> >> Yu Zhao <yuzhao@google.com> writes:
>> >>
>> >> > To avoid confusions, the terms "promotion" and "demotion" will be
>> >> > applied to the multigenerational LRU, as a new convention; the terms
>> >> > "activation" and "deactivation" will be applied to the active/inactive
>> >> > LRU, as usual.
>> >>
>> >> In the memory tiering related commits and patchset, for example as follows,
>> >>
>> >> commit 668e4147d8850df32ca41e28f52c146025ca45c6
>> >> Author: Yang Shi <yang.shi@linux.alibaba.com>
>> >> Date:   Thu Sep 2 14:59:19 2021 -0700
>> >>
>> >>     mm/vmscan: add page demotion counter
>> >>
>> >> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
>> >>
>> >> "demote" and "promote" is used for migrating pages between different
>> >> types of memory.  Is it better for us to avoid overloading these words
>> >> too much to avoid the possible confusion?
>> >
>> > Given that LRU and migration are usually different contexts, I think
>> > we'd be fine, unless we want a third pair of terms.
>>
>> This is true before memory tiering is introduced.  In systems with
>> multiple types memory (called memory tiering), LRU is used to identify
>> pages to be migrated to the slow memory node.  Please take a look at
>> can_demote(), which is called in shrink_page_list().
>
> This sounds clearly two contexts to me. Promotion/demotion (move
> between generations) while pages are on LRU; or promotion/demotion
> (migration between nodes) after pages are taken off LRU.
>
> Note that promotion/demotion are not used in function names. They are
> used to describe how MGLRU works, in comparison with the
> active/inactive LRU. Memory tiering is not within this context.

Because we have used pgdemote_* in /proc/vmstat and "demotion_enabled" in
/sys/kernel/mm/numa, and will use pgpromote_* in /proc/vmstat, it seems
better to avoid using promote/demote directly for MGLRU in the ABI.  A
possible solution is to use "mglru" and "promote/demote" together (such
as "mglru_promote_*") when it is needed?

>> >> > +static int get_swappiness(struct mem_cgroup *memcg)
>> >> > +{
>> >> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
>> >> > +            mem_cgroup_swappiness(memcg) : 0;
>> >> > +}
>> >>
>> >> After we introduced demotion support in Linux kernel.  The anonymous
>> >> pages in the fast memory node could be demoted to the slow memory node
>> >> via the page reclaiming mechanism as in the following commit.  Can you
>> >> consider that too?
>> >
>> > Sure. How do I check whether there is still space on the slow node?
>>
>> You can always check the watermark of the slow node.  But now, we
>> actually don't check that (as in demote_page_list()), instead we will
>> wake up kswapd of the slow node.  The intended behavior is something
>> like,
>>
>>   DRAM -> PMEM -> disk
>
> I'll look into this later -- for now, it's a low priority because
> there isn't much demand. I'll bump it up if anybody is interested in
> giving it a try. Meanwhile, please feel free to cook up something if
> you are interested.

When we introduce a new feature, we shouldn't break an existing one.
That is, we shouldn't introduce regressions.  I think that is a rule?

If my understanding is correct, MGLRU will not scan the anonymous page
list even if there is a demotion target for the node.  This breaks the
demotion feature in the upstream kernel.  Right?
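
For example, the anon scan gate could take demotion into account, along
the lines of the untested sketch below (can_demote() is the existing
helper from the demotion work; the rest of the plumbing is hand-waved):

  static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
  {
          struct mem_cgroup *memcg = lruvec_memcg(lruvec);
          struct pglist_data *pgdat = lruvec_pgdat(lruvec);

          /* anon is reclaimable if it can be swapped *or* demoted */
          if (!can_demote(pgdat->node_id, sc) &&
              mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
                  return 0;

          return mem_cgroup_swappiness(memcg);
  }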

It's a new feature to check whether there is still space on the slow
node.  We can look at that later.

Best Regards,
Huang, Ying
Yu Zhao Feb. 24, 2022, 4:09 a.m. UTC | #11
On Wed, Feb 23, 2022 at 8:32 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > On Wed, Feb 23, 2022 at 5:59 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yu Zhao <yuzhao@google.com> writes:
> >>
> >> > On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Hi, Yu,
> >> >>
> >> >> Yu Zhao <yuzhao@google.com> writes:
> >> >>
> >> >> > To avoid confusions, the terms "promotion" and "demotion" will be
> >> >> > applied to the multigenerational LRU, as a new convention; the terms
> >> >> > "activation" and "deactivation" will be applied to the active/inactive
> >> >> > LRU, as usual.
> >> >>
> >> >> In the memory tiering related commits and patchset, for example as follows,
> >> >>
> >> >> commit 668e4147d8850df32ca41e28f52c146025ca45c6
> >> >> Author: Yang Shi <yang.shi@linux.alibaba.com>
> >> >> Date:   Thu Sep 2 14:59:19 2021 -0700
> >> >>
> >> >>     mm/vmscan: add page demotion counter
> >> >>
> >> >> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
> >> >>
> >> >> "demote" and "promote" is used for migrating pages between different
> >> >> types of memory.  Is it better for us to avoid overloading these words
> >> >> too much to avoid the possible confusion?
> >> >
> >> > Given that LRU and migration are usually different contexts, I think
> >> > we'd be fine, unless we want a third pair of terms.
> >>
> >> This is true before memory tiering is introduced.  In systems with
> >> multiple types memory (called memory tiering), LRU is used to identify
> >> pages to be migrated to the slow memory node.  Please take a look at
> >> can_demote(), which is called in shrink_page_list().
> >
> > This sounds clearly two contexts to me. Promotion/demotion (move
> > between generations) while pages are on LRU; or promotion/demotion
> > (migration between nodes) after pages are taken off LRU.
> >
> > Note that promotion/demotion are not used in function names. They are
> > used to describe how MGLRU works, in comparison with the
> > active/inactive LRU. Memory tiering is not within this context.
>
> Because we have used pgdemote_* in /proc/vmstat, "demotion_enabled" in
> /sys/kernel/mm/numa, and will use pgpromote_* in /proc/vmstat.  It seems
> better to avoid to use promote/demote directly for MGLRU in ABI.  A
> possible solution is to use "mglru" and "promote/demote" together (such
> as "mglru_promote_*" when it is needed?

*If* it is needed. Currently there are no such plans.

> >> >> > +static int get_swappiness(struct mem_cgroup *memcg)
> >> >> > +{
> >> >> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
> >> >> > +            mem_cgroup_swappiness(memcg) : 0;
> >> >> > +}
> >> >>
> >> >> After we introduced demotion support in Linux kernel.  The anonymous
> >> >> pages in the fast memory node could be demoted to the slow memory node
> >> >> via the page reclaiming mechanism as in the following commit.  Can you
> >> >> consider that too?
> >> >
> >> > Sure. How do I check whether there is still space on the slow node?
> >>
> >> You can always check the watermark of the slow node.  But now, we
> >> actually don't check that (as in demote_page_list()), instead we will
> >> wake up kswapd of the slow node.  The intended behavior is something
> >> like,
> >>
> >>   DRAM -> PMEM -> disk
> >
> > I'll look into this later -- for now, it's a low priority because
> > there isn't much demand. I'll bump it up if anybody is interested in
> > giving it a try. Meanwhile, please feel free to cook up something if
> > you are interested.
>
> When we introduce a new feature, we shouldn't break an existing one.
> That is, not introducing regression.  I think that it is a rule?
>
> If my understanding were correct, MGLRU will ignore to scan anonymous
> page list even if there's demotion target for the node.  This breaks the
> demotion feature in the upstream kernel.  Right?

I'm not saying this shouldn't be fixed. I'm saying it's a low priority
until somebody is interested in using/testing it (or making it work).

Regarding regressions, I'm sure MGLRU *will* regress many workloads.
Its goal is to improve the majority of use cases, i.e., total net
gain. Trying to improve everything is methodically wrong because the
problem space is near infinite but the resource is limited. So we have
to prioritize major use cases over minor ones. The bottom line is
users have a choice not to use MGLRU.

> It's a new feature to check whether there is still space on the slow
> node.  We can look at that later.

SGTM.
Huang, Ying Feb. 24, 2022, 5:27 a.m. UTC | #12
Yu Zhao <yuzhao@google.com> writes:

> On Wed, Feb 23, 2022 at 8:32 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yu Zhao <yuzhao@google.com> writes:
>>
>> > On Wed, Feb 23, 2022 at 5:59 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yu Zhao <yuzhao@google.com> writes:
>> >>
>> >> > On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Hi, Yu,
>> >> >>
>> >> >> Yu Zhao <yuzhao@google.com> writes:
>> >> >>
>> >> >> > To avoid confusions, the terms "promotion" and "demotion" will be
>> >> >> > applied to the multigenerational LRU, as a new convention; the terms
>> >> >> > "activation" and "deactivation" will be applied to the active/inactive
>> >> >> > LRU, as usual.
>> >> >>
>> >> >> In the memory tiering related commits and patchset, for example as follows,
>> >> >>
>> >> >> commit 668e4147d8850df32ca41e28f52c146025ca45c6
>> >> >> Author: Yang Shi <yang.shi@linux.alibaba.com>
>> >> >> Date:   Thu Sep 2 14:59:19 2021 -0700
>> >> >>
>> >> >>     mm/vmscan: add page demotion counter
>> >> >>
>> >> >> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
>> >> >>
>> >> >> "demote" and "promote" is used for migrating pages between different
>> >> >> types of memory.  Is it better for us to avoid overloading these words
>> >> >> too much to avoid the possible confusion?
>> >> >
>> >> > Given that LRU and migration are usually different contexts, I think
>> >> > we'd be fine, unless we want a third pair of terms.
>> >>
>> >> This is true before memory tiering is introduced.  In systems with
>> >> multiple types memory (called memory tiering), LRU is used to identify
>> >> pages to be migrated to the slow memory node.  Please take a look at
>> >> can_demote(), which is called in shrink_page_list().
>> >
>> > This sounds clearly two contexts to me. Promotion/demotion (move
>> > between generations) while pages are on LRU; or promotion/demotion
>> > (migration between nodes) after pages are taken off LRU.
>> >
>> > Note that promotion/demotion are not used in function names. They are
>> > used to describe how MGLRU works, in comparison with the
>> > active/inactive LRU. Memory tiering is not within this context.
>>
>> Because we have used pgdemote_* in /proc/vmstat, "demotion_enabled" in
>> /sys/kernel/mm/numa, and will use pgpromote_* in /proc/vmstat.  It seems
>> better to avoid to use promote/demote directly for MGLRU in ABI.  A
>> possible solution is to use "mglru" and "promote/demote" together (such
>> as "mglru_promote_*" when it is needed?
>
> *If* it is needed. Currently there are no such plans.

OK.

>> >> >> > +static int get_swappiness(struct mem_cgroup *memcg)
>> >> >> > +{
>> >> >> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
>> >> >> > +            mem_cgroup_swappiness(memcg) : 0;
>> >> >> > +}
>> >> >>
>> >> >> After we introduced demotion support in Linux kernel.  The anonymous
>> >> >> pages in the fast memory node could be demoted to the slow memory node
>> >> >> via the page reclaiming mechanism as in the following commit.  Can you
>> >> >> consider that too?
>> >> >
>> >> > Sure. How do I check whether there is still space on the slow node?
>> >>
>> >> You can always check the watermark of the slow node.  But now, we
>> >> actually don't check that (as in demote_page_list()), instead we will
>> >> wake up kswapd of the slow node.  The intended behavior is something
>> >> like,
>> >>
>> >>   DRAM -> PMEM -> disk
>> >
>> > I'll look into this later -- for now, it's a low priority because
>> > there isn't much demand. I'll bump it up if anybody is interested in
>> > giving it a try. Meanwhile, please feel free to cook up something if
>> > you are interested.
>>
>> When we introduce a new feature, we shouldn't break an existing one.
>> That is, not introducing regression.  I think that it is a rule?
>>
>> If my understanding were correct, MGLRU will ignore to scan anonymous
>> page list even if there's demotion target for the node.  This breaks the
>> demotion feature in the upstream kernel.  Right?
>
> I'm not saying this shouldn't be fixed. I'm saying it's a low priority
> until somebody is interested in using/testing it (or making it work).

We are interested in this feature and can help to test it.

> Regarding regressions, I'm sure MGLRU *will* regress many workloads.
> Its goal is to improve the majority of use cases, i.e., a total net
> gain. Trying to improve everything is methodologically wrong because
> the problem space is nearly infinite but the resources are limited. So
> we have to prioritize major use cases over minor ones. The bottom line
> is that users have a choice not to use MGLRU.

This is a functionality regression, not a performance regression.  Without
demotion support, some workloads will go OOM when DRAM is used up (while
PMEM isn't) if PMEM is onlined in the movable zone (as recommended).

>> It's a new feature to check whether there is still space on the slow
>> node.  We can look at that later.
>
> SGTM.

Best Regards,
Huang, Ying
Yu Zhao Feb. 24, 2022, 5:35 a.m. UTC | #13
On Wed, Feb 23, 2022 at 10:27 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > On Wed, Feb 23, 2022 at 8:32 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yu Zhao <yuzhao@google.com> writes:
> >>
> >> > On Wed, Feb 23, 2022 at 5:59 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yu Zhao <yuzhao@google.com> writes:
> >> >>
> >> >> > On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Hi, Yu,
> >> >> >>
> >> >> >> Yu Zhao <yuzhao@google.com> writes:
> >> >> >>
> >> >> >> > To avoid confusions, the terms "promotion" and "demotion" will be
> >> >> >> > applied to the multigenerational LRU, as a new convention; the terms
> >> >> >> > "activation" and "deactivation" will be applied to the active/inactive
> >> >> >> > LRU, as usual.
> >> >> >>
> >> >> >> In the memory tiering related commits and patchset, for example as follows,
> >> >> >>
> >> >> >> commit 668e4147d8850df32ca41e28f52c146025ca45c6
> >> >> >> Author: Yang Shi <yang.shi@linux.alibaba.com>
> >> >> >> Date:   Thu Sep 2 14:59:19 2021 -0700
> >> >> >>
> >> >> >>     mm/vmscan: add page demotion counter
> >> >> >>
> >> >> >> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
> >> >> >>
> >> >> >> "demote" and "promote" is used for migrating pages between different
> >> >> >> types of memory.  Is it better for us to avoid overloading these words
> >> >> >> too much to avoid the possible confusion?
> >> >> >
> >> >> > Given that LRU and migration are usually different contexts, I think
> >> >> > we'd be fine, unless we want a third pair of terms.
> >> >>
> >> >> This was true before memory tiering was introduced.  In systems with
> >> >> multiple types of memory (called memory tiering), the LRU is used to
> >> >> identify pages to be migrated to the slow memory node.  Please take a
> >> >> look at can_demote(), which is called in shrink_page_list().
> >> >
> >> > This clearly sounds like two contexts to me. Promotion/demotion (move
> >> > between generations) while pages are on LRU; or promotion/demotion
> >> > (migration between nodes) after pages are taken off LRU.
> >> >
> >> > Note that promotion/demotion are not used in function names. They are
> >> > used to describe how MGLRU works, in comparison with the
> >> > active/inactive LRU. Memory tiering is not within this context.
> >>
> >> Because we have used pgdemote_* in /proc/vmstat and "demotion_enabled" in
> >> /sys/kernel/mm/numa, and will use pgpromote_* in /proc/vmstat, it seems
> >> better to avoid using promote/demote directly for MGLRU in the ABI.  A
> >> possible solution is to use "mglru" and "promote/demote" together (such
> >> as "mglru_promote_*") when it is needed?
> >
> > *If* it is needed. Currently there are no such plans.
>
> OK.
>
> >> >> >> > +static int get_swappiness(struct mem_cgroup *memcg)
> >> >> >> > +{
> >> >> >> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
> >> >> >> > +            mem_cgroup_swappiness(memcg) : 0;
> >> >> >> > +}
> >> >> >>
> >> >> >> After we introduced demotion support in the Linux kernel, anonymous
> >> >> >> pages in the fast memory node can be demoted to the slow memory node
> >> >> >> via the page reclaim mechanism, as in the following commit.  Can you
> >> >> >> consider that too?
> >> >> >
> >> >> > Sure. How do I check whether there is still space on the slow node?
> >> >>
> >> >> You can always check the watermark of the slow node.  But currently we
> >> >> don't actually check that (see demote_page_list()); instead we wake up
> >> >> kswapd on the slow node.  The intended behavior is something like,
> >> >>
> >> >>   DRAM -> PMEM -> disk
> >> >
> >> > I'll look into this later -- for now, it's a low priority because
> >> > there isn't much demand. I'll bump it up if anybody is interested in
> >> > giving it a try. Meanwhile, please feel free to cook up something if
> >> > you are interested.
> >>
> >> When we introduce a new feature, we shouldn't break an existing one.
> >> That is, we shouldn't introduce regressions.  I think that is a rule?
> >>
> >> If my understanding is correct, MGLRU will skip scanning the anonymous
> >> page list even if there is a demotion target for the node.  This breaks
> >> the demotion feature in the upstream kernel.  Right?
> >
> > I'm not saying this shouldn't be fixed. I'm saying it's a low priority
> > until somebody is interested in using/testing it (or making it work).
>
> We are interested in this feature and can help to test it.

That's great. I'll make sure it works in the next version.
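
The swappiness check quoted above is the crux of the demotion concern: with
no swap space, get_swappiness() returns 0 and anon lists are never scanned,
even on a node whose anon pages could be demoted rather than swapped. As a
rough sketch of one possible direction (not part of this series; the changed
signature and the use of can_demote() here are assumptions for illustration
only), the helper could also consult the demotion path:

	/*
	 * Illustrative sketch, not part of this patch: keep anon scanning
	 * enabled when reclaim can demote to a lower memory tier, so lack of
	 * swap alone doesn't force swappiness to 0.
	 */
	static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
	{
		struct mem_cgroup *memcg = lruvec_memcg(lruvec);
		struct pglist_data *pgdat = lruvec_pgdat(lruvec);

		if (!can_demote(pgdat->node_id, sc) &&
		    mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
			return 0;

		return mem_cgroup_swappiness(memcg);
	}
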
diff mbox series

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 05dd33265740..b4b9886ba277 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -227,6 +227,7 @@  int overcommit_policy_handler(struct ctl_table *, int, void *, size_t *,
 #define PAGE_ALIGNED(addr)	IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
 
 #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
+#define lru_to_folio(head) (list_entry((head)->prev, struct folio, lru))
 
 void setup_initial_init_mm(void *start_code, void *end_code,
 			   void *end_data, void *brk);
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 46f4fde0299f..37c8a0ede4ff 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -109,6 +109,19 @@  static inline int lru_gen_from_seq(unsigned long seq)
 	return seq % MAX_NR_GENS;
 }
 
+static inline int lru_hist_from_seq(unsigned long seq)
+{
+	return seq % NR_HIST_GENS;
+}
+
+static inline int lru_tier_from_refs(int refs)
+{
+	VM_BUG_ON(refs > BIT(LRU_REFS_WIDTH));
+
+	/* see the comment on MAX_NR_TIERS */
+	return order_base_2(refs + 1);
+}
+
 static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
 {
 	unsigned long max_seq = lruvec->lrugen.max_seq;
@@ -237,6 +250,8 @@  static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio,
 		gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
 
 		new_flags &= ~LRU_GEN_MASK;
+		if ((new_flags & LRU_REFS_FLAGS) != LRU_REFS_FLAGS)
+			new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
 		/* for shrink_page_list() */
 		if (reclaiming)
 			new_flags &= ~(BIT(PG_referenced) | BIT(PG_reclaim));
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0f5e8a995781..3870dd9246a2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -335,6 +335,32 @@  struct lruvec;
 #define MIN_NR_GENS		2U
 #define MAX_NR_GENS		((unsigned int)CONFIG_NR_LRU_GENS)
 
+/*
+ * Each generation is divided into multiple tiers. Tiers represent different
+ * ranges of numbers of accesses through file descriptors. A page accessed N
+ * times through file descriptors is in tier order_base_2(N). A page in the
+ * first tier (N=0,1) is marked by PG_referenced unless it was faulted in
+ * through page tables or read ahead. A page in any other tier (N>1) is marked
+ * by PG_referenced and PG_workingset. Additional bits in folio->flags are
+ * required to support more than two tiers.
+ *
+ * In contrast to moving across generations which requires the LRU lock, moving
+ * across tiers only requires operations on folio->flags and therefore has a
+ * negligible cost in the buffered access path. In the eviction path,
+ * comparisons of refaulted/(evicted+promoted) from the first tier and the rest
+ * infer whether pages accessed multiple times through file descriptors are
+ * statistically hot and thus worth promoting.
+ */
+#define MAX_NR_TIERS		((unsigned int)CONFIG_TIERS_PER_GEN)
+#define LRU_REFS_FLAGS		(BIT(PG_referenced) | BIT(PG_workingset))
+
+/* whether to keep historical stats from evicted generations */
+#ifdef CONFIG_LRU_GEN_STATS
+#define NR_HIST_GENS		((unsigned int)CONFIG_NR_LRU_GENS)
+#else
+#define NR_HIST_GENS		1U
+#endif
+
 struct lru_gen_struct {
 	/* the aging increments the youngest generation number */
 	unsigned long max_seq;
@@ -346,6 +372,15 @@  struct lru_gen_struct {
 	struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
 	/* the sizes of the above lists */
 	unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* the exponential moving average of refaulted */
+	unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
+	/* the exponential moving average of evicted+promoted */
+	unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS];
+	/* the first tier doesn't need promotion, hence the minus one */
+	unsigned long promoted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1];
+	/* can be modified without holding the LRU lock */
+	atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
+	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 	/* whether the multigenerational LRU is enabled */
 	bool enabled;
 };
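
To make the tier arithmetic in the comment above concrete, the following
standalone sketch (illustrative only, not part of the patch; order_base_2()
is reimplemented in userspace here) prints which tier a page lands in after
N accesses through file descriptors:

	#include <stdio.h>

	/* userspace stand-in for the kernel's order_base_2(): ceil(log2(n)), 0 for n <= 1 */
	static unsigned int order_base_2(unsigned long n)
	{
		unsigned int order = 0;

		while ((1UL << order) < n)
			order++;
		return order;
	}

	/* a page accessed N times through file descriptors sits in tier order_base_2(N) */
	static unsigned int tier_of(unsigned int nr_accesses)
	{
		return nr_accesses < 2 ? 0 : order_base_2(nr_accesses);
	}

	int main(void)
	{
		/* N=0,1 -> tier 0; N=2 -> tier 1; N=3,4 -> tier 2; N=5..8 -> tier 3 */
		for (unsigned int n = 0; n <= 8; n++)
			printf("N=%u -> tier %u\n", n, tier_of(n));
		return 0;
	}
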
diff --git a/mm/Kconfig b/mm/Kconfig
index 3326ee3903f3..e899623d5df0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -892,6 +892,50 @@  config ANON_VMA_NAME
 	  area from being merged with adjacent virtual memory areas due to the
 	  difference in their name.
 
+# multigenerational LRU {
+config LRU_GEN
+	bool "Multigenerational LRU"
+	depends on MMU
+	# the following options can use up the spare bits in page flags
+	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
+	help
+	  A high performance LRU implementation for memory overcommit. See
+	  Documentation/admin-guide/mm/multigen_lru.rst and
+	  Documentation/vm/multigen_lru.rst for details.
+
+config NR_LRU_GENS
+	int "Max number of generations"
+	depends on LRU_GEN
+	range 4 31
+	default 4
+	help
+	  Do not increase this value unless you plan to use working set
+	  estimation and proactive reclaim to optimize job scheduling in data
+	  centers.
+
+	  This option uses order_base_2(N+1) bits in page flags.
+
+config TIERS_PER_GEN
+	int "Number of tiers per generation"
+	depends on LRU_GEN
+	range 2 4
+	default 4
+	help
+	  Do not decrease this value unless you run out of spare bits in page
+	  flags, i.e., you see the "Not enough bits in page flags" build error.
+
+	  This option uses N-2 bits in page flags.
+
+config LRU_GEN_STATS
+	bool "Full stats for debugging"
+	depends on LRU_GEN
+	help
+	  Do not enable this option unless you plan to look at historical stats
+	  from evicted generations for debugging purposes.
+
+	  This option has a per-memcg and per-node memory overhead.
+# }
+
 source "mm/damon/Kconfig"
 
 endmenu
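
As a usage note (not from the patch), a kernel .config fragment that enables
the feature with the defaults above would look something like:

	CONFIG_LRU_GEN=y
	CONFIG_NR_LRU_GENS=4
	CONFIG_TIERS_PER_GEN=4
	# CONFIG_LRU_GEN_STATS is not set
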
diff --git a/mm/swap.c b/mm/swap.c
index e2ef2acccc74..f5c0bcac8dcd 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -407,6 +407,43 @@  static void __lru_cache_activate_folio(struct folio *folio)
 	local_unlock(&lru_pvecs.lock);
 }
 
+#ifdef CONFIG_LRU_GEN
+static void folio_inc_refs(struct folio *folio)
+{
+	unsigned long refs;
+	unsigned long old_flags, new_flags;
+
+	if (folio_test_unevictable(folio))
+		return;
+
+	/* see the comment on MAX_NR_TIERS */
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+
+		if (!(new_flags & BIT(PG_referenced))) {
+			new_flags |= BIT(PG_referenced);
+			continue;
+		}
+
+		if (!(new_flags & BIT(PG_workingset))) {
+			new_flags |= BIT(PG_workingset);
+			continue;
+		}
+
+		refs = new_flags & LRU_REFS_MASK;
+		refs = min(refs + BIT(LRU_REFS_PGOFF), LRU_REFS_MASK);
+
+		new_flags &= ~LRU_REFS_MASK;
+		new_flags |= refs;
+	} while (new_flags != old_flags &&
+		 cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+}
+#else
+static void folio_inc_refs(struct folio *folio)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
 /*
  * Mark a page as having seen activity.
  *
@@ -419,6 +456,11 @@  static void __lru_cache_activate_folio(struct folio *folio)
  */
 void folio_mark_accessed(struct folio *folio)
 {
+	if (lru_gen_enabled()) {
+		folio_inc_refs(folio);
+		return;
+	}
+
 	if (!folio_test_referenced(folio)) {
 		folio_set_referenced(folio);
 	} else if (folio_test_unevictable(folio)) {
@@ -568,7 +610,7 @@  static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 
 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageActive(page) && !PageUnevictable(page)) {
+	if (!PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
 		int nr_pages = thp_nr_pages(page);
 
 		del_page_from_lru_list(page, lruvec);
@@ -682,7 +724,7 @@  void deactivate_file_page(struct page *page)
  */
 void deactivate_page(struct page *page)
 {
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+	if (PageLRU(page) && !PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
 		struct pagevec *pvec;
 
 		local_lock(&lru_pvecs.lock);
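
The access-counting scheme implemented by folio_inc_refs() above can be
summarized with a small standalone sketch (illustrative only; the fields
below are simplified stand-ins for PG_referenced, PG_workingset and the
LRU_REFS counter in folio->flags, and the cmpxchg loop is omitted):

	#include <stdio.h>

	struct fake_folio {
		int referenced;		/* stand-in for PG_referenced */
		int workingset;		/* stand-in for PG_workingset */
		unsigned int refs;	/* stand-in for the LRU_REFS counter */
	};

	#define REFS_MAX 3		/* saturation point of a 2-bit counter */

	static void inc_refs(struct fake_folio *f)
	{
		if (!f->referenced)		/* 1st access through a file descriptor */
			f->referenced = 1;
		else if (!f->workingset)	/* 2nd access */
			f->workingset = 1;
		else if (f->refs < REFS_MAX)	/* 3rd and later accesses, saturating */
			f->refs++;
	}

	int main(void)
	{
		struct fake_folio f = { 0 };

		for (int n = 1; n <= 6; n++) {
			inc_refs(&f);
			printf("access %d: referenced=%d workingset=%d refs=%u\n",
			       n, f.referenced, f.workingset, f.refs);
		}
		return 0;
	}
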
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d75a5738d1dc..5f0d92838712 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1285,9 +1285,11 @@  static int __remove_mapping(struct address_space *mapping, struct page *page,
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
-		mem_cgroup_swapout(page, swap);
+
+		/* get a shadow entry before mem_cgroup_swapout() clears folio_memcg() */
 		if (reclaimed && !mapping_exiting(mapping))
 			shadow = workingset_eviction(page, target_memcg);
+		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page, swap, shadow);
 		xa_unlock_irq(&mapping->i_pages);
 		put_swap_page(page, swap);
@@ -2721,6 +2723,9 @@  static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long file;
 	struct lruvec *target_lruvec;
 
+	if (lru_gen_enabled())
+		return;
+
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
 
 	/*
@@ -3042,15 +3047,47 @@  static bool can_age_anon_pages(struct pglist_data *pgdat,
 
 #ifdef CONFIG_LRU_GEN
 
+enum {
+	TYPE_ANON,
+	TYPE_FILE,
+};
+
 /******************************************************************************
  *                          shorthand helpers
  ******************************************************************************/
 
+#define DEFINE_MAX_SEQ(lruvec)						\
+	unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq)
+
+#define DEFINE_MIN_SEQ(lruvec)						\
+	unsigned long min_seq[ANON_AND_FILE] = {			\
+		READ_ONCE((lruvec)->lrugen.min_seq[TYPE_ANON]),		\
+		READ_ONCE((lruvec)->lrugen.min_seq[TYPE_FILE]),		\
+	}
+
 #define for_each_gen_type_zone(gen, type, zone)				\
 	for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)			\
 		for ((type) = 0; (type) < ANON_AND_FILE; (type)++)	\
 			for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
 
+static int folio_lru_gen(struct folio *folio)
+{
+	unsigned long flags = READ_ONCE(folio->flags);
+
+	return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+}
+
+static int folio_lru_tier(struct folio *folio)
+{
+	int refs;
+	unsigned long flags = READ_ONCE(folio->flags);
+
+	refs = (flags & LRU_REFS_FLAGS) == LRU_REFS_FLAGS ?
+	       ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + 1 : 0;
+
+	return lru_tier_from_refs(refs);
+}
+
 static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
 {
 	struct pglist_data *pgdat = NODE_DATA(nid);
@@ -3069,6 +3106,728 @@  static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
 	return pgdat ? &pgdat->__lruvec : NULL;
 }
 
+static int get_swappiness(struct mem_cgroup *memcg)
+{
+	return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
+	       mem_cgroup_swappiness(memcg) : 0;
+}
+
+static int get_nr_gens(struct lruvec *lruvec, int type)
+{
+	return lruvec->lrugen.max_seq - lruvec->lrugen.min_seq[type] + 1;
+}
+
+static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
+{
+	/*
+	 * Ideally anon and file min_seq should be in sync. But swapping isn't
+	 * as reliable as dropping clean file pages, e.g., out of swap space. So
+	 * allow file min_seq to advance and leave anon min_seq behind, but not
+	 * the other way around.
+	 */
+	return get_nr_gens(lruvec, TYPE_FILE) >= MIN_NR_GENS &&
+	       get_nr_gens(lruvec, TYPE_FILE) <= get_nr_gens(lruvec, TYPE_ANON) &&
+	       get_nr_gens(lruvec, TYPE_ANON) <= MAX_NR_GENS;
+}
+
+/******************************************************************************
+ *                          refault feedback loop
+ ******************************************************************************/
+
+/*
+ * A feedback loop based on Proportional-Integral-Derivative (PID) controller.
+ *
+ * The P term is refaulted/(evicted+promoted) from a tier in the generation
+ * currently being evicted; the I term is the exponential moving average of the
+ * P term over the generations previously evicted, using the smoothing factor
+ * 1/2; the D term isn't supported.
+ *
+ * The setpoint (SP) is always the first tier of one type; the process variable
+ * (PV) is either any tier of the other type or any other tier of the same
+ * type.
+ *
+ * The error is the difference between the SP and the PV; the correction is
+ * to turn off promotion when SP>PV or to turn on promotion when SP<PV.
+ *
+ * For future optimizations:
+ * 1) The D term may discount the other two terms over time so that long-lived
+ *    generations can resist stale information.
+ */
+struct ctrl_pos {
+	unsigned long refaulted;
+	unsigned long total;
+	int gain;
+};
+
+static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
+			  struct ctrl_pos *pos)
+{
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	int hist = lru_hist_from_seq(lrugen->min_seq[type]);
+
+	pos->refaulted = lrugen->avg_refaulted[type][tier] +
+			 atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+	pos->total = lrugen->avg_total[type][tier] +
+		     atomic_long_read(&lrugen->evicted[hist][type][tier]);
+	if (tier)
+		pos->total += lrugen->promoted[hist][type][tier - 1];
+	pos->gain = gain;
+}
+
+static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)
+{
+	int hist, tier;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	bool clear = carryover ? NR_HIST_GENS == 1 : NR_HIST_GENS > 1;
+	unsigned long seq = carryover ? lrugen->min_seq[type] : lrugen->max_seq + 1;
+
+	lockdep_assert_held(&lruvec->lru_lock);
+
+	if (!carryover && !clear)
+		return;
+
+	hist = lru_hist_from_seq(seq);
+
+	for (tier = 0; tier < MAX_NR_TIERS; tier++) {
+		if (carryover) {
+			unsigned long sum;
+
+			sum = lrugen->avg_refaulted[type][tier] +
+			      atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+			WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2);
+
+			sum = lrugen->avg_total[type][tier] +
+			      atomic_long_read(&lrugen->evicted[hist][type][tier]);
+			if (tier)
+				sum += lrugen->promoted[hist][type][tier - 1];
+			WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2);
+		}
+
+		if (clear) {
+			atomic_long_set(&lrugen->refaulted[hist][type][tier], 0);
+			atomic_long_set(&lrugen->evicted[hist][type][tier], 0);
+			if (tier)
+				WRITE_ONCE(lrugen->promoted[hist][type][tier - 1], 0);
+		}
+	}
+}
+
+static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
+{
+	/*
+	 * Return true if the PV has a limited number of refaults or a lower
+	 * refaulted/total than the SP.
+	 */
+	return pv->refaulted < MIN_LRU_BATCH ||
+	       pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <=
+	       (sp->refaulted + 1) * pv->total * pv->gain;
+}
+
+/******************************************************************************
+ *                          the aging
+ ******************************************************************************/
+
+static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+	unsigned long old_flags, new_flags;
+	int type = folio_is_file_lru(folio);
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+		VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
+
+		new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+		new_gen = (old_gen + 1) % MAX_NR_GENS;
+
+		new_flags &= ~LRU_GEN_MASK;
+		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
+		new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
+		/* for folio_end_writeback() */
+		if (reclaiming)
+			new_flags |= BIT(PG_reclaim);
+	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+	lru_gen_balance_size(lruvec, folio, old_gen, new_gen);
+
+	return new_gen;
+}
+
+static void inc_min_seq(struct lruvec *lruvec)
+{
+	int type;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
+			continue;
+
+		reset_ctrl_pos(lruvec, type, true);
+		WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
+	}
+}
+
+static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
+{
+	int gen, type, zone;
+	bool success = false;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	DEFINE_MIN_SEQ(lruvec);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		while (lrugen->max_seq >= min_seq[type] + MIN_NR_GENS) {
+			gen = lru_gen_from_seq(min_seq[type]);
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+				if (!list_empty(&lrugen->lists[gen][type][zone]))
+					goto next;
+			}
+
+			min_seq[type]++;
+		}
+next:
+		;
+	}
+
+	/* see the comment in seq_is_valid() */
+	if (can_swap) {
+		min_seq[TYPE_ANON] = min(min_seq[TYPE_ANON], min_seq[TYPE_FILE]);
+		min_seq[TYPE_FILE] = max(min_seq[TYPE_ANON], lrugen->min_seq[TYPE_FILE]);
+	}
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		if (min_seq[type] == lrugen->min_seq[type])
+			continue;
+
+		reset_ctrl_pos(lruvec, type, true);
+		WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
+		success = true;
+	}
+
+	return success;
+}
+
+static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
+{
+	int prev, next;
+	int type, zone;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	if (max_seq != lrugen->max_seq)
+		goto unlock;
+
+	inc_min_seq(lruvec);
+
+	/* update the active/inactive LRU sizes for compatibility */
+	prev = lru_gen_from_seq(lrugen->max_seq - 1);
+	next = lru_gen_from_seq(lrugen->max_seq + 1);
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+			enum lru_list lru = type * LRU_INACTIVE_FILE;
+			long delta = lrugen->nr_pages[prev][type][zone] -
+				     lrugen->nr_pages[next][type][zone];
+
+			if (!delta)
+				continue;
+
+			lru_gen_update_size(lruvec, lru, zone, delta);
+			lru_gen_update_size(lruvec, lru + LRU_ACTIVE, zone, -delta);
+		}
+	}
+
+	for (type = 0; type < ANON_AND_FILE; type++)
+		reset_ctrl_pos(lruvec, type, false);
+
+	WRITE_ONCE(lrugen->timestamps[next], jiffies);
+	/* make sure preceding modifications appear */
+	smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
+unlock:
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
+			     unsigned long *min_seq, bool can_swap, bool *need_aging)
+{
+	int gen, type, zone;
+	long total = 0;
+	long young = 0;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		unsigned long seq;
+
+		for (seq = min_seq[type]; seq <= max_seq; seq++) {
+			long size = 0;
+
+			gen = lru_gen_from_seq(seq);
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++)
+				size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
+
+			total += size;
+			if (seq == max_seq)
+				young += size;
+		}
+	}
+
+	/* try to spread pages out across MIN_NR_GENS+1 generations */
+	if (max_seq < min_seq[TYPE_FILE] + MIN_NR_GENS)
+		*need_aging = true;
+	else if (max_seq > min_seq[TYPE_FILE] + MIN_NR_GENS)
+		*need_aging = false;
+	else
+		*need_aging = young * MIN_NR_GENS > total;
+
+	return total > 0 ? total : 0;
+}
+
+static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+	bool need_aging;
+	long nr_to_scan;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	int swappiness = get_swappiness(memcg);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	mem_cgroup_calculate_protection(NULL, memcg);
+
+	if (mem_cgroup_below_min(memcg))
+		return;
+
+	nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, swappiness, &need_aging);
+	if (!nr_to_scan)
+		return;
+
+	nr_to_scan >>= sc->priority;
+
+	if (!mem_cgroup_online(memcg))
+		nr_to_scan++;
+
+	if (nr_to_scan && need_aging && (!mem_cgroup_below_low(memcg) || sc->memcg_low_reclaim))
+		inc_max_seq(lruvec, max_seq);
+}
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+	struct mem_cgroup *memcg;
+
+	VM_BUG_ON(!current_is_kswapd());
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+		age_lruvec(lruvec, sc);
+
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+}
+
+/******************************************************************************
+ *                          the eviction
+ ******************************************************************************/
+
+static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
+{
+	bool success;
+	int gen = folio_lru_gen(folio);
+	int type = folio_is_file_lru(folio);
+	int zone = folio_zonenum(folio);
+	int tier = folio_lru_tier(folio);
+	int delta = folio_nr_pages(folio);
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON_FOLIO(gen >= MAX_NR_GENS, folio);
+
+	if (!folio_evictable(folio)) {
+		success = lru_gen_del_folio(lruvec, folio, true);
+		VM_BUG_ON_FOLIO(!success, folio);
+		folio_set_unevictable(folio);
+		lruvec_add_folio(lruvec, folio);
+		__count_vm_events(UNEVICTABLE_PGCULLED, delta);
+		return true;
+	}
+
+	if (type && folio_test_anon(folio) && folio_test_dirty(folio)) {
+		success = lru_gen_del_folio(lruvec, folio, true);
+		VM_BUG_ON_FOLIO(!success, folio);
+		folio_set_swapbacked(folio);
+		lruvec_add_folio_tail(lruvec, folio);
+		return true;
+	}
+
+	if (tier > tier_idx) {
+		int hist = lru_hist_from_seq(lrugen->min_seq[type]);
+
+		gen = folio_inc_gen(lruvec, folio, false);
+		list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
+
+		WRITE_ONCE(lrugen->promoted[hist][type][tier - 1],
+			   lrugen->promoted[hist][type][tier - 1] + delta);
+		__mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+		return true;
+	}
+
+	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
+	    (type && folio_test_dirty(folio))) {
+		gen = folio_inc_gen(lruvec, folio, true);
+		list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
+		return true;
+	}
+
+	return false;
+}
+
+static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc)
+{
+	bool success;
+
+	if (!sc->may_unmap && folio_mapped(folio))
+		return false;
+
+	if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) &&
+	    (folio_test_dirty(folio) ||
+	     (folio_test_anon(folio) && !folio_test_swapcache(folio))))
+		return false;
+
+	if (!folio_try_get(folio))
+		return false;
+
+	if (!folio_test_clear_lru(folio)) {
+		folio_put(folio);
+		return false;
+	}
+
+	success = lru_gen_del_folio(lruvec, folio, true);
+	VM_BUG_ON_FOLIO(!success, folio);
+
+	return true;
+}
+
+static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
+		       int type, int tier, struct list_head *list)
+{
+	int gen, zone;
+	enum vm_event_item item;
+	int sorted = 0;
+	int scanned = 0;
+	int isolated = 0;
+	int remaining = MAX_LRU_BATCH;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	VM_BUG_ON(!list_empty(list));
+
+	if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
+		return 0;
+
+	gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	for (zone = sc->reclaim_idx; zone >= 0; zone--) {
+		LIST_HEAD(moved);
+		int skipped = 0;
+		struct list_head *head = &lrugen->lists[gen][type][zone];
+
+		while (!list_empty(head)) {
+			struct folio *folio = lru_to_folio(head);
+			int delta = folio_nr_pages(folio);
+
+			VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+			VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
+			VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
+			VM_BUG_ON_FOLIO(folio_zonenum(folio) != zone, folio);
+
+			scanned += delta;
+
+			if (sort_folio(lruvec, folio, tier))
+				sorted += delta;
+			else if (isolate_folio(lruvec, folio, sc)) {
+				list_add(&folio->lru, list);
+				isolated += delta;
+			} else {
+				list_move(&folio->lru, &moved);
+				skipped += delta;
+			}
+
+			if (!--remaining || max(isolated, skipped) >= MIN_LRU_BATCH)
+				break;
+		}
+
+		if (skipped) {
+			list_splice(&moved, head);
+			__count_zid_vm_events(PGSCAN_SKIP, zone, skipped);
+		}
+
+		if (!remaining || isolated >= MIN_LRU_BATCH)
+			break;
+	}
+
+	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
+	if (!cgroup_reclaim(sc)) {
+		__count_vm_events(item, isolated);
+		__count_vm_events(PGREFILL, sorted);
+	}
+	__count_memcg_events(memcg, item, isolated);
+	__count_memcg_events(memcg, PGREFILL, sorted);
+	__count_vm_events(PGSCAN_ANON + type, isolated);
+
+	/*
+	 * There might not be eligible pages due to reclaim_idx, may_unmap and
+	 * may_writepage. Check the remaining to prevent livelock if there is no
+	 * progress.
+	 */
+	return isolated || !remaining ? scanned : 0;
+}
+
+static int get_tier_idx(struct lruvec *lruvec, int type)
+{
+	int tier;
+	struct ctrl_pos sp, pv;
+
+	/*
+	 * To leave a margin for fluctuations, use a larger gain factor (1:2).
+	 * This value is chosen because any other tier would have at least twice
+	 * as many refaults as the first tier.
+	 */
+	read_ctrl_pos(lruvec, type, 0, 1, &sp);
+	for (tier = 1; tier < MAX_NR_TIERS; tier++) {
+		read_ctrl_pos(lruvec, type, tier, 2, &pv);
+		if (!positive_ctrl_err(&sp, &pv))
+			break;
+	}
+
+	return tier - 1;
+}
+
+static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx)
+{
+	int type, tier;
+	struct ctrl_pos sp, pv;
+	int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness };
+
+	/*
+	 * Compare the first tier of anon with that of file to determine which
+	 * type to scan. Also need to compare other tiers of the selected type
+	 * with the first tier of the other type to determine the last tier (of
+	 * the selected type) to evict.
+	 */
+	read_ctrl_pos(lruvec, TYPE_ANON, 0, gain[TYPE_ANON], &sp);
+	read_ctrl_pos(lruvec, TYPE_FILE, 0, gain[TYPE_FILE], &pv);
+	type = positive_ctrl_err(&sp, &pv);
+
+	read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp);
+	for (tier = 1; tier < MAX_NR_TIERS; tier++) {
+		read_ctrl_pos(lruvec, type, tier, gain[type], &pv);
+		if (!positive_ctrl_err(&sp, &pv))
+			break;
+	}
+
+	*tier_idx = tier - 1;
+
+	return type;
+}
+
+static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
+			  int *type_scanned, struct list_head *list)
+{
+	int i;
+	int type;
+	int scanned;
+	int tier = -1;
+	DEFINE_MIN_SEQ(lruvec);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	/*
+	 * Try to make the obvious choice first. When anon and file are both
+	 * available from the same generation, interpret swappiness 1 as file
+	 * first and 200 as anon first.
+	 */
+	if (!swappiness)
+		type = TYPE_FILE;
+	else if (min_seq[TYPE_ANON] < min_seq[TYPE_FILE])
+		type = TYPE_ANON;
+	else if (swappiness == 1)
+		type = TYPE_FILE;
+	else if (swappiness == 200)
+		type = TYPE_ANON;
+	else
+		type = get_type_to_scan(lruvec, swappiness, &tier);
+
+	for (i = !swappiness; i < ANON_AND_FILE; i++) {
+		if (tier < 0)
+			tier = get_tier_idx(lruvec, type);
+
+		scanned = scan_folios(lruvec, sc, type, tier, list);
+		if (scanned)
+			break;
+
+		type = !type;
+		tier = -1;
+	}
+
+	*type_scanned = type;
+
+	return scanned;
+}
+
+static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
+{
+	int type;
+	int scanned;
+	int reclaimed;
+	LIST_HEAD(list);
+	struct folio *folio;
+	enum vm_event_item item;
+	struct reclaim_stat stat;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	scanned = isolate_folios(lruvec, sc, swappiness, &type, &list);
+
+	if (try_to_inc_min_seq(lruvec, swappiness))
+		scanned++;
+
+	if (get_nr_gens(lruvec, TYPE_FILE) == MIN_NR_GENS)
+		scanned = 0;
+
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	if (list_empty(&list))
+		return scanned;
+
+	reclaimed = shrink_page_list(&list, pgdat, sc, &stat, false);
+
+	/*
+	 * To avoid livelock, don't add rejected pages back to the same lists
+	 * they were isolated from.
+	 */
+	list_for_each_entry(folio, &list, lru) {
+		if ((folio_is_file_lru(folio) || folio_test_swapcache(folio)) &&
+		    (!folio_test_reclaim(folio) ||
+		     !(folio_test_dirty(folio) || folio_test_writeback(folio))))
+			folio_set_active(folio);
+
+		folio_clear_referenced(folio);
+		folio_clear_workingset(folio);
+	}
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	move_pages_to_lru(lruvec, &list);
+
+	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
+	if (!cgroup_reclaim(sc))
+		__count_vm_events(item, reclaimed);
+	__count_memcg_events(memcg, item, reclaimed);
+	__count_vm_events(PGSTEAL_ANON + type, reclaimed);
+
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	mem_cgroup_uncharge_list(&list);
+	free_unref_page_list(&list);
+
+	sc->nr_reclaimed += reclaimed;
+
+	return scanned;
+}
+
+static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool can_swap)
+{
+	bool need_aging;
+	long nr_to_scan;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	if (mem_cgroup_below_min(memcg) ||
+	    (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim))
+		return 0;
+
+	nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, can_swap, &need_aging);
+	if (!nr_to_scan)
+		return 0;
+
+	/* reset the priority if the target has been met */
+	nr_to_scan >>= sc->nr_reclaimed < sc->nr_to_reclaim ? sc->priority : DEF_PRIORITY;
+
+	if (!mem_cgroup_online(memcg))
+		nr_to_scan++;
+
+	if (!nr_to_scan)
+		return 0;
+
+	if (!need_aging)
+		return nr_to_scan;
+
+	/* leave the work to lru_gen_age_node() */
+	if (current_is_kswapd())
+		return 0;
+
+	/* try other memcgs before going to the aging path */
+	if (!cgroup_reclaim(sc) && !sc->force_deactivate) {
+		sc->skipped_deactivate = true;
+		return 0;
+	}
+
+	inc_max_seq(lruvec, max_seq);
+
+	return nr_to_scan;
+}
+
+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct blk_plug plug;
+	long scanned = 0;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	lru_add_drain();
+
+	blk_start_plug(&plug);
+
+	while (true) {
+		int delta;
+		int swappiness;
+		long nr_to_scan;
+
+		if (sc->may_swap)
+			swappiness = get_swappiness(memcg);
+		else if (!cgroup_reclaim(sc) && get_swappiness(memcg))
+			swappiness = 1;
+		else
+			swappiness = 0;
+
+		nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
+		if (!nr_to_scan)
+			break;
+
+		delta = evict_folios(lruvec, sc, swappiness);
+		if (!delta)
+			break;
+
+		scanned += delta;
+		if (scanned >= nr_to_scan)
+			break;
+
+		cond_resched();
+	}
+
+	blk_finish_plug(&plug);
+}
+
 /******************************************************************************
  *                          initialization
  ******************************************************************************/
@@ -3123,6 +3882,16 @@  static int __init init_lru_gen(void)
 };
 late_initcall(init_lru_gen);
 
+#else
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+}
+
+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+}
+
 #endif /* CONFIG_LRU_GEN */
 
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
@@ -3136,6 +3905,11 @@  static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	struct blk_plug plug;
 	bool scan_adjusted;
 
+	if (lru_gen_enabled()) {
+		lru_gen_shrink_lruvec(lruvec, sc);
+		return;
+	}
+
 	get_scan_count(lruvec, sc, nr);
 
 	/* Record the original scan target for proportional adjustments later */
@@ -3640,6 +4414,9 @@  static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
 	struct lruvec *target_lruvec;
 	unsigned long refaults;
 
+	if (lru_gen_enabled())
+		return;
+
 	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
 	target_lruvec->refaults[0] = refaults;
@@ -4010,6 +4787,11 @@  static void age_active_anon(struct pglist_data *pgdat,
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
+	if (lru_gen_enabled()) {
+		lru_gen_age_node(pgdat, sc);
+		return;
+	}
+
 	if (!can_age_anon_pages(pgdat, sc))
 		return;
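
To see the refault feedback loop defined earlier in the mm/vmscan.c changes
in action, the following standalone sketch (illustrative only, not part of
the patch; the example numbers and the MIN_LRU_BATCH value are stand-ins)
runs the same comparison as positive_ctrl_err(). A tier whose gain-weighted
refault ratio exceeds that of the first tier is the kind of tier
sort_folio() would promote to the next generation:

	#include <stdio.h>
	#include <stdbool.h>

	#define MIN_LRU_BATCH 64	/* stand-in for the kernel constant */

	struct ctrl_pos {
		unsigned long refaulted;
		unsigned long total;
		int gain;
	};

	/* same comparison as positive_ctrl_err(): true when the PV refaults no
	 * more than the gain-weighted baseline (SP), i.e., isn't worth protecting */
	static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
	{
		return pv->refaulted < MIN_LRU_BATCH ||
		       pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <=
		       (sp->refaulted + 1) * pv->total * pv->gain;
	}

	int main(void)
	{
		/* SP: the first tier with gain 1; PV: a higher tier with gain 2 */
		struct ctrl_pos sp = { .refaulted = 100, .total = 1000, .gain = 1 };
		struct ctrl_pos hot = { .refaulted = 900, .total = 1000, .gain = 2 };
		struct ctrl_pos cold = { .refaulted = 120, .total = 1000, .gain = 2 };

		printf("hot tier:  %s\n", positive_ctrl_err(&sp, &hot) ?
		       "evict with the first tier" : "protect (promote to min_seq+1)");
		printf("cold tier: %s\n", positive_ctrl_err(&sp, &cold) ?
		       "evict with the first tier" : "protect (promote to min_seq+1)");
		return 0;
	}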
 
diff --git a/mm/workingset.c b/mm/workingset.c
index 8c03afe1d67c..443343a3f3e3 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -187,7 +187,6 @@  static unsigned int bucket_order __read_mostly;
 static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
 			 bool workingset)
 {
-	eviction >>= bucket_order;
 	eviction &= EVICTION_MASK;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
@@ -212,10 +211,116 @@  static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 
 	*memcgidp = memcgid;
 	*pgdat = NODE_DATA(nid);
-	*evictionp = entry << bucket_order;
+	*evictionp = entry;
 	*workingsetp = workingset;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+static int folio_lru_refs(struct folio *folio)
+{
+	unsigned long flags = READ_ONCE(folio->flags);
+
+	BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
+
+	/* see the comment on MAX_NR_TIERS */
+	return flags & BIT(PG_workingset) ? (flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF : 0;
+}
+
+static void *lru_gen_eviction(struct folio *folio)
+{
+	int hist, tier;
+	unsigned long token;
+	unsigned long min_seq;
+	struct lruvec *lruvec;
+	struct lru_gen_struct *lrugen;
+	int type = folio_is_file_lru(folio);
+	int refs = folio_lru_refs(folio);
+	int delta = folio_nr_pages(folio);
+	bool workingset = folio_test_workingset(folio);
+	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct pglist_data *pgdat = folio_pgdat(folio);
+
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	lrugen = &lruvec->lrugen;
+	min_seq = READ_ONCE(lrugen->min_seq[type]);
+	token = (min_seq << LRU_REFS_WIDTH) | refs;
+
+	hist = lru_hist_from_seq(min_seq);
+	tier = lru_tier_from_refs(refs + workingset);
+	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
+
+	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
+}
+
+static void lru_gen_refault(struct folio *folio, void *shadow)
+{
+	int hist, tier, refs;
+	int memcg_id;
+	bool workingset;
+	unsigned long token;
+	unsigned long min_seq;
+	struct lruvec *lruvec;
+	struct lru_gen_struct *lrugen;
+	struct mem_cgroup *memcg;
+	struct pglist_data *pgdat;
+	int type = folio_is_file_lru(folio);
+	int delta = folio_nr_pages(folio);
+
+	unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset);
+
+	refs = token & (BIT(LRU_REFS_WIDTH) - 1);
+	if (refs && !workingset)
+		return;
+
+	if (folio_pgdat(folio) != pgdat)
+		return;
+
+	rcu_read_lock();
+	memcg = folio_memcg_rcu(folio);
+	if (mem_cgroup_id(memcg) != memcg_id)
+		goto unlock;
+
+	token >>= LRU_REFS_WIDTH;
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	lrugen = &lruvec->lrugen;
+	min_seq = READ_ONCE(lrugen->min_seq[type]);
+	if (token != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH)))
+		goto unlock;
+
+	hist = lru_hist_from_seq(min_seq);
+	tier = lru_tier_from_refs(refs + workingset);
+	atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
+	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
+
+	/*
+	 * Count the following two cases as stalls:
+	 * 1) For pages accessed through page tables, hotter pages pushed out
+	 *    hot pages which refaulted immediately.
+	 * 2) For pages accessed through file descriptors, numbers of accesses
+	 *    might have been beyond the limit.
+	 */
+	if (lru_gen_in_fault() || refs + workingset == BIT(LRU_REFS_WIDTH)) {
+		folio_set_workingset(folio);
+		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
+	}
+unlock:
+	rcu_read_unlock();
+}
+
+#else
+
+static void *lru_gen_eviction(struct folio *folio)
+{
+	return NULL;
+}
+
+static void lru_gen_refault(struct folio *folio, void *shadow)
+{
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 /**
  * workingset_age_nonresident - age non-resident entries as LRU ages
  * @lruvec: the lruvec that was aged
@@ -264,10 +369,14 @@  void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
+	if (lru_gen_enabled())
+		return lru_gen_eviction(page_folio(page));
+
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
 	eviction = atomic_long_read(&lruvec->nonresident_age);
+	eviction >>= bucket_order;
 	workingset_age_nonresident(lruvec, thp_nr_pages(page));
 	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
@@ -297,7 +406,13 @@  void workingset_refault(struct folio *folio, void *shadow)
 	int memcgid;
 	long nr;
 
+	if (lru_gen_enabled()) {
+		lru_gen_refault(folio, shadow);
+		return;
+	}
+
 	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
+	eviction <<= bucket_order;
 
 	rcu_read_lock();
 	/*