mbox series

[v6,0/9] Multigenerational LRU Framework

Message ID 20220104202227.2903605-1-yuzhao@google.com (mailing list archive)
Headers show
Series Multigenerational LRU Framework | expand

Message

Yu Zhao Jan. 4, 2022, 8:22 p.m. UTC
TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and
straightforward.

Design objectives
=================
The design objectives are:
1. Better representation of access recency
2. Try to profit from spatial locality
3. Clear fast path making obvious choices
4. Simple self-correcting heuristics

The representation of access recency is at the core of all LRU
approximations. The multigenerational LRU (MGLRU) divides pages into
multiple lists (generations), each having bounded access recency (a
time interval). Generations establish a common frame of reference and
help make better choices, e.g., between different memcgs on a computer
or different computers in a data center (for cluster job scheduling).

Exploiting spatial locality improves the efficiency when gathering the
accessed bit. A rmap walk targets a single page and doesn't try to
profit from discovering an accessed PTE. A page table walk can sweep
all hotspots in an address space, but its search space can be too
large to make a profit. The key is to optimize both methods and use
them in combination. (PMU is another option for further exploration.)

Fast path reduces code complexity and runtime overhead. Unmapped pages
don't require TLB flushes; clean pages don't require writeback. These
facts are only helpful when other conditions, e.g., access recency,
are similar. With generations as a common frame of reference,
additional factors stand out. But obvious choices might not be good
choices; thus self-correction is required (the next objective).

The benefits of simple self-correcting heuristics are self-evident.
Again with generations as a common frame of reference, this becomes
attainable. Specifically, pages in the same generation are categorized
based on additional factors, and a closed-loop control statistically
compares the refault percentages across all categories and throttles
the eviction of those that have higher percentages.

Patchset overview
=================
1. mm: x86, arm64: add arch_has_hw_pte_young()
2. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
Materializing hardware optimizations when trying to clear the accessed
bit in many PTEs. If hardware automatically sets the accessed bit in
PTEs, there is no need to worry about bursty page faults (emulating
the accessed bit). If it also sets the accessed bit in non-leaf PMD
entries, there is no need to search the PTE table pointed to by a PMD
entry that doesn't have the accessed bit set.

3. mm/vmscan.c: refactor shrink_node()
A minor refactor.

4. mm: multigenerational lru: groundwork
Adding the basic data structure and the functions to initialize it and
insert/remove pages.

5. mm: multigenerational lru: mm_struct list
An infra keeps track of mm_struct's for page table walkers and
provides them with optimizations, i.e., switch_mm() tracking and Bloom
filters.

6. mm: multigenerational lru: aging
7. mm: multigenerational lru: eviction
"The page reclaim" is a producer/consumer model. "The aging" produces
cold pages, whereas "the eviction " consumes them. Cold pages flow
through generations. The aging uses the mm_struct list infra to sweep
dense hotspots in page tables. During a page table walk, the aging
clears the accessed bit and tags accessed pages with the youngest
generation number. The eviction sorts those pages when it encounters
them. For pages in the oldest generation, eviction walks the rmap to
check the accessed bit one more time before evicting them. During an
rmap walk, the eviction feeds dense hotspots back to the aging. Dense
hotspots flow through the Bloom filters. For pages not mapped in page
tables, the eviction uses the PID controller to statistically
determine whether they have higher refaults. If so, the eviction
throttles their eviction by moving them to the next generation (the
second oldest).

8. mm: multigenerational lru: user interface
The knobs to turn on/off MGLRU and provide the userspace with
thrashing prevention, working set estimation (the aging) and proactive
reclaim (the eviction).

9. mm: multigenerational lru: Kconfig
The Kconfig options.

Benchmark results
=================
Independent lab results
-----------------------
Based on the popularity of searches [01] and the memory usage in
Google's public cloud, the most popular open-source memory-hungry
applications, in alphabetical order, are:
      Apache Cassandra      Memcached
      Apache Hadoop         MongoDB
      Apache Spark          PostgreSQL
      MariaDB (MySQL)       Redis

An independent lab evaluated MGLRU with the most widely used benchmark
suites for the above applications. They posted 960 data points along
with kernel metrics and perf profiles collected over more than 500
hours of total benchmark time. Their final reports show that, with 95%
confidence intervals (CIs), the above applications all performed
significantly better for at least part of their benchmark matrices.

On 5.14:
1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
   less wall time to sort three billion random integers, respectively,
   under the medium- and the high-concurrency conditions, when
   overcommitting memory. There were no statistically significant
   changes in wall time for the rest of the benchmark matrix.
2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
   more transactions per minute (TPM), respectively, under the medium-
   and the high-concurrency conditions, when overcommitting memory.
   There were no statistically significant changes in TPM for the rest
   of the benchmark matrix.
3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
   and [21.59, 30.02]% more operations per second (OPS), respectively,
   for sequential access, random access and Gaussian (distribution)
   access, when THP=always; 95% CIs [13.85, 15.97]% and
   [23.94, 29.92]% more OPS, respectively, for random access and
   Gaussian access, when THP=never. There were no statistically
   significant changes in OPS for the rest of the benchmark matrix.
4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
   [2.16, 3.55]% more operations per second (OPS), respectively, for
   exponential (distribution) access, random access and Zipfian
   (distribution) access, when underutilizing memory; 95% CIs
   [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
   respectively, for exponential access, random access and Zipfian
   access, when overcommitting memory.

On 5.15:
5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
   and [4.11, 7.50]% more operations per second (OPS), respectively,
   for exponential (distribution) access, random access and Zipfian
   (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
   [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
   exponential access, random access and Zipfian access, when swap was
   on.
6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
   less average wall time to finish twelve parallel TeraSort jobs,
   respectively, under the medium- and the high-concurrency
   conditions, when swap was on. There were no statistically
   significant changes in average wall time for the rest of the
   benchmark matrix.
7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
   minute (TPM) under the high-concurrency condition, when swap was
   off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
   respectively, under the medium- and the high-concurrency
   conditions, when swap was on. There were no statistically
   significant changes in TPM for the rest of the benchmark matrix.
8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
   [11.47, 19.36]% more total operations per second (OPS),
   respectively, for sequential access, random access and Gaussian
   (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
   [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
   for sequential access, random access and Gaussian access, when
   THP=never.

Our lab results
---------------
To supplement the above results, we ran the following benchmark suites
on 5.16-rc7 and found no regressions [10]. (These synthetic benchmarks
are popular among MM developers, but we prefer large-scale A/B
experiments to validate improvements.)
      fs_fio_bench_hdd_mq      pft
      fs_lmbench               pgsql-hammerdb
      fs_parallelio            redis
      fs_postmark              stream
      hackbench                sysbenchthread
      kernbench                tpcc_spark
      memcached                unixbench
      multichase               vm-scalability
      mutilate                 will-it-scale
      nginx

[01] https://trends.google.com
[02] https://lore.kernel.org/linux-mm/20211102002002.92051-1-bot@edi.works/
[03] https://lore.kernel.org/linux-mm/20211009054315.47073-1-bot@edi.works/
[04] https://lore.kernel.org/linux-mm/20211021194103.65648-1-bot@edi.works/
[05] https://lore.kernel.org/linux-mm/20211109021346.50266-1-bot@edi.works/
[06] https://lore.kernel.org/linux-mm/20211202062806.80365-1-bot@edi.works/
[07] https://lore.kernel.org/linux-mm/20211209072416.33606-1-bot@edi.works/
[08] https://lore.kernel.org/linux-mm/20211218071041.24077-1-bot@edi.works/
[09] https://lore.kernel.org/linux-mm/20211122053248.57311-1-bot@edi.works/
[10] https://lore.kernel.org/linux-mm/20220104202247.2903702-1-yuzhao@google.com/

Read-world applications
=======================
Third-party testimonials
------------------------
Konstantin wrote [11]:
   I have Archlinux with 8G RAM + zswap + swap. While developing, I
   have lots of apps opened such as multiple LSP-servers for different
   langs, chats, two browsers, etc... Usually, my system gets quickly
   to a point of SWAP-storms, where I have to kill LSP-servers,
   restart browsers to free memory, etc, otherwise the system lags
   heavily and is barely usable.
   
   1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
   patchset, and I started up by opening lots of apps to create memory
   pressure, and worked for a day like this. Till now I had *not a
   single SWAP-storm*, and mind you I got 3.4G in SWAP. I was never
   getting to the point of 3G in SWAP before without a single
   SWAP-storm.

The Arch Linux Zen kernel [12] has been using MGLRU since 5.12. Many
of its users reported their positive experiences to me, e.g., Shivodit
wrote:
   I've tried the latest Zen kernel (5.14.13-zen1-1-zen in the
   archlinux testing repos), everything's been smooth so far. I also
   decided to copy a large volume of files to check performance under
   I/O load, and everything went smoothly - no stuttering was present,
   everything was responsive.

Large-scale deployments
-----------------------
We've rolled out MGLRU to tens of millions of Chrome OS users and
about a million Android users. Google's fleetwide profiling [13] shows
an overall 40% decrease in kswapd CPU usage, in addition to
improvements in other UX metrics, e.g., an 85% decrease in the number
of low-memory kills at the 75th percentile and an 18% decrease in
rendering latency at the 50th percentile.

[11] https://lore.kernel.org/linux-mm/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
[12] https://github.com/zen-kernel/zen-kernel/
[13] https://research.google/pubs/pub44271/

Summery
=======
The facts are:
1. The independent lab results and the real-world applications
   indicate substantial improvements; there are no known regressions.
2. Thrashing prevention, working set estimation and proactive reclaim
   work out of the box; there are no equivalent solutions.
3. There is a lot of new code; nobody has demonstrated smaller changes
   with similar effects.

Our options, accordingly, are:
1. Given the amount of evidence, the reported improvements will likely
   materialize for a wide range of workloads.
2. Gauging the interest from the past discussions [14][15][16], the
   new features will likely be put to use for both personal computers
   and data centers.
3. Based on Google's track record, the new code will likely be well
   maintained in the long term. It'd be more difficult if not
   impossible to achieve similar effects on top of the existing
   design.

[14] https://lore.kernel.org/lkml/20201005081313.732745-1-andrea.righi@canonical.com/
[15] https://lore.kernel.org/lkml/20210716081449.22187-1-sj38.park@gmail.com/
[16] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/

Yu Zhao (9):
  mm: x86, arm64: add arch_has_hw_pte_young()
  mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
  mm/vmscan.c: refactor shrink_node()
  mm: multigenerational lru: groundwork
  mm: multigenerational lru: mm_struct list
  mm: multigenerational lru: aging
  mm: multigenerational lru: eviction
  mm: multigenerational lru: user interface
  mm: multigenerational lru: Kconfig

 Documentation/vm/index.rst          |    1 +
 Documentation/vm/multigen_lru.rst   |   80 +
 arch/Kconfig                        |    9 +
 arch/arm64/include/asm/cpufeature.h |    5 +
 arch/arm64/include/asm/pgtable.h    |   13 +-
 arch/arm64/kernel/cpufeature.c      |   19 +
 arch/arm64/tools/cpucaps            |    1 +
 arch/x86/Kconfig                    |    1 +
 arch/x86/include/asm/pgtable.h      |    9 +-
 arch/x86/mm/pgtable.c               |    5 +-
 fs/exec.c                           |    2 +
 fs/fuse/dev.c                       |    3 +-
 include/linux/cgroup.h              |   15 +-
 include/linux/memcontrol.h          |   11 +
 include/linux/mm.h                  |   42 +
 include/linux/mm_inline.h           |  204 ++
 include/linux/mm_types.h            |   78 +
 include/linux/mmzone.h              |  175 ++
 include/linux/nodemask.h            |    1 +
 include/linux/oom.h                 |   16 +
 include/linux/page-flags-layout.h   |   19 +-
 include/linux/page-flags.h          |    4 +-
 include/linux/pgtable.h             |   17 +-
 include/linux/sched.h               |    4 +
 include/linux/swap.h                |    4 +
 kernel/bounds.c                     |    3 +
 kernel/cgroup/cgroup-internal.h     |    1 -
 kernel/exit.c                       |    1 +
 kernel/fork.c                       |    9 +
 kernel/sched/core.c                 |    1 +
 mm/Kconfig                          |   48 +
 mm/huge_memory.c                    |    3 +-
 mm/memcontrol.c                     |   26 +
 mm/memory.c                         |   21 +-
 mm/mm_init.c                        |    6 +-
 mm/oom_kill.c                       |    4 +-
 mm/page_alloc.c                     |    1 +
 mm/rmap.c                           |    7 +
 mm/swap.c                           |   51 +-
 mm/vmscan.c                         | 2691 ++++++++++++++++++++++++++-
 mm/workingset.c                     |  119 +-
 41 files changed, 3591 insertions(+), 139 deletions(-)
 create mode 100644 Documentation/vm/multigen_lru.rst

Comments

Yu Zhao Jan. 4, 2022, 8:30 p.m. UTC | #1
On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and it
> often makes poor choices about what to evict. This patchset offers an
> alternative solution that is performant, versatile and
> straightforward.

<snipped>

> Summery
> =======
> The facts are:
> 1. The independent lab results and the real-world applications
>    indicate substantial improvements; there are no known regressions.
> 2. Thrashing prevention, working set estimation and proactive reclaim
>    work out of the box; there are no equivalent solutions.
> 3. There is a lot of new code; nobody has demonstrated smaller changes
>    with similar effects.
> 
> Our options, accordingly, are:
> 1. Given the amount of evidence, the reported improvements will likely
>    materialize for a wide range of workloads.
> 2. Gauging the interest from the past discussions [14][15][16], the
>    new features will likely be put to use for both personal computers
>    and data centers.
> 3. Based on Google's track record, the new code will likely be well
>    maintained in the long term. It'd be more difficult if not
>    impossible to achieve similar effects on top of the existing
>    design.

Hi Andrew, Linus,

Can you please take a look at this patchset and let me know if it's
5.17 material?

My goal is to get it merged asap so that users can reap the benefits
and I can push the sequels. Please examine the data provided -- I
think the unprecedented coverage and the magnitude of the improvements
warrant a green light.

Thanks!
Linus Torvalds Jan. 4, 2022, 9:43 p.m. UTC | #2
On Tue, Jan 4, 2022 at 12:30 PM Yu Zhao <yuzhao@google.com> wrote:
>
> My goal is to get it merged asap so that users can reap the benefits
> and I can push the sequels. Please examine the data provided -- I
> think the unprecedented coverage and the magnitude of the improvements
> warrant a green light.

I'll leave this to Andrew. I had some stylistic nits, but all the
actual complexity is in that aging and eviction, and while I looked at
the patches, I certainly couldn't make much of a judgement on them.

The proof is in the numbers, and they look fine, but who knows what
happens when others test it. I don't see anything that looks worrisome
per se, I just see the silly small things that made me go "Eww".

             Linus
Shuang Zhai Jan. 5, 2022, 2:44 a.m. UTC | #3
Fio / pmem benchmark with MGLRU

TLDR
====
With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
and [9.26, 10.36]% higher throughput, respectively, for random
access, Zipfian (distribution) access and Gaussian (distribution)
access, when the average number of jobs per CPU is 1; 95% CIs
[42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher throughput,
respectively, for random access, Zipfian access and Gaussian access,
when the average number of jobs per CPU is 2.

Background
==========
Many applications running on warehouse-scale computers heavily use
POSIX read(2)/write(2) and page cache, e.g., Apache Kafka, a
distributed streaming application used by "more than 80% of all
Fortune 100 companies" [1] and PostgreSQL, "the world's most advanced
open source relational database" [2].

Intel DC Persistent Memory, as an affordable alternative to DRAM, can
deliver large capacity and data persistence. Specifically, the device
used in this benchmark can achieve up to 36 GiB/s and 15 GiB/s
throughput, respectively, for sequential and random read access.

Our research group at the University of Rochester focuses on the
intersection of computer architecture and system software. My current
research interest is memory management on tiered memory systems.

Matrix
======
Kernels: version [+ patchset]
* Baseline: 5.15
* Patched: 5.15 + MGLRU

Access patterns (4KB read):
* Random (uniform)
* Zipfian (theta 0.8; the recommended range is 0-2)
* Gaussian (deviation 40; the possible range is 0-100)

Concurrency conditions (the average number of jobs per CPU):
* 1
* 2

Total file size (GB): 400 (~2x memory capacity)
Total configurations: 12
Data points per configuration: 10
Total run duration (minutes) per data point: ~30

Notes
-----
1. All files were stored on pmem. Each job had the exclusive access to
   a single file.
2. Due to the hardware limitation when accessing remote pmem [3],
   numactl was used to bind the fio processes to the local pmem. Only
   one of the two NUMA nodes was used during the benchmark.
3. During dry runs, we observed that the throughput doesn't improve
   beyond 2 jobs per CPU for random access. Moreover, the patched
   kernel showed consistent improvements over the baseline kernel
   when using 3 or 4 jobs per CPU.
4. We wanted to simulate the real-world scenarios and therefore used
   default swap configuration (on). Moreover, we didn't observe any
   negative impact on performance with dry runs that disabled swap.

Procedure
=========
<for each kernel>
    grub2-reboot <baseline, patched>
    <for each concurrency condition>
        <generate test files>
        <for each access pattern>
            <for each data point>
                <reboot>
                <run fio>

Hardware
--------
Memory (GiB per socket): 192
CPU (# per socket): 40
Pmem (GiB per socket): 768

Fio
---
$ fio -version
fio-3.28

$ numactl --cpubind=0 --membind=0 fio --name=randread \
  --directory=/mnt/pmem/ --size={10G, 5G} --io_size=1000TB \
  --time_based --numjobs={40, 80} --ioengine=io_uring \
  --ramp_time=20m --runtime=10m --iodepth=128 \
  --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
  --rw=randread --random_distribution={random, zipf:0.8, normal:40} \
  --direct=0 --norandommap --group_reporting

Results
=======
Throughput
----------
The patched kernel achieved substantially higher throughput for all
three access patterns and two concurrency conditions. Specifically,
comparing the patched with the baseline kernel, fio achieved 95% CIs
[38.95, 40.26]%, [4.12, 6.64]% and [9.26, 10.36]% higher throughput,
respectively, for random access, Zipfian access, and Gaussian access,
when the average number of jobs per CPU is 1; 95% CIs [42.32, 49.15]%,
[9.44, 9.89]% and [20.99, 22.86]% higher throughput, respectively, for
random access, Zipfian access and Gaussian access, when the average
number of jobs per CPU is 2.

+---------------------+---------------+---------------+
| Mean MiB/s [95% CI] | 1 job / CPU   | 2 jobs / CPU  |
+---------------------+---------------+---------------+
| Random access       | 8411 / 11742  | 8417 / 12267  |
|                     | [3275, 3387]  | [3562, 4137]  |
+---------------------+---------------+---------------+
| Zipfian access      | 14576 / 15360 | 12932 / 14181 |
|                     | [600, 967]    | [1220, 1279]  |
+---------------------+---------------+---------------+
| Gaussian access     | 14564 / 15993 | 11513 / 14037 |
|                     | [1348, 1508]  | [2417, 2631]  |
+---------------------+---------------+---------------+
Table 1. Throughput comparison between the baseline and the patched
         kernels

The patched kernel exhibited less degradation in throughput when
running more concurrent jobs. Comparing 2 jobs per CPU with 1 job per
CPU, fio achieved 95% CIs [-11.54, -11.02]%, [-16.91, -12.01]% and
[-21.61, -20.30]% higher throughput, respectively, for random access,
Zipfian access and Gaussian access, when using the baseline kernel;
95% CIs [2.04, 6.92]%, [-8.86, -6.48]% and [-12.83, -11.64]% higher
throughput, respectively, for random access, Zipfian access and
Gaussian access, when using the patched kernel. There were no
statistically significant changes in throughput for the rest of the
test matrix.

+---------------------+-----------------+----------------+
| Mean MiB/s [95% CI] | Baseline kernel | Patched kernel |
+---------------------+-----------------+----------------+
| Random access       | 8411 / 8417     | 11741 / 12267  |
|                     | [-55, 69]       | [239, 812]     |
+---------------------+-----------------+----------------+
| Zipfian access      | 14576 / 12932   | 15360/ 14181   |
|                     | [-1682, -1607]  | [-1361, -996]  |
+---------------------+-----------------+----------------+
| Gaussian access     | 14565 / 11513   | 15993 / 14037  |
|                     | [-3147, -2957]  | [-2051, -1861] |
+---------------------+-----------------+----------------+
Table 2. Throughput comparison between 1 job per CPU and 2 jobs per
         CPU

Tail Latency
------------
Comparing the patched with the baseline kernel, fio experienced 95%
CIs [-41.77, -40.35]% and [6.64, 13.95]% higher latency at the 99th
percentile, respectively, for random access and Gaussian access, when
the average number of jobs per CPU is 1; 95% CIs [-41.97, -40.59]%,
[-47.74, -47.04]% and [-51.32, -50.27]% higher latency at the 99th
percentile, respectively, for random access, Zipfian access and
Gaussian access, when the average number of jobs per CPU is 2. There
were no statistically significant changes in latency at the 99th
percentile for the rest of the test matrix.

+------------------------------+----------------+------------------+
| 99th percentile latency (us) | 1 job / CPU    | 2 jobs / CPU     |
+------------------------------+----------------+------------------+
| Random access                | 12466 / 7347   | 25560 / 15008    |
|                              | [-5207, -5030] | [-10729, -10375] |
+------------------------------+----------------+------------------+
| Zipfian access               | 3395 / 3382    | 14563 / 7661     |
|                              | [-131, 105]    | [-6953,-6850]    |
+------------------------------+----------------+------------------+
| Gaussian access              | 3280 / 3618    | 15611 / 7681     |
|                              | [217, 457]     | [-8012, -7848]   |
+------------------------------+----------------+------------------+
Table 3. Comparison of the 99th percentile latency between the
         baseline and the patched kernels (lower is better)

Metrics collected during each run are available at:
https://github.com/zhaishuang1/MglruPerf/tree/master

A peek at 5.16-rc6
------------------
We also ran the benchmark on 5.16-rc6 with swap off. However, we
haven't collected enough data points to establish a 95% CI. Here are
a few numbers we've collected:

+----------------+------------+----------+----------------+----------+
| Access pattern | Jobs / CPU | 5.16-rc6 | 5.16-rc6-mglru | % change |
+----------------+------------+----------+----------------+----------+
| Random access  | 1          | 7467     | 10440          | 39.8%    |
+----------------+------------+----------+----------------+----------+
| Random access  | 2          | 7504     | 13417          | 78.8%    |
+----------------+------------+----------+----------------+----------+
| Random access  | 3          | 7511     | 13954          | 85.8%    |
+----------------+------------+----------+----------------+----------+
| Random access  | 4          | 7542     | 13925          | 84.6%    |
+----------------+------------+----------+----------------+----------+

Reference
=========
[1] https://kafka.apache.org/documentation/#design_filesystem
[2] https://www.postgresql.org/docs/11/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
[3] System Evaluation of the Intel Optane byte-addressable NVM, MEMSYS 2019.

Appendix
========
Throughput
----------
$ cat raw_data_fio.r
v <- c(
    # baseline 40 procs random
    8467.89, 8428.34, 8383.32, 8253.12, 8464.65, 8307.42, 8424.78, 8434.44, 8474.88, 8468.26,
    # baseline 40 procs zipf
    14570.44, 14598.03, 14550.74, 14640.29, 14591.4, 14573.35, 14503.18, 14613.39, 14598.61, 14522.27,
    # baseline 40 procs gaussian
    14504.95, 14427.23, 14652.19, 14519.47, 14557.97, 14617.92, 14555.87, 14446.94, 14678.12, 14688.33,
    # baseline 80 procs random
    8427.51, 8267.23, 8437.48, 8432.37, 8441.4, 8454.26, 8413.13, 8412.44, 8444.36, 8444.32,
    # baseline 80 procs zipf
    12980.12, 12946.43, 12911.95, 12925.83, 12952.75, 12841.44, 12920.35, 12924.19, 12944.38, 12967.72,
    # baseline 80 procs gaussian
    11666.29, 11624.72, 11454.82, 11482.36, 11462.24, 11379.46, 11691.5, 11471.19, 11402.08, 11494.13,
    # patched 40 procs random
    11706.69, 11778.1, 11774.07, 11750.07, 11744.97, 11766.65, 11727.79, 11708.41, 11745.3, 11716.45,
    # patched 40 procs zipf
    15498.31, 14647.94, 15423.35, 15467.32, 15467.05, 15342.49, 15511.34, 15414.06, 15401.1, 15431.57,
    # patched 40 procs gaussian
    15957.86, 15957.13, 16022.69, 16035.85, 16150.2, 15904.5, 15943.36, 16036.78, 16025.95, 15900.56,
    # patched 80 procs random
    12568.51, 11772.25, 11622.15, 12057.66, 11971.72, 12693.36, 12399.71, 12553.23, 12242.74, 12793.34,
    # patched 80 procs zipf
    14194.78, 14213.61, 14148.66, 14182.35, 14183.91, 14192.23, 14163.2, 14179.7, 14162.12, 14196.34,
    # patched 80 procs gaussian
    14084.86, 13706.34, 14089.42, 14058.4, 14096.74, 14108.06, 14043.41, 14072.15, 14088.44, 14024.51
)

a <- array(v, dim = c(10, 3, 2, 2))

# baseline vs patched
for (concurr in 1:2) {
    for (dist in 1:3) {
        r <- t.test(a[, dist, concurr, 1], a[, dist, concurr, 2])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("concurr%d dist%d: no significance", concurr, dist)
        } else {
            s <- sprintf("concurr%d dist%d: [%.2f, %.2f]%%", concurr, dist, -p[2], -p[1])
        }
        print(s)
    }
}

# low concurr vs high concurr
for (kern in 1:2) {
    for (dist in 1:3) {
        r <- t.test(a[, dist, 1, kern], a[, dist, 2, kern])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("kern%d dist%d: no significance", kern, dist)
        } else {
            s <- sprintf("kern%d dist%d: [%.2f, %.2f]%%", kern, dist, -p[2], -p[1])
        }
        print(s)
    }
}

$ R -q -s -f raw_data_fio.r

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -132.15, df = 11.177, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3386.514 -3275.766
sample estimates:
mean of x mean of y
  8410.71  11741.85

[1] "concurr1 dist1: [38.95, 40.26]%"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -9.5917, df = 9.4797, p-value = 3.463e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -967.8353 -600.7307
sample estimates:
mean of x mean of y
 14576.17  15360.45

[1] "concurr1 dist2: [4.12, 6.64]%"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -37.744, df = 17.33, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1508.328 -1348.850
sample estimates:
mean of x mean of y
 14564.90  15993.49

[1] "concurr1 dist3: [9.26, 10.36]%"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -30.144, df = 9.3334, p-value = 1.281e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4137.381 -3562.653
sample estimates:
mean of x mean of y
  8417.45  12267.47

[1] "concurr2 dist1: [42.32, 49.15]%"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -92.164, df = 13.276, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1279.417 -1220.931
sample estimates:
mean of x mean of y
 12931.52  14181.69

[1] "concurr2 dist2: [9.44, 9.89]%"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -49.453, df = 17.863, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2631.656 -2417.052
sample estimates:
mean of x mean of y
 11512.88  14037.23

[1] "concurr2 dist3: [20.99, 22.86]%"

	Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = -0.22947, df = 16.403, p-value = 0.8213
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -68.88155  55.40155
sample estimates:
mean of x mean of y
  8410.71   8417.45

[1] "kern1 dist1: no significance"

	Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 91.86, df = 17.875, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1607.021 1682.287
sample estimates:
mean of x mean of y
 14576.17  12931.52

[1] "kern1 dist2: [-11.54, -11.02]%"

	Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 67.477, df = 17.539, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 2956.815 3147.225
sample estimates:
mean of x mean of y
 14564.90  11512.88

[1] "kern1 dist3: [-21.61, -20.30]%"

	Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = -4.1443, df = 9.0781, p-value = 0.002459
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -812.1507 -239.0833
sample estimates:
mean of x mean of y
 11741.85  12267.47

[1] "kern2 dist1: [2.04, 6.92]%"

	Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 14.566, df = 9.1026, p-value = 1.291e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  996.0064 1361.5196
sample estimates:
mean of x mean of y
 15360.45  14181.69

[1] "kern2 dist2: [-8.86, -6.48]%"

	Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 43.826, df = 15.275, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1861.263 2051.247
sample estimates:
mean of x mean of y
 15993.49  14037.23

[1] "kern2 dist3: [-12.83, -11.64]%"

99th Percentile Latency
-----------------------
$ cat raw_data_fio_lat.r
v <- c(
    # baseline 40 procs random
    12649, 12387, 12518, 12518, 12518, 12387, 12518, 12518, 12387, 12256,
    # baseline 40 procs zipf
    3458, 3294, 3425, 3294, 3294, 3359, 3752, 3326, 3294, 3458,
    # baseline 40 procs gaussian
    3326, 3458, 3195, 3392, 3326, 3228, 3228, 3326, 3130, 3195,
    # baseline 80 procs random
    25560, 26084, 25560, 25560, 25297, 25297, 25822, 25560, 25560, 25297,
    # baseline 80 procs zipf
    14484, 14615, 14615, 14484, 14484, 14615, 14615, 14615, 14615, 14484,
    # baseline 80 procs gaussian
    15664, 15664, 15533, 15533, 15533, 15664, 15795, 15533, 15664, 15533,
    # patched 40 procs random
    7439, 7242, 7373, 7373, 7373, 7439, 7242, 7308, 7308, 7373,
    # patched 40 procs zipf
    3261, 3425, 3392, 3294, 3359, 3556, 3228, 3490, 3458, 3359,
    # patched 40 procs gaussian
    3687, 3523, 3556, 3523, 3752, 3654, 3884, 3490, 3392, 3720,
    # patched 80 procs random
    15008, 15008, 15008, 15008, 15008, 15008, 15008, 15008, 15008, 15008,
    # patched 80 procs zipf
    7701, 7635, 7701, 7701, 7635, 7635, 7701, 7635, 7635, 7635,
    # patched 80 procs gaussian
    7635, 7898, 7701, 7635, 7635, 7635, 7635, 7635, 7701, 7701
)

a <- array(v, dim = c(10, 3, 2, 2))

# baseline vs patched
for (concurr in 1:2) {
    for (dist in 1:3) {
        r <- t.test(a[, dist, concurr, 1], a[, dist, concurr, 2])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("concurr%d dist%d: no significance", concurr, dist)
        } else {
            s <- sprintf("concurr%d dist%d: [%.2f, %.2f]%%", concurr, dist, -p[2], -p[1])
        }
        print(s)
    }
}

$ R -q -s -f raw_data_fio_lat.r

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = 123.52, df = 15.287, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 5030.417 5206.783
sample estimates:
mean of x mean of y
  12465.6    7347.0

[1] "concurr1 dist1: [-41.77, -40.35]%"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = 0.23667, df = 16.437, p-value = 0.8158
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -104.7812  131.1812
sample estimates:
mean of x mean of y
   3395.4    3382.2

[1] "concurr1 dist2: no significance"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -5.9754, df = 16.001, p-value = 1.94e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -457.5065 -217.8935
sample estimates:
mean of x mean of y
   3280.4    3618.1

[1] "concurr1 dist3: [6.64, 13.95]%"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = 134.89, df = 9, p-value = 3.437e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 10374.74 10728.66
sample estimates:
mean of x mean of y
  25559.7   15008.0

[1] "concurr2 dist1: [-41.97, -40.59]%"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = 288.1, df = 13.292, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 6849.566 6952.834
sample estimates:
mean of x mean of y
  14562.6    7661.4

[1] "concurr2 dist2: [-47.74, -47.04]%"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = 203.64, df = 17.798, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 7848.616 8012.384
sample estimates:
mean of x mean of y
  15611.6    7681.1

[1] "concurr2 dist3: [-51.32, -50.27]%"
SeongJae Park Jan. 5, 2022, 8:55 a.m. UTC | #4
Hi Yu,

On Tue, 4 Jan 2022 13:22:19 -0700 Yu Zhao <yuzhao@google.com> wrote:

> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and it
> often makes poor choices about what to evict. This patchset offers an
> alternative solution that is performant, versatile and
> straightforward.
>  
[...]
> Summery
> =======
> The facts are:
> 1. The independent lab results and the real-world applications
>    indicate substantial improvements; there are no known regressions.

So impressive results!

> 2. Thrashing prevention, working set estimation and proactive reclaim
>    work out of the box; there are no equivalent solutions.

I think similar works are already available out of the box with the latest
mainline tree, though it might be suboptimal in some cases.

First, you can do thrashing prevention using DAMON-based Operation Scheme
(DAMOS)[1] with MADV_COLD action.  Second, for working set estimation, you can
either use the DAMOS again with statistics action, or the damon_aggregated
tracepoint[2].  The DAMON user space tool[3] helps the tracepoint analysis and
visualization.  Finally, for the proactive reclaim, you can again use the DAMOS
with MADV_PAGEOUT action, or simply the DAMON-based proactive reclaim
module (DAMON_RECLAIM)[4].

Nevertheless, as noted above, current DAMON based solutions might be suboptimal
for some cases.  First of all, DAMON currently doesn't provide page granularity
monitoring.  Though its monitoring results were useful for our users'
production usages, there could be different requirements and situations.
Secondly, the DAMON-based thrashing prevention wouldn't reduce the CPU usage of
the reclamation logic's access scanning.

So, to me, MGLRU patchset looks providing something that DAMON doesn't provide,
but also something that DAMON is already providing.  Specifically, the
efficient page granularity access scanning is what DAMON doesn't provide for
now.  However, the utilization of the access information for LRU list
manipulation (thrashing prevention) and proactive reclamation is similar to
what DAMON (specifically, DAMOS) provides.  Also, this patchset is reducing the
reclamation logic's CPU usage using the efficient page granularity access
scanning.

IMHO, we might be able to reduce the duplicates by integrating MGLRU in DAMON.
What I'm saying is, we could 1) introduce the efficient page granularity access
scanning, 2) reduce the reclamation logic's CPU usage by making it to use the
efficient page granularity access scanning, and 3) extend DAMON for page
granularity monitoring with the efficient access sacanning[5].  Then, users
could get the benefit of MGLRU by using DAMOS but setting it to use your
efficient page granularity access scanning.  To make it more simple, we can
extend existing kernel logics to use DAMON in the way, or implement a new
kernel module.  Additional advantages of this approach would be 1) reducing the
changes to the existing code, and 2) making the efficient page granularity
access information be utilized for more general cases.

Of course, the integration might not be so simple as seems to me now.  We could
put DAMON and MGLRU together as those are for now, and let users select what
they really want.  I think it's up to you.

I didn't read this patchset thoroughly yet, so I might missing many things.  If
so, please feel free to let me know.

[1] https://docs.kernel.org/admin-guide/mm/damon/usage.html#schemes
[2] https://docs.kernel.org/admin-guide/mm/damon/usage.html#tracepoint-for-monitoring-results
[3] https://github.com/awslabs/damo
[4] https://docs.kernel.org/admin-guide/mm/damon/reclaim.html
[5] https://docs.kernel.org/vm/damon/design.html#configurable-layers


Thanks,
SJ

[...]
Yu Zhao Jan. 5, 2022, 10:53 a.m. UTC | #5
On Wed, Jan 05, 2022 at 08:55:34AM +0000, SeongJae Park wrote:
> Hi Yu,
> 
> On Tue, 4 Jan 2022 13:22:19 -0700 Yu Zhao <yuzhao@google.com> wrote:
> 
> > TLDR
> > ====
> > The current page reclaim is too expensive in terms of CPU usage and it
> > often makes poor choices about what to evict. This patchset offers an
> > alternative solution that is performant, versatile and
> > straightforward.
> >  
> [...]
> > Summery
> > =======
> > The facts are:
> > 1. The independent lab results and the real-world applications
> >    indicate substantial improvements; there are no known regressions.
> 
> So impressive results!
> 
> > 2. Thrashing prevention, working set estimation and proactive reclaim
> >    work out of the box; there are no equivalent solutions.
> 
> I think similar works are already available out of the box with the latest
> mainline tree, though it might be suboptimal in some cases.

Ok, I will sound harsh because I hate it when people challenge facts
while having no idea what they are talking about.

Our jobs are help the leadership make best decisions by providing them
with facts, not feeding them crap.

Don't get me wrong -- you are welcome to start another thread and have
a casual discussion with me. But this thread is not for that; it's for
the leadership and stakeholder to make a decision. Check who are in
"To" and "Cc" and what my request is.

> I didn't read this patchset thoroughly yet, so I might missing many things.  If
> so, please feel free to let me know.

Yes, apparently you didn't read this patchset thoroughly, and you have
missed all things that matter to this thread.

> First, you can do thrashing prevention using DAMON-based Operation Scheme
> (DAMOS)[1] with MADV_COLD action.

Here is thrashing prevention really means, from patch 8:
  +Personal computers
  +------------------
  +:Thrashing prevention: Write ``N`` to
  + ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
  + ``N`` milliseconds from getting evicted. The OOM killer is invoked if
  + this working set can't be kept in memory. Based on the average human
  + detectable lag (~100ms), ``N=1000`` usually eliminates intolerable
  + lags due to thrashing. Larger values like ``N=3000`` make lags less
  + noticeable at the cost of more OOM kills.

It's about when to trigger OOM kills. Got it? Or probably you don't
understand what MADV_COLD is either?

> Second, for working set estimation, you can either use the DAMOS
> again with statistics action, or the damon_aggregated tracepoint[2].

This is you are suggesting:
  TRACE_EVENT(damon_aggregated,
    TP_printk("target_id=%lu nr_regions=%u %lu-%lu: %u",
              __entry->target_id, __entry->nr_regions,
              __entry->start, __entry->end, __entry->nr_accesses)

Now read my doc again:
  +Data centers
  +------------
  +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
  + format:
  +   memcg  memcg_id  memcg_path
  +     node  node_id

Have you heard of something called memcg? And NUMA node? How exactly
can this tracepoint provide information about different memcgs and
NUMA node?

> The DAMON user space tool[3] helps the tracepoint analysis and
> visualization.

What does "work out of box" mean? Should every Linux desktop, laptop
and phone user install this tool?

> Finally, for the proactive reclaim, you can again use the DAMOS
> with MADV_PAGEOUT action

How exactly does MADV_PAGEOUT find pages that are NOT mapped in page
tables? Let me tell you another fact: they are usually the cheapest to
reclaim.

> or simply the DAMON-based proactive reclaim module (DAMON_RECLAIM)[4].
> [4] https://docs.kernel.org/admin-guide/mm/damon/reclaim.html

How many knob does DAMON_RECLAIM have? 14? I lost count.

> Of course, the integration might not be so simple as seems to me now.

Look, I'm open to your suggestion. I probably should have been nicer.
So I'm sorry. I just don't appreciate alternative facts.
Borislav Petkov Jan. 5, 2022, 11:12 a.m. UTC | #6
On Wed, Jan 05, 2022 at 03:53:07AM -0700, Yu Zhao wrote:
> Look, I'm open to your suggestion. I probably should have been nicer.
> So I'm sorry. I just don't appreciate alternative facts.

Yes, you should've been *much* nicer. I'm reading lkml for pretty much
20 years now and you just made my eyebrows go up - something which
pretty much never happens these days.

So you need to check yourself before replying. Looking at git history,
you're not a newbie so you've probably picked up - at least from the
sidelines - all those code of conduct discussions. And I'm not going to
point you to it - I'm sure you can find it yourself and peruse it at
your own convenience.

Long story short: we all try to be civil to each other now, even if it
is hard sometimes.
SeongJae Park Jan. 5, 2022, 11:25 a.m. UTC | #7
On Wed, 5 Jan 2022 03:53:07 -0700 Yu Zhao <yuzhao@google.com> wrote:

> On Wed, Jan 05, 2022 at 08:55:34AM +0000, SeongJae Park wrote:
> > Hi Yu,
> > 
> > On Tue, 4 Jan 2022 13:22:19 -0700 Yu Zhao <yuzhao@google.com> wrote:
[...]
> > I think similar works are already available out of the box with the latest
> > mainline tree, though it might be suboptimal in some cases.
> 
> Ok, I will sound harsh because I hate it when people challenge facts
> while having no idea what they are talking about.
> 
> Our jobs are help the leadership make best decisions by providing them
> with facts, not feeding them crap.

I was using the word "similar", to represent this is only for a rough concept
level similarity, rather than detailed facts.  But, seems it was not enough,
sorry.  Anyway, I will not talk more and thus disturb you having the important
discussion with leaders here, as you are asking.

> 
> Don't get me wrong -- you are welcome to start another thread and have
> a casual discussion with me. But this thread is not for that; it's for
> the leadership and stakeholder to make a decision. Check who are in
> "To" and "Cc" and what my request is.

Haha.  Ok, good luck for you.


Thanks,
SJ

[...]
Yu Zhao Jan. 5, 2022, 9:06 p.m. UTC | #8
On Wed, Jan 05, 2022 at 11:25:27AM +0000, SeongJae Park wrote:
> On Wed, 5 Jan 2022 03:53:07 -0700 Yu Zhao <yuzhao@google.com> wrote:
> 
> > On Wed, Jan 05, 2022 at 08:55:34AM +0000, SeongJae Park wrote:
> > > Hi Yu,
> > > 
> > > On Tue, 4 Jan 2022 13:22:19 -0700 Yu Zhao <yuzhao@google.com> wrote:
> [...]
> > > I think similar works are already available out of the box with the latest
> > > mainline tree, though it might be suboptimal in some cases.
> > 
> > Ok, I will sound harsh because I hate it when people challenge facts
> > while having no idea what they are talking about.
> > 
> > Our jobs are help the leadership make best decisions by providing them
> > with facts, not feeding them crap.
> 
> I was using the word "similar", to represent this is only for a rough concept
> level similarity, rather than detailed facts.  But, seems it was not enough,
> sorry.  Anyway, I will not talk more and thus disturb you having the important
> discussion with leaders here, as you are asking.

First of all, I want to apologize.

I detested what I read, and I still don't like "a rough concept level
similarity" sitting next to a factual statement. But as Borislav has
reminded me, my tone did cross the line. I should have had used an
objective approach to express my (very) different views.

I hope that's all water under the bridge now. And I do plan to carry
on with what I should have had done.
Yu Zhao Jan. 5, 2022, 9:12 p.m. UTC | #9
On Tue, Jan 04, 2022 at 01:43:13PM -0800, Linus Torvalds wrote:
> On Tue, Jan 4, 2022 at 12:30 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > My goal is to get it merged asap so that users can reap the benefits
> > and I can push the sequels. Please examine the data provided -- I
> > think the unprecedented coverage and the magnitude of the improvements
> > warrant a green light.
> 
> I'll leave this to Andrew. I had some stylistic nits, but all the
> actual complexity is in that aging and eviction, and while I looked at
> the patches, I certainly couldn't make much of a judgement on them.
> 
> The proof is in the numbers, and they look fine, but who knows what
> happens when others test it. I don't see anything that looks worrisome
> per se, I just see the silly small things that made me go "Eww".

I appreciate your time, I'll address all your comments togather with
others' in the next spin, after I hear from Andrew. (I'm assuming he
will have comments too.)
Michal Hocko Jan. 7, 2022, 9:38 a.m. UTC | #10
On Tue 04-01-22 13:30:00, Yu Zhao wrote:
[...]
> Hi Andrew, Linus,
> 
> Can you please take a look at this patchset and let me know if it's
> 5.17 material?

I am still not done with the review and have seen at least few problems
that would need to be addressed.

But more fundamentally I believe there are really some important
questions to be answered. First and foremost this is a major addition
to the memory reclaim and there should be a wider consensus that we
really want to go that way. The patchset doesn't have a single ack nor
reviewed-by AFAICS. I haven't seen a lot of discussion since v2
(http://lkml.kernel.org/r/20210413065633.2782273-1-yuzhao@google.com)
nor do I see any clarification on how concerns raised there have been
addressed or at least how they are planned to be addressed.

Johannes has made some excellent points
http://lkml.kernel.org/r/YHcpzZYD2fQyWvEQ@cmpxchg.org. Let me quote
for reference part of it I find the most important:
: Realistically, I think incremental changes are unavoidable to get this
: merged upstream.
: 
: Not just in the sense that they need to be smaller changes, but also
: in the sense that they need to replace old code. It would be
: impossible to maintain both, focus development and testing resources,
: and provide a reasonably stable experience with both systems tugging
: at a complicated shared code base.
: 
: On the other hand, the existing code also has billions of hours of
: production testing and tuning. We can't throw this all out overnight -
: it needs to be surgical and the broader consequences of each step need
: to be well understood.
: 
: We also have millions of servers relying on being able to do upgrades
: for drivers and fixes in other subsystems that we can't put on hold
: until we stabilized a new reclaim implementation from scratch.

Fully agreed on all points here.

I do appreciate there is a lot of work behind this patchset and I
also do understand it has gained a considerable amount of testing as
well. Your numbers are impressive but my experience tells me that it is
equally important to understand the worst case behavior and there is not
really much mentioned about those in changelogs.

We also shouldn't ignore costs the code is adding. One of them would be
a further page flags depletion. We have been hitting problems on that
front for years and many features had to be reworked to bypass a lack of
space in page->flags.

I will be looking more into the code (especially the memcg side of it)
but I really believe that a consensus on above Johannes' points need to
be found first before this work can move forward.

Thanks!
Yu Zhao Jan. 7, 2022, 6:45 p.m. UTC | #11
On Fri, Jan 07, 2022 at 10:38:18AM +0100, Michal Hocko wrote:
> On Tue 04-01-22 13:30:00, Yu Zhao wrote:
> [...]
> > Hi Andrew, Linus,
> > 
> > Can you please take a look at this patchset and let me know if it's
> > 5.17 material?
> 
> I am still not done with the review and have seen at least few problems
> that would need to be addressed.
> 
> But more fundamentally I believe there are really some important
> questions to be answered. First and foremost this is a major addition
> to the memory reclaim and there should be a wider consensus that we
> really want to go that way. The patchset doesn't have a single ack nor
> reviewed-by AFAICS. I haven't seen a lot of discussion since v2
> (http://lkml.kernel.org/r/20210413065633.2782273-1-yuzhao@google.com)
> nor do I see any clarification on how concerns raised there have been
> addressed or at least how they are planned to be addressed.
> 
> Johannes has made some excellent points
> http://lkml.kernel.org/r/YHcpzZYD2fQyWvEQ@cmpxchg.org. Let me quote
> for reference part of it I find the most important:
> : Realistically, I think incremental changes are unavoidable to get this
> : merged upstream.
> : 
> : Not just in the sense that they need to be smaller changes, but also
> : in the sense that they need to replace old code. It would be
> : impossible to maintain both, focus development and testing resources,
> : and provide a reasonably stable experience with both systems tugging
> : at a complicated shared code base.
> : 
> : On the other hand, the existing code also has billions of hours of
> : production testing and tuning. We can't throw this all out overnight -
> : it needs to be surgical and the broader consequences of each step need
> : to be well understood.
> : 
> : We also have millions of servers relying on being able to do upgrades
> : for drivers and fixes in other subsystems that we can't put on hold
> : until we stabilized a new reclaim implementation from scratch.
> 
> Fully agreed on all points here.
> 
> I do appreciate there is a lot of work behind this patchset and I
> also do understand it has gained a considerable amount of testing as
> well. Your numbers are impressive but my experience tells me that it is
> equally important to understand the worst case behavior and there is not
> really much mentioned about those in changelogs.
> 
> We also shouldn't ignore costs the code is adding. One of them would be
> a further page flags depletion. We have been hitting problems on that
> front for years and many features had to be reworked to bypass a lack of
> space in page->flags.
> 
> I will be looking more into the code (especially the memcg side of it)
> but I really believe that a consensus on above Johannes' points need to
> be found first before this work can move forward.

Thanks for the summary. I appreciate your time and I agree your
assessment is fair.

So I've acknowledged your concerns, and you've acknowledged my numbers
(the performance improvements) are impressive.

Now we are in agreement, cheers.

Next, I argue that the benefits of this patchset outweigh its risks,
because, drawing from my past experience,
1. There have been many larger and/or riskier patchsets taken; I'll
   assemble a list if you disagree. And this patchset is fully guarded
   by #ifdef; Linus has also assessed on this point.
2. There have been none that came with the testing/benchmarking
   coverage as this one did. Please point me to some if I'm mistaken,
   and I'll gladly match them.

The numbers might not materialize in the real world; the code is not
perfect; and many other risks... But all the top eight open source
memory hogs were covered, which is unprecedented; memcached and fio
showed significant improvements and it only takes a few commands to
see for yourselves.

Regarding the acks and the reviewed-bys, I certainly can ask people
who have reaped the benefits of this patchset to do them, if it's
required. But I see less fun in that. I prefer to provide empirical
evidence and convince people who are on the other side of the aisle.
Alexey Avramov Jan. 10, 2022, 2:49 p.m. UTC | #12
Note that with vm.swappiness=0, the vm.watermark_scale_factor value does
not affect the swap possibility: MGLRU ignores sc->file_is_tiny.

With the classic 2-gen LRU swapping goes well at swappiness=0 and
high vm.watermark_scale_factor, which is expected according to the
documentation:
"At 0, the kernel will not initiate swap until the amount of free and
file-backed pages is less than the high watermark in a zone." [1]

With vm.swappiness=0, no swap occurs with any vm.watermark_scale_factor
value with MGLRU.

I used to see in practice (with MGLRU v3) the impossibility of swapping 
when vm.swappiness=0 and vm.watermark_scale_factor=1000.

At a minimum, this will require updating the documentation for
vm.swappiness.

BTW, why MGLRU doesn't use something like sc->file_is_tiny?

[1] https://github.com/torvalds/linux/blob/v5.16/Documentation/admin-guide/sysctl/vm.rst#swappiness
Michal Hocko Jan. 10, 2022, 3:39 p.m. UTC | #13
On Fri 07-01-22 11:45:40, Yu Zhao wrote:
[...]
> Next, I argue that the benefits of this patchset outweigh its risks,
> because, drawing from my past experience,
> 1. There have been many larger and/or riskier patchsets taken; I'll
>    assemble a list if you disagree.

No question about that. Changes in the reclaim path are paved with
failures and reverts and fine tuning on top of existing fine tuning.
The difference from your patchset is that they tend to be much much
smaller and go incremental and therefore easier to review.

>    And this patchset is fully guarded
>    by #ifdef; Linus has also assessed on this point.

I appreciate you made the new behavior an opt-in and therefore existing
workloads are less likely to regress. I do not think ifdefs help
all that much, though, because a) realistically the config will
likely be enabled for most distribution kernels and b) the parallel
reclaim implementation adds a maintenance overhead regardless of those
ifdef. The later point is especially worrying because the memory reclaim
is a complex and hard to review beast already. Any future changes would
need to consider both reclaim algorithms of course.

Hence I argue we really need a wider consensus this is the right
direction we want to pursue.

> 2. There have been none that came with the testing/benchmarking
>    coverage as this one did. Please point me to some if I'm mistaken,
>    and I'll gladly match them.

I do appreciate your numbers but you should realize that this is an area
that is really hard to get any conclusive testing for. We keep learning
about fallouts on workloads we haven't really anticipated or where the
runtime effects happen to disagree with our intuition. So while those
numbers are nice there are other important aspects to consider like the
maintenance cost for example.

> The numbers might not materialize in the real world; the code is not
> perfect; and many other risks... But all the top eight open source
> memory hogs were covered, which is unprecedented; memcached and fio
> showed significant improvements and it only takes a few commands to
> see for yourselves.
> 
> Regarding the acks and the reviewed-bys, I certainly can ask people
> who have reaped the benefits of this patchset to do them, if it's
> required. But I see less fun in that. I prefer to provide empirical
> evidence and convince people who are on the other side of the aisle.

I like to hear from users who benefit from your work and that certainly
gives more credit to it. But it will be the MM community to maintain the
code and address future issues.

We do not have a dedicated maintainer for the memory reclaim but
certainly there are people who have helped shaping the existing code and
have learned a lot from the past issues - like Johannes, Rik, Mel just
to name few. If I were you I would be really looking into finding an
agreement with them. I myself can help you with memcg and oom side of
the things (we already have discussions about those).

Thanks!
Yu Zhao Jan. 10, 2022, 10:04 p.m. UTC | #14
On Mon, Jan 10, 2022 at 04:39:51PM +0100, Michal Hocko wrote:
> On Fri 07-01-22 11:45:40, Yu Zhao wrote:
> [...]
> > Next, I argue that the benefits of this patchset outweigh its risks,
> > because, drawing from my past experience,
> > 1. There have been many larger and/or riskier patchsets taken; I'll
> >    assemble a list if you disagree.
> 
> No question about that. Changes in the reclaim path are paved with
> failures and reverts and fine tuning on top of existing fine tuning.
> The difference from your patchset is that they tend to be much much
> smaller and go incremental and therefore easier to review.

No argument here.

> >    And this patchset is fully guarded
> >    by #ifdef; Linus has also assessed on this point.
> 
> I appreciate you made the new behavior an opt-in and therefore existing
> workloads are less likely to regress. I do not think ifdefs help
> all that much, though, because a) realistically the config will
> likely be enabled for most distribution kernels

There is also a runtime kill switch.

> b) the parallel
> reclaim implementation adds a maintenance overhead regardless of those
> ifdef. The later point is especially worrying because the memory reclaim
> is a complex and hard to review beast already. Any future changes would
> need to consider both reclaim algorithms of course.

A perfectly legitimate concern.

If this patchset is taken:
1. There will be refactoring that makes the long-term maintenance as
   affordable as possible, i.e., similar to the SL.B model, but can
   also make runtime switch.
2. There will also be optimizations for mmu notifier (KVM), THP, etc.
3. Most importantly, Google will be committing more resource on this.
   And that's why we need to hear a decision -- our resource planning
   depends on it.

> Hence I argue we really need a wider consensus this is the right
> direction we want to pursue.

We've been doing our best to get this consensus -- we invited all
the stakeholders to meetings a long time ago -- but unfortunately we
couldn't move the needle.

I agree consensus is important. But, IMO, progress is even more
important. And personally, I'd rather try something wrong than do
nothing.

> > 2. There have been none that came with the testing/benchmarking
> >    coverage as this one did. Please point me to some if I'm mistaken,
> >    and I'll gladly match them.
> 
> I do appreciate your numbers but you should realize that this is an area
> that is really hard to get any conclusive testing for.

Fully agreed. That's why we started a new initiative, and we hope more
people will following these practices:
1. All results in this area should be reported with at least standard
   deviations, or preferably confidence intervals.
2. Real applications should be benchmarked (with synthetic load
   generator), not just synthetic benchmarks (not real applications).
3. A wide range of devices should be covered, i.e., servers, desktops,
   laptops and phones.

I'm very confident to say our benchmark reports were hold to the
highest standards. We have worked with MariaDB (company), EnterpriseDB
(Postgres), Redis (company), etc. on these reports. They have copies
of these reports (PDF version):
https://linux-mm.googlesource.com/benchmarks/

We welcome any expert in those applications to examine our reports,
and we'll be happy to run any other benchmarks or same benchmarks with
different configurations that anybody thinks it's important and we've
missed.

> We keep learning
> about fallouts on workloads we haven't really anticipated or where the
> runtime effects happen to disagree with our intuition. So while those
> numbers are nice there are other important aspects to consider like the
> maintenance cost for example.

I assume we agree this is not an easy decision. Can I also assume we
agree that this decision should be make within a reasonable time frame?

> > The numbers might not materialize in the real world; the code is not
> > perfect; and many other risks... But all the top eight open source
> > memory hogs were covered, which is unprecedented; memcached and fio
> > showed significant improvements and it only takes a few commands to
> > see for yourselves.
> > 
> > Regarding the acks and the reviewed-bys, I certainly can ask people
> > who have reaped the benefits of this patchset to do them, if it's
> > required. But I see less fun in that. I prefer to provide empirical
> > evidence and convince people who are on the other side of the aisle.
> 
> I like to hear from users who benefit from your work and that certainly
> gives more credit to it. But it will be the MM community to maintain the
> code and address future issues.

I'll ask downstream kernel maintainers (from different distros) that
have taken this patchset to ACK.

I'll ask credible testers who are professionals, researchers,
contributors to other subsystems to provide Test-by's. There are many
other individual testers I may not be able to acknowledge their
efforts, e.g., my coworker just sent this to me:

   Using that v5 for some time and confirm that difference under heavy
   load and memory pressure is significant."
https://www.phoronix.com/forums/forum/software/general-linux-open-source/1301258-mglru-is-a-very-enticing-enhancement-for-linux-in-2022#post1301275

I'll leave the reviews in your capable hands. As I said, I prefer to
convince people with empirical evidence.

> We do not have a dedicated maintainer for the memory reclaim but
> certainly there are people who have helped shaping the existing code and
> have learned a lot from the past issues - like Johannes, Rik, Mel just
> to name few. If I were you I would be really looking into finding an
> agreement with them. I myself can help you with memcg and oom side of
> the things (we already have discussions about those).

Unfortunately people have different priorities. As I said, we tried
to get all the stakeholders in the same (conference) room so that we
can make some good progress. But we failed.

Rest assured, we'll keep trying. But please understand we need to do
cost control and therefore we can't keep investing in this effort
forever. So I think it's not unreasonable, after I've addressed all
pending comments, to ask for some clear instructions from the
leadership:
    Yes
    No
    Or something specific

Thanks!
Jesse Barnes Jan. 10, 2022, 10:46 p.m. UTC | #15
> > > 2. There have been none that came with the testing/benchmarking
> > >    coverage as this one did. Please point me to some if I'm mistaken,
> > >    and I'll gladly match them.
> >
> > I do appreciate your numbers but you should realize that this is an area
> > that is really hard to get any conclusive testing for.
>
> Fully agreed. That's why we started a new initiative, and we hope more
> people will following these practices:
> 1. All results in this area should be reported with at least standard
>    deviations, or preferably confidence intervals.
> 2. Real applications should be benchmarked (with synthetic load
>    generator), not just synthetic benchmarks (not real applications).
> 3. A wide range of devices should be covered, i.e., servers, desktops,
>    laptops and phones.
>
> I'm very confident to say our benchmark reports were hold to the
> highest standards. We have worked with MariaDB (company), EnterpriseDB
> (Postgres), Redis (company), etc. on these reports. They have copies
> of these reports (PDF version):
> https://linux-mm.googlesource.com/benchmarks/
>
> We welcome any expert in those applications to examine our reports,
> and we'll be happy to run any other benchmarks or same benchmarks with
> different configurations that anybody thinks it's important and we've
> missed.

I really think this gets at the heart of the issue with mm
development, and is one of the reasons it's been extra frustrating to
not have an MM conf for the past couple of years; I think sorting out
how we measure & proceed on changes would be easier done f2f.  E.g.
concluding with a consensus that if something doesn't regress on X, Y,
and Z, and has reasonably maintainable and readable code, we should
merge it and try it out.

But since f2f isn't an option until 2052 at the earliest...

I understand the desire for an "incremental approach that gets us from
A->B".  In the abstract it sounds great.  However, with a change like
this one, I think it's highly likely that such a path would be
littered with regressions both large and small, and would probably be
more difficult to reason about than the relatively clean design of
MGLRU.  On top of that, I don't think we'll get the kind of user
feedback we need for something like this *without* merging it.  Yu has
done a tremendous job collecting data here (and the results are really
incredible), but I think we can all agree that without extensive
testing in the field with all sorts of weird codes, we're not going to
find the problematic behaviors we're concerned about.

So unless we want to eschew big mm changes entirely (we shouldn't!
look at net or scheduling for how important big rewrites are to
progress), I think we should be open to experimenting with new stuff.
We can always revert if things get too unwieldy.

None of this is to say that there may not be lots more comments on the
code or potential fixes/changes to incorporate before merging; I'm
mainly arguing about the mindset we should have to changes like this,
not all the stuff the community is already really good at (i.e.
testing and reviewing code on a nuts & bolts level).

Thanks,
Jesse
Linus Torvalds Jan. 11, 2022, 1:41 a.m. UTC | #16
On Mon, Jan 10, 2022 at 2:46 PM Jesse Barnes <jsbarnes@google.com> wrote:
>
> So unless we want to eschew big mm changes entirely (we shouldn't!
> look at net or scheduling for how important big rewrites are to
> progress), I think we should be open to experimenting with new stuff.

So I personally think this is worth going with, partly simply due to
the reported improvements that have been measured.

But also to a large extent because the whole notion of doing
multi-generational LRU isn't exactly some wackadoodle crazy thing. We
already do active vs inactive, the whole multi-generational thing just
doesn't seem to be so "far out".

 But yes, numbers talk, and I get the feeling that we just need to try
it. Maybe not 5.17, but..

           Linus
Yu Zhao Jan. 11, 2022, 8:41 a.m. UTC | #17
On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
> On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
> > TLDR
> > ====
> > The current page reclaim is too expensive in terms of CPU usage and it
> > often makes poor choices about what to evict. This patchset offers an
> > alternative solution that is performant, versatile and
> > straightforward.
> 
> <snipped>
> 
> > Summery
> > =======
> > The facts are:
> > 1. The independent lab results and the real-world applications
> >    indicate substantial improvements; there are no known regressions.
> > 2. Thrashing prevention, working set estimation and proactive reclaim
> >    work out of the box; there are no equivalent solutions.
> > 3. There is a lot of new code; nobody has demonstrated smaller changes
> >    with similar effects.
> > 
> > Our options, accordingly, are:
> > 1. Given the amount of evidence, the reported improvements will likely
> >    materialize for a wide range of workloads.
> > 2. Gauging the interest from the past discussions [14][15][16], the
> >    new features will likely be put to use for both personal computers
> >    and data centers.
> > 3. Based on Google's track record, the new code will likely be well
> >    maintained in the long term. It'd be more difficult if not
> >    impossible to achieve similar effects on top of the existing
> >    design.
> 
> Hi Andrew, Linus,
> 
> Can you please take a look at this patchset and let me know if it's
> 5.17 material?
> 
> My goal is to get it merged asap so that users can reap the benefits
> and I can push the sequels. Please examine the data provided -- I
> think the unprecedented coverage and the magnitude of the improvements
> warrant a green light.

Downstream kernel maintainers who have been carrying MGLRU for more than
3 versions, can you please provide your Acked-by tags?

Having this patchset in the mainline will make your job easier :)

   Alexandre - the XanMod Kernel maintainer
               https://xanmod.org
   
   Brian     - the Chrome OS kernel memory maintainer
               https://www.chromium.org
   
   Jan       - the Arch Linux Zen kernel maintainer
               https://archlinux.org
   
   Steven    - the Liquorix kernel maintainer
               https://liquorix.net
   
   Suleiman  - the ARCVM (Android downstream) kernel memory maintainer
               https://chromium.googlesource.com/chromiumos/third_party/kernel

Also my gratitude to those who have helped test MGLRU:

   Daniel - researcher at Michigan Tech
            benchmarked memcached
   
   Holger - who has been testing/patching/contributing to various
            subsystems since ~2008
   
   Shuang - researcher at University of Rochester
            benchmarked fio and provided a report
   
   Sofia  - EDI https://www.edi.works
            benchmarked the top eight memory hogs and provided reports

Can you please provide your Tested-by tags? This will ensure the credit
for your contributions.

Thanks!
Holger Hoffstätte Jan. 11, 2022, 8:53 a.m. UTC | #18
On 2022-01-11 09:41, Yu Zhao wrote:
> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
>> On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
>>> TLDR
>>> ====
>>> The current page reclaim is too expensive in terms of CPU usage and it
>>> often makes poor choices about what to evict. This patchset offers an
>>> alternative solution that is performant, versatile and
>>> straightforward.
>>
>> <snipped>
>>
>>> Summery
>>> =======
>>> The facts are:
>>> 1. The independent lab results and the real-world applications
>>>     indicate substantial improvements; there are no known regressions.
>>> 2. Thrashing prevention, working set estimation and proactive reclaim
>>>     work out of the box; there are no equivalent solutions.
>>> 3. There is a lot of new code; nobody has demonstrated smaller changes
>>>     with similar effects.
>>>
>>> Our options, accordingly, are:
>>> 1. Given the amount of evidence, the reported improvements will likely
>>>     materialize for a wide range of workloads.
>>> 2. Gauging the interest from the past discussions [14][15][16], the
>>>     new features will likely be put to use for both personal computers
>>>     and data centers.
>>> 3. Based on Google's track record, the new code will likely be well
>>>     maintained in the long term. It'd be more difficult if not
>>>     impossible to achieve similar effects on top of the existing
>>>     design.
>>
>> Hi Andrew, Linus,
>>
>> Can you please take a look at this patchset and let me know if it's
>> 5.17 material?
>>
>> My goal is to get it merged asap so that users can reap the benefits
>> and I can push the sequels. Please examine the data provided -- I
>> think the unprecedented coverage and the magnitude of the improvements
>> warrant a green light.
> 
> Downstream kernel maintainers who have been carrying MGLRU for more than
> 3 versions, can you please provide your Acked-by tags?
> 
> Having this patchset in the mainline will make your job easier :)
> 
>     Alexandre - the XanMod Kernel maintainer
>                 https://xanmod.org
>     
>     Brian     - the Chrome OS kernel memory maintainer
>                 https://www.chromium.org
>     
>     Jan       - the Arch Linux Zen kernel maintainer
>                 https://archlinux.org
>     
>     Steven    - the Liquorix kernel maintainer
>                 https://liquorix.net
>     
>     Suleiman  - the ARCVM (Android downstream) kernel memory maintainer
>                 https://chromium.googlesource.com/chromiumos/third_party/kernel
> 
> Also my gratitude to those who have helped test MGLRU:
> 
>     Daniel - researcher at Michigan Tech
>              benchmarked memcached
>     
>     Holger - who has been testing/patching/contributing to various
>              subsystems since ~2008
>     
>     Shuang - researcher at University of Rochester
>              benchmarked fio and provided a report
>     
>     Sofia  - EDI https://www.edi.works
>              benchmarked the top eight memory hogs and provided reports
> 
> Can you please provide your Tested-by tags? This will ensure the credit
> for your contributions.
> 
> Thanks!
> 

Have been pounding on this "in production" on several different machines
(server, desktop, laptop) and 5.15.x without any issues, so:

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>

Looking forward to seeing this in mainline!

cheers,
Holger
Alexey Avramov Jan. 11, 2022, 10:24 a.m. UTC | #19
In some of my benchmarks MGLRU really gave unrivaled performance.
I assume the adoption of MGLRU into the kernel would save billions of
dollars and greatly reduce carbon dioxide emissions.

However, there are also cases where MGLRU loses.
There are cases where MGLRU does not achieve the performance that the
classic LRU gives (at least I got such results when testing MGLRU before[1],
but I did not report them here).

As a Linux user, I would like to see both variants of LRU in the kernel, so
that it is possible to switch to the suitable variant when needed: none of
the LRU variants allowed me to squeeze the maximum for all cases.

I hope to test MGLRU v6 later and show you some of its weaknesses and
anomalies with specific logs and benchmarks. 

[1] I didn't have enough time and energy to decipher the results at that time:
    https://github.com/hakavlad/cache-tests/tree/main/mg-LRU-v3_vs_classic-LRU
    (but you can try to guess what it all means)
Michal Hocko Jan. 11, 2022, 10:40 a.m. UTC | #20
On Mon 10-01-22 14:46:08, Jesse Barnes wrote:
> > > > 2. There have been none that came with the testing/benchmarking
> > > >    coverage as this one did. Please point me to some if I'm mistaken,
> > > >    and I'll gladly match them.
> > >
> > > I do appreciate your numbers but you should realize that this is an area
> > > that is really hard to get any conclusive testing for.
> >
> > Fully agreed. That's why we started a new initiative, and we hope more
> > people will following these practices:
> > 1. All results in this area should be reported with at least standard
> >    deviations, or preferably confidence intervals.
> > 2. Real applications should be benchmarked (with synthetic load
> >    generator), not just synthetic benchmarks (not real applications).
> > 3. A wide range of devices should be covered, i.e., servers, desktops,
> >    laptops and phones.
> >
> > I'm very confident to say our benchmark reports were hold to the
> > highest standards. We have worked with MariaDB (company), EnterpriseDB
> > (Postgres), Redis (company), etc. on these reports. They have copies
> > of these reports (PDF version):
> > https://linux-mm.googlesource.com/benchmarks/
> >
> > We welcome any expert in those applications to examine our reports,
> > and we'll be happy to run any other benchmarks or same benchmarks with
> > different configurations that anybody thinks it's important and we've
> > missed.
> 
> I really think this gets at the heart of the issue with mm
> development, and is one of the reasons it's been extra frustrating to
> not have an MM conf for the past couple of years; I think sorting out
> how we measure & proceed on changes would be easier done f2f.  E.g.
> concluding with a consensus that if something doesn't regress on X, Y,
> and Z, and has reasonably maintainable and readable code, we should
> merge it and try it out.

I am fully with you on that! I hope we can have LSFMM this year finally.

> But since f2f isn't an option until 2052 at the earliest...

Let's be more optimistic than that ;)

> I understand the desire for an "incremental approach that gets us from
> A->B".  In the abstract it sounds great.  However, with a change like
> this one, I think it's highly likely that such a path would be
> littered with regressions both large and small, and would probably be
> more difficult to reason about than the relatively clean design of
> MGLRU.

There are certainly things that do not make much sense to split up of
course. On the other hand the patchset is making a lot of decisions and
assumptions that are neither documented in the code nor in the
changelog. From my past experience these are really problematic from a
long term maintenance POV. We are struggling with those already because
changelog tend to be much more coarse in the past yet the code stays
with us and we have been really "great" at not touching many of those
because "something might break". This results in the complexity grow and
further maintenance burden.

> On top of that, I don't think we'll get the kind of user
> feedback we need for something like this *without* merging it.  Yu has
> done a tremendous job collecting data here (and the results are really
> incredible), but I think we can all agree that without extensive
> testing in the field with all sorts of weird codes, we're not going to
> find the problematic behaviors we're concerned about.

This is understood.

> So unless we want to eschew big mm changes entirely (we shouldn't!
> look at net or scheduling for how important big rewrites are to
> progress), I think we should be open to experimenting with new stuff.
> We can always revert if things get too unwieldy.

As long as the patchset doesn't include new user visible interfaces
which have proven to be really hard to revert.

> None of this is to say that there may not be lots more comments on the
> code or potential fixes/changes to incorporate before merging; I'm
> mainly arguing about the mindset we should have to changes like this,
> not all the stuff the community is already really good at (i.e.
> testing and reviewing code on a nuts & bolts level).

From my reading of this and previous discussions I have gathered that
there was no opposition just for the sake of it. There have been very
specific questions regarding the implementation and/or future plans to
address issues expressed in the past.

So far I have only managed to check the memcg and oom integration
finding some issues there. All of them should be fixable reasonably
easily but it also points that a deep dive into this is really
necessary.

I have also raised questions about future maintainability of the
resulting code. As you could have noticed the review power in the MM
community is lacking behind and we tend to have more code producers than
reviewers and maintainers.
Not to mention other things like page flags depletion which is something
we have been struggling for quite some time already.

All that being said there is a lot of work for such a large change to be
merged.
Shuang Zhai Jan. 11, 2022, 4:04 p.m. UTC | #21
Yu Zhao wrote:
> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
>> On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
>>> TLDR
>>> ====
>>> The current page reclaim is too expensive in terms of CPU usage and it
>>> often makes poor choices about what to evict. This patchset offers an
>>> alternative solution that is performant, versatile and
>>> straightforward.
>>
>> <snipped>
>>
>>> Summery
>>> =======
>>> The facts are:
>>> 1. The independent lab results and the real-world applications
>>>     indicate substantial improvements; there are no known regressions.
>>> 2. Thrashing prevention, working set estimation and proactive reclaim
>>>     work out of the box; there are no equivalent solutions.
>>> 3. There is a lot of new code; nobody has demonstrated smaller changes
>>>     with similar effects.
>>>
>>> Our options, accordingly, are:
>>> 1. Given the amount of evidence, the reported improvements will likely
>>>     materialize for a wide range of workloads.
>>> 2. Gauging the interest from the past discussions [14][15][16], the
>>>     new features will likely be put to use for both personal computers
>>>     and data centers.
>>> 3. Based on Google's track record, the new code will likely be well
>>>     maintained in the long term. It'd be more difficult if not
>>>     impossible to achieve similar effects on top of the existing
>>>     design.
>>
>> Hi Andrew, Linus,
>>
>> Can you please take a look at this patchset and let me know if it's
>> 5.17 material?
>>
>> My goal is to get it merged asap so that users can reap the benefits
>> and I can push the sequels. Please examine the data provided -- I
>> think the unprecedented coverage and the magnitude of the improvements
>> warrant a green light.
> 
> Downstream kernel maintainers who have been carrying MGLRU for more than
> 3 versions, can you please provide your Acked-by tags?
> 
> Having this patchset in the mainline will make your job easier :)
> 
>     Alexandre - the XanMod Kernel maintainer
>                 https://xanmod.org
>     
>     Brian     - the Chrome OS kernel memory maintainer
>                 https://www.chromium.org
>     
>     Jan       - the Arch Linux Zen kernel maintainer
>                 https://archlinux.org
>     
>     Steven    - the Liquorix kernel maintainer
>                 https://liquorix.net
>     
>     Suleiman  - the ARCVM (Android downstream) kernel memory maintainer
>                 https://chromium.googlesource.com/chromiumos/third_party/kernel
> 
> Also my gratitude to those who have helped test MGLRU:
> 
>     Daniel - researcher at Michigan Tech
>              benchmarked memcached
>     
>     Holger - who has been testing/patching/contributing to various
>              subsystems since ~2008
>     
>     Shuang - researcher at University of Rochester
>              benchmarked fio and provided a report
>     
>     Sofia  - EDI https://www.edi.works
>              benchmarked the top eight memory hogs and provided reports
> 
> Can you please provide your Tested-by tags? This will ensure the credit
> for your contributions.
> 
> Thanks!

I have tested MGLRU using fio [1]. The performance improvement is fabulous.
I hope this patchset could eventually get merged to enable large scale test
and let more users talk about their experience.

Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>

[1] https://lore.kernel.org/lkml/20220105024423.26409-1-szhai2@cs.rochester.edu/
Suleiman Souhlal Jan. 12, 2022, 1:46 a.m. UTC | #22
On Tue, Jan 11, 2022 at 5:41 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
> > On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
> > > TLDR
> > > ====
> > > The current page reclaim is too expensive in terms of CPU usage and it
> > > often makes poor choices about what to evict. This patchset offers an
> > > alternative solution that is performant, versatile and
> > > straightforward.
> >
> > <snipped>
> >
> > > Summery
> > > =======
> > > The facts are:
> > > 1. The independent lab results and the real-world applications
> > >    indicate substantial improvements; there are no known regressions.
> > > 2. Thrashing prevention, working set estimation and proactive reclaim
> > >    work out of the box; there are no equivalent solutions.
> > > 3. There is a lot of new code; nobody has demonstrated smaller changes
> > >    with similar effects.
> > >
> > > Our options, accordingly, are:
> > > 1. Given the amount of evidence, the reported improvements will likely
> > >    materialize for a wide range of workloads.
> > > 2. Gauging the interest from the past discussions [14][15][16], the
> > >    new features will likely be put to use for both personal computers
> > >    and data centers.
> > > 3. Based on Google's track record, the new code will likely be well
> > >    maintained in the long term. It'd be more difficult if not
> > >    impossible to achieve similar effects on top of the existing
> > >    design.
> >
> > Hi Andrew, Linus,
> >
> > Can you please take a look at this patchset and let me know if it's
> > 5.17 material?
> >
> > My goal is to get it merged asap so that users can reap the benefits
> > and I can push the sequels. Please examine the data provided -- I
> > think the unprecedented coverage and the magnitude of the improvements
> > warrant a green light.
>
> Downstream kernel maintainers who have been carrying MGLRU for more than
> 3 versions, can you please provide your Acked-by tags?
>
> Having this patchset in the mainline will make your job easier :)
>
>    Alexandre - the XanMod Kernel maintainer
>                https://xanmod.org
>
>    Brian     - the Chrome OS kernel memory maintainer
>                https://www.chromium.org
>
>    Jan       - the Arch Linux Zen kernel maintainer
>                https://archlinux.org
>
>    Steven    - the Liquorix kernel maintainer
>                https://liquorix.net
>
>    Suleiman  - the ARCVM (Android downstream) kernel memory maintainer
>                https://chromium.googlesource.com/chromiumos/third_party/kernel

Android on ChromeOS has been using MGLRU for a while now, with great results.
It would be great for more people to more easily be able to benefit from it.

Acked-by: Suleiman Souhlal <suleiman@google.com>

-- Suleiman
Sofia Trinh Jan. 12, 2022, 6:07 a.m. UTC | #23
On Tue, Jan 11, 2022 at 12:41 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
> > On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
> > > TLDR
> > > ====
> > > The current page reclaim is too expensive in terms of CPU usage and it
> > > often makes poor choices about what to evict. This patchset offers an
> > > alternative solution that is performant, versatile and
> > > straightforward.
> >
> > <snipped>
> >
> > > Summery
> > > =======
> > > The facts are:
> > > 1. The independent lab results and the real-world applications
> > >    indicate substantial improvements; there are no known regressions.
> > > 2. Thrashing prevention, working set estimation and proactive reclaim
> > >    work out of the box; there are no equivalent solutions.
> > > 3. There is a lot of new code; nobody has demonstrated smaller changes
> > >    with similar effects.
> > >
> > > Our options, accordingly, are:
> > > 1. Given the amount of evidence, the reported improvements will likely
> > >    materialize for a wide range of workloads.
> > > 2. Gauging the interest from the past discussions [14][15][16], the
> > >    new features will likely be put to use for both personal computers
> > >    and data centers.
> > > 3. Based on Google's track record, the new code will likely be well
> > >    maintained in the long term. It'd be more difficult if not
> > >    impossible to achieve similar effects on top of the existing
> > >    design.
> >
> > Hi Andrew, Linus,
> >
> > Can you please take a look at this patchset and let me know if it's
> > 5.17 material?
> >
> > My goal is to get it merged asap so that users can reap the benefits
> > and I can push the sequels. Please examine the data provided -- I
> > think the unprecedented coverage and the magnitude of the improvements
> > warrant a green light.
>
> Downstream kernel maintainers who have been carrying MGLRU for more than
> 3 versions, can you please provide your Acked-by tags?
>
> Having this patchset in the mainline will make your job easier :)
>
>    Alexandre - the XanMod Kernel maintainer
>                https://xanmod.org
>
>    Brian     - the Chrome OS kernel memory maintainer
>                https://www.chromium.org
>
>    Jan       - the Arch Linux Zen kernel maintainer
>                https://archlinux.org
>
>    Steven    - the Liquorix kernel maintainer
>                https://liquorix.net
>
>    Suleiman  - the ARCVM (Android downstream) kernel memory maintainer
>                https://chromium.googlesource.com/chromiumos/third_party/kernel
>
> Also my gratitude to those who have helped test MGLRU:
>
>    Daniel - researcher at Michigan Tech
>             benchmarked memcached
>
>    Holger - who has been testing/patching/contributing to various
>             subsystems since ~2008
>
>    Shuang - researcher at University of Rochester
>             benchmarked fio and provided a report
>
>    Sofia  - EDI https://www.edi.works
>             benchmarked the top eight memory hogs and provided reports

Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Oleksandr Natalenko Jan. 12, 2022, 8:56 p.m. UTC | #24
Hello.

On úterý 4. ledna 2022 21:22:19 CET Yu Zhao wrote:
> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and it
> often makes poor choices about what to evict. This patchset offers an
> alternative solution that is performant, versatile and
> straightforward.
> 
> Design objectives
> =================
> The design objectives are:
> 1. Better representation of access recency
> 2. Try to profit from spatial locality
> 3. Clear fast path making obvious choices
> 4. Simple self-correcting heuristics
> 
> The representation of access recency is at the core of all LRU
> approximations. The multigenerational LRU (MGLRU) divides pages into
> multiple lists (generations), each having bounded access recency (a
> time interval). Generations establish a common frame of reference and
> help make better choices, e.g., between different memcgs on a computer
> or different computers in a data center (for cluster job scheduling).
> 
> Exploiting spatial locality improves the efficiency when gathering the
> accessed bit. A rmap walk targets a single page and doesn't try to
> profit from discovering an accessed PTE. A page table walk can sweep
> all hotspots in an address space, but its search space can be too
> large to make a profit. The key is to optimize both methods and use
> them in combination. (PMU is another option for further exploration.)
> 
> Fast path reduces code complexity and runtime overhead. Unmapped pages
> don't require TLB flushes; clean pages don't require writeback. These
> facts are only helpful when other conditions, e.g., access recency,
> are similar. With generations as a common frame of reference,
> additional factors stand out. But obvious choices might not be good
> choices; thus self-correction is required (the next objective).
> 
> The benefits of simple self-correcting heuristics are self-evident.
> Again with generations as a common frame of reference, this becomes
> attainable. Specifically, pages in the same generation are categorized
> based on additional factors, and a closed-loop control statistically
> compares the refault percentages across all categories and throttles
> the eviction of those that have higher percentages.
> 
> Patchset overview
> =================
> 1. mm: x86, arm64: add arch_has_hw_pte_young()
> 2. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
> Materializing hardware optimizations when trying to clear the accessed
> bit in many PTEs. If hardware automatically sets the accessed bit in
> PTEs, there is no need to worry about bursty page faults (emulating
> the accessed bit). If it also sets the accessed bit in non-leaf PMD
> entries, there is no need to search the PTE table pointed to by a PMD
> entry that doesn't have the accessed bit set.
> 
> 3. mm/vmscan.c: refactor shrink_node()
> A minor refactor.
> 
> 4. mm: multigenerational lru: groundwork
> Adding the basic data structure and the functions to initialize it and
> insert/remove pages.
> 
> 5. mm: multigenerational lru: mm_struct list
> An infra keeps track of mm_struct's for page table walkers and
> provides them with optimizations, i.e., switch_mm() tracking and Bloom
> filters.
> 
> 6. mm: multigenerational lru: aging
> 7. mm: multigenerational lru: eviction
> "The page reclaim" is a producer/consumer model. "The aging" produces
> cold pages, whereas "the eviction " consumes them. Cold pages flow
> through generations. The aging uses the mm_struct list infra to sweep
> dense hotspots in page tables. During a page table walk, the aging
> clears the accessed bit and tags accessed pages with the youngest
> generation number. The eviction sorts those pages when it encounters
> them. For pages in the oldest generation, eviction walks the rmap to
> check the accessed bit one more time before evicting them. During an
> rmap walk, the eviction feeds dense hotspots back to the aging. Dense
> hotspots flow through the Bloom filters. For pages not mapped in page
> tables, the eviction uses the PID controller to statistically
> determine whether they have higher refaults. If so, the eviction
> throttles their eviction by moving them to the next generation (the
> second oldest).
> 
> 8. mm: multigenerational lru: user interface
> The knobs to turn on/off MGLRU and provide the userspace with
> thrashing prevention, working set estimation (the aging) and proactive
> reclaim (the eviction).
> 
> 9. mm: multigenerational lru: Kconfig
> The Kconfig options.
> 
> Benchmark results
> =================
> Independent lab results
> -----------------------
> Based on the popularity of searches [01] and the memory usage in
> Google's public cloud, the most popular open-source memory-hungry
> applications, in alphabetical order, are:
>       Apache Cassandra      Memcached
>       Apache Hadoop         MongoDB
>       Apache Spark          PostgreSQL
>       MariaDB (MySQL)       Redis
> 
> An independent lab evaluated MGLRU with the most widely used benchmark
> suites for the above applications. They posted 960 data points along
> with kernel metrics and perf profiles collected over more than 500
> hours of total benchmark time. Their final reports show that, with 95%
> confidence intervals (CIs), the above applications all performed
> significantly better for at least part of their benchmark matrices.
> 
> On 5.14:
> 1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
>    less wall time to sort three billion random integers, respectively,
>    under the medium- and the high-concurrency conditions, when
>    overcommitting memory. There were no statistically significant
>    changes in wall time for the rest of the benchmark matrix.
> 2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
>    more transactions per minute (TPM), respectively, under the medium-
>    and the high-concurrency conditions, when overcommitting memory.
>    There were no statistically significant changes in TPM for the rest
>    of the benchmark matrix.
> 3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
>    and [21.59, 30.02]% more operations per second (OPS), respectively,
>    for sequential access, random access and Gaussian (distribution)
>    access, when THP=always; 95% CIs [13.85, 15.97]% and
>    [23.94, 29.92]% more OPS, respectively, for random access and
>    Gaussian access, when THP=never. There were no statistically
>    significant changes in OPS for the rest of the benchmark matrix.
> 4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
>    [2.16, 3.55]% more operations per second (OPS), respectively, for
>    exponential (distribution) access, random access and Zipfian
>    (distribution) access, when underutilizing memory; 95% CIs
>    [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
>    respectively, for exponential access, random access and Zipfian
>    access, when overcommitting memory.
> 
> On 5.15:
> 5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
>    and [4.11, 7.50]% more operations per second (OPS), respectively,
>    for exponential (distribution) access, random access and Zipfian
>    (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
>    [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
>    exponential access, random access and Zipfian access, when swap was
>    on.
> 6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
>    less average wall time to finish twelve parallel TeraSort jobs,
>    respectively, under the medium- and the high-concurrency
>    conditions, when swap was on. There were no statistically
>    significant changes in average wall time for the rest of the
>    benchmark matrix.
> 7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
>    minute (TPM) under the high-concurrency condition, when swap was
>    off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
>    respectively, under the medium- and the high-concurrency
>    conditions, when swap was on. There were no statistically
>    significant changes in TPM for the rest of the benchmark matrix.
> 8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
>    [11.47, 19.36]% more total operations per second (OPS),
>    respectively, for sequential access, random access and Gaussian
>    (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
>    [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
>    for sequential access, random access and Gaussian access, when
>    THP=never.
> 
> Our lab results
> ---------------
> To supplement the above results, we ran the following benchmark suites
> on 5.16-rc7 and found no regressions [10]. (These synthetic benchmarks
> are popular among MM developers, but we prefer large-scale A/B
> experiments to validate improvements.)
>       fs_fio_bench_hdd_mq      pft
>       fs_lmbench               pgsql-hammerdb
>       fs_parallelio            redis
>       fs_postmark              stream
>       hackbench                sysbenchthread
>       kernbench                tpcc_spark
>       memcached                unixbench
>       multichase               vm-scalability
>       mutilate                 will-it-scale
>       nginx
> 
> [01] https://trends.google.com
> [02] https://lore.kernel.org/linux-mm/20211102002002.92051-1-bot@edi.works/
> [03] https://lore.kernel.org/linux-mm/20211009054315.47073-1-bot@edi.works/
> [04] https://lore.kernel.org/linux-mm/20211021194103.65648-1-bot@edi.works/
> [05] https://lore.kernel.org/linux-mm/20211109021346.50266-1-bot@edi.works/
> [06] https://lore.kernel.org/linux-mm/20211202062806.80365-1-bot@edi.works/
> [07] https://lore.kernel.org/linux-mm/20211209072416.33606-1-bot@edi.works/
> [08] https://lore.kernel.org/linux-mm/20211218071041.24077-1-bot@edi.works/
> [09] https://lore.kernel.org/linux-mm/20211122053248.57311-1-bot@edi.works/
> [10] https://lore.kernel.org/linux-mm/20220104202247.2903702-1-yuzhao@google.com/
> 
> Read-world applications
> =======================
> Third-party testimonials
> ------------------------
> Konstantin wrote [11]:
>    I have Archlinux with 8G RAM + zswap + swap. While developing, I
>    have lots of apps opened such as multiple LSP-servers for different
>    langs, chats, two browsers, etc... Usually, my system gets quickly
>    to a point of SWAP-storms, where I have to kill LSP-servers,
>    restart browsers to free memory, etc, otherwise the system lags
>    heavily and is barely usable.
>    
>    1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
>    patchset, and I started up by opening lots of apps to create memory
>    pressure, and worked for a day like this. Till now I had *not a
>    single SWAP-storm*, and mind you I got 3.4G in SWAP. I was never
>    getting to the point of 3G in SWAP before without a single
>    SWAP-storm.
> 
> The Arch Linux Zen kernel [12] has been using MGLRU since 5.12. Many
> of its users reported their positive experiences to me, e.g., Shivodit
> wrote:
>    I've tried the latest Zen kernel (5.14.13-zen1-1-zen in the
>    archlinux testing repos), everything's been smooth so far. I also
>    decided to copy a large volume of files to check performance under
>    I/O load, and everything went smoothly - no stuttering was present,
>    everything was responsive.
> 
> Large-scale deployments
> -----------------------
> We've rolled out MGLRU to tens of millions of Chrome OS users and
> about a million Android users. Google's fleetwide profiling [13] shows
> an overall 40% decrease in kswapd CPU usage, in addition to
> improvements in other UX metrics, e.g., an 85% decrease in the number
> of low-memory kills at the 75th percentile and an 18% decrease in
> rendering latency at the 50th percentile.
> 
> [11] https://lore.kernel.org/linux-mm/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
> [12] https://github.com/zen-kernel/zen-kernel/
> [13] https://research.google/pubs/pub44271/
> 
> Summery
> =======
> The facts are:
> 1. The independent lab results and the real-world applications
>    indicate substantial improvements; there are no known regressions.
> 2. Thrashing prevention, working set estimation and proactive reclaim
>    work out of the box; there are no equivalent solutions.
> 3. There is a lot of new code; nobody has demonstrated smaller changes
>    with similar effects.
> 
> Our options, accordingly, are:
> 1. Given the amount of evidence, the reported improvements will likely
>    materialize for a wide range of workloads.
> 2. Gauging the interest from the past discussions [14][15][16], the
>    new features will likely be put to use for both personal computers
>    and data centers.
> 3. Based on Google's track record, the new code will likely be well
>    maintained in the long term. It'd be more difficult if not
>    impossible to achieve similar effects on top of the existing
>    design.
> 
> [14] https://lore.kernel.org/lkml/20201005081313.732745-1-andrea.righi@canonical.com/
> [15] https://lore.kernel.org/lkml/20210716081449.22187-1-sj38.park@gmail.com/
> [16] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/
> 
> Yu Zhao (9):
>   mm: x86, arm64: add arch_has_hw_pte_young()
>   mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
>   mm/vmscan.c: refactor shrink_node()
>   mm: multigenerational lru: groundwork
>   mm: multigenerational lru: mm_struct list
>   mm: multigenerational lru: aging
>   mm: multigenerational lru: eviction
>   mm: multigenerational lru: user interface
>   mm: multigenerational lru: Kconfig
> 
>  Documentation/vm/index.rst          |    1 +
>  Documentation/vm/multigen_lru.rst   |   80 +
>  arch/Kconfig                        |    9 +
>  arch/arm64/include/asm/cpufeature.h |    5 +
>  arch/arm64/include/asm/pgtable.h    |   13 +-
>  arch/arm64/kernel/cpufeature.c      |   19 +
>  arch/arm64/tools/cpucaps            |    1 +
>  arch/x86/Kconfig                    |    1 +
>  arch/x86/include/asm/pgtable.h      |    9 +-
>  arch/x86/mm/pgtable.c               |    5 +-
>  fs/exec.c                           |    2 +
>  fs/fuse/dev.c                       |    3 +-
>  include/linux/cgroup.h              |   15 +-
>  include/linux/memcontrol.h          |   11 +
>  include/linux/mm.h                  |   42 +
>  include/linux/mm_inline.h           |  204 ++
>  include/linux/mm_types.h            |   78 +
>  include/linux/mmzone.h              |  175 ++
>  include/linux/nodemask.h            |    1 +
>  include/linux/oom.h                 |   16 +
>  include/linux/page-flags-layout.h   |   19 +-
>  include/linux/page-flags.h          |    4 +-
>  include/linux/pgtable.h             |   17 +-
>  include/linux/sched.h               |    4 +
>  include/linux/swap.h                |    4 +
>  kernel/bounds.c                     |    3 +
>  kernel/cgroup/cgroup-internal.h     |    1 -
>  kernel/exit.c                       |    1 +
>  kernel/fork.c                       |    9 +
>  kernel/sched/core.c                 |    1 +
>  mm/Kconfig                          |   48 +
>  mm/huge_memory.c                    |    3 +-
>  mm/memcontrol.c                     |   26 +
>  mm/memory.c                         |   21 +-
>  mm/mm_init.c                        |    6 +-
>  mm/oom_kill.c                       |    4 +-
>  mm/page_alloc.c                     |    1 +
>  mm/rmap.c                           |    7 +
>  mm/swap.c                           |   51 +-
>  mm/vmscan.c                         | 2691 ++++++++++++++++++++++++++-
>  mm/workingset.c                     |  119 +-
>  41 files changed, 3591 insertions(+), 139 deletions(-)
>  create mode 100644 Documentation/vm/multigen_lru.rst

For the series:

Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>

I run this (and one of the previous spins) on nine machines (physical, virtual, workstations, servers) for quite some time with no hassle.

Thanks for your job, and please keep me in Cc once you post new spins. I'm more than happy to deploy those across the fleet.
Yu Zhao Jan. 13, 2022, 8:59 a.m. UTC | #25
On Wed, Jan 12, 2022 at 09:56:58PM +0100, Oleksandr Natalenko wrote:
> Hello.
> 
> On úterý 4. ledna 2022 21:22:19 CET Yu Zhao wrote:
> > TLDR
> > ====
> > The current page reclaim is too expensive in terms of CPU usage and it
> > often makes poor choices about what to evict. This patchset offers an
> > alternative solution that is performant, versatile and
> > straightforward.

<snipped>

> For the series:
> 
> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> 
> I run this (and one of the previous spins) on nine machines (physical, virtual, workstations, servers) for quite some time with no hassle.
> 
> Thanks for your job, and please keep me in Cc once you post new spins. I'm more than happy to deploy those across the fleet.

Thanks, Oleksandr. And if I may take the liberty of introducing you as:

   Oleksandr - the post-factum kernel maintainer
               https://gitlab.com/post-factum/pf-kernel

in addition to other downstream kernel maintainers I've introduced:

>   Alexandre - the XanMod kernel maintainer
>               https://xanmod.org
>
>   Brian     - the Chrome OS kernel memory maintainer
>               https://www.chromium.org
>
>   Jan       - the Arch Linux Zen kernel maintainer
>               https://archlinux.org
>
>   Steven    - the Liquorix kernel maintainer
>               https://liquorix.net
>
>   Suleiman  - the ARCVM (Android downstream) kernel memory maintainer
>               https://chromium.googlesource.com/chromiumos/third_party/kernel
Yu Zhao Jan. 18, 2022, 9:21 a.m. UTC | #26
On Tue, Jan 11, 2022 at 01:41:22AM -0700, Yu Zhao wrote:
> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
> > On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
> > > TLDR
> > > ====
> > > The current page reclaim is too expensive in terms of CPU usage and it
> > > often makes poor choices about what to evict. This patchset offers an
> > > alternative solution that is performant, versatile and
> > > straightforward.
> > 
> > <snipped>
> > 
> > > Summery
> > > =======
> > > The facts are:
> > > 1. The independent lab results and the real-world applications
> > >    indicate substantial improvements; there are no known regressions.
> > > 2. Thrashing prevention, working set estimation and proactive reclaim
> > >    work out of the box; there are no equivalent solutions.
> > > 3. There is a lot of new code; nobody has demonstrated smaller changes
> > >    with similar effects.
> > > 
> > > Our options, accordingly, are:
> > > 1. Given the amount of evidence, the reported improvements will likely
> > >    materialize for a wide range of workloads.
> > > 2. Gauging the interest from the past discussions [14][15][16], the
> > >    new features will likely be put to use for both personal computers
> > >    and data centers.
> > > 3. Based on Google's track record, the new code will likely be well
> > >    maintained in the long term. It'd be more difficult if not
> > >    impossible to achieve similar effects on top of the existing
> > >    design.
> > 
> > Hi Andrew, Linus,
> > 
> > Can you please take a look at this patchset and let me know if it's
> > 5.17 material?
> > 
> > My goal is to get it merged asap so that users can reap the benefits
> > and I can push the sequels. Please examine the data provided -- I
> > think the unprecedented coverage and the magnitude of the improvements
> > warrant a green light.

My gratitude to Donald who has been helping test MGLRU since v2:

    Donald Carr (d@chaos-reins.com)

    Founder of Chaos Reins (http://chaos-reins.com), an SF based
    consultancy company specializing in designing/creating embedded
    Linux appliances.

Can you please provide your Tested-by tags? This will ensure the credit
for your contributions.

Thanks!
Donald Carr Jan. 18, 2022, 9:36 a.m. UTC | #27
January 18, 2022 1:21 AM, "Yu Zhao" <yuzhao@google.com> wrote:

> On Tue, Jan 11, 2022 at 01:41:22AM -0700, Yu Zhao wrote:
> 
>> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
>> On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
>>> TLDR
>>> ====
>>> The current page reclaim is too expensive in terms of CPU usage and it
>>> often makes poor choices about what to evict. This patchset offers an
>>> alternative solution that is performant, versatile and
>>> straightforward.
>> 
>> <snipped>
>> 
>>> Summery
>>> =======
>>> The facts are:
>>> 1. The independent lab results and the real-world applications
>>> indicate substantial improvements; there are no known regressions.
>>> 2. Thrashing prevention, working set estimation and proactive reclaim
>>> work out of the box; there are no equivalent solutions.
>>> 3. There is a lot of new code; nobody has demonstrated smaller changes
>>> with similar effects.
>>> 
>>> Our options, accordingly, are:
>>> 1. Given the amount of evidence, the reported improvements will likely
>>> materialize for a wide range of workloads.
>>> 2. Gauging the interest from the past discussions [14][15][16], the
>>> new features will likely be put to use for both personal computers
>>> and data centers.
>>> 3. Based on Google's track record, the new code will likely be well
>>> maintained in the long term. It'd be more difficult if not
>>> impossible to achieve similar effects on top of the existing
>>> design.
>> 
>> Hi Andrew, Linus,
>> 
>> Can you please take a look at this patchset and let me know if it's
>> 5.17 material?
>> 
>> My goal is to get it merged asap so that users can reap the benefits
>> and I can push the sequels. Please examine the data provided -- I
>> think the unprecedented coverage and the magnitude of the improvements
>> warrant a green light.
> 
> My gratitude to Donald who has been helping test MGLRU since v2:
> 
> Donald Carr (d@chaos-reins.com)
> 
> Founder of Chaos Reins (http://chaos-reins.com), an SF based
> consultancy company specializing in designing/creating embedded
> Linux appliances.

Tested-by: Donald Carr <d@chaos-reins.com>

> Can you please provide your Tested-by tags? This will ensure the credit
> for your contributions.
> 
> Thanks!
Steven Barrett Jan. 19, 2022, 8:19 p.m. UTC | #28
On Tue, Jan 11, 2022, at 2:41 AM, Yu Zhao wrote:
> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
> > On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
> > > TLDR
> > > ====
> > > The current page reclaim is too expensive in terms of CPU usage and it
> > > often makes poor choices about what to evict. This patchset offers an
> > > alternative solution that is performant, versatile and
> > > straightforward.
> > 
> > <snipped>
> > 
> > > Summery
> > > =======
> > > The facts are:
> > > 1. The independent lab results and the real-world applications
> > >    indicate substantial improvements; there are no known regressions.
> > > 2. Thrashing prevention, working set estimation and proactive reclaim
> > >    work out of the box; there are no equivalent solutions.
> > > 3. There is a lot of new code; nobody has demonstrated smaller changes
> > >    with similar effects.
> > > 
> > > Our options, accordingly, are:
> > > 1. Given the amount of evidence, the reported improvements will likely
> > >    materialize for a wide range of workloads.
> > > 2. Gauging the interest from the past discussions [14][15][16], the
> > >    new features will likely be put to use for both personal computers
> > >    and data centers.
> > > 3. Based on Google's track record, the new code will likely be well
> > >    maintained in the long term. It'd be more difficult if not
> > >    impossible to achieve similar effects on top of the existing
> > >    design.
> > 
> > Hi Andrew, Linus,
> > 
> > Can you please take a look at this patchset and let me know if it's
> > 5.17 material?
> > 
> > My goal is to get it merged asap so that users can reap the benefits
> > and I can push the sequels. Please examine the data provided -- I
> > think the unprecedented coverage and the magnitude of the improvements
> > warrant a green light.
> 
> Downstream kernel maintainers who have been carrying MGLRU for more than
> 3 versions, can you please provide your Acked-by tags?
> 
> Having this patchset in the mainline will make your job easier :)
> 
>    Alexandre - the XanMod Kernel maintainer
>                https://xanmod.org
>    
>    Brian     - the Chrome OS kernel memory maintainer
>                https://www.chromium.org
>    
>    Jan       - the Arch Linux Zen kernel maintainer
>                https://archlinux.org
>    
>    Steven    - the Liquorix kernel maintainer
>                https://liquorix.net
>    
>    Suleiman  - the ARCVM (Android downstream) kernel memory maintainer
>                https://chromium.googlesource.com/chromiumos/third_party/kernel
> 
> Also my gratitude to those who have helped test MGLRU:
> 
>    Daniel - researcher at Michigan Tech
>             benchmarked memcached
>    
>    Holger - who has been testing/patching/contributing to various
>             subsystems since ~2008
>    
>    Shuang - researcher at University of Rochester
>             benchmarked fio and provided a report
>    
>    Sofia  - EDI https://www.edi.works
>             benchmarked the top eight memory hogs and provided reports
> 
> Can you please provide your Tested-by tags? This will ensure the credit
> for your contributions.
> 
> Thanks!

This feature has been a huge improvement for desktop linux, system
responsiveness has hit a new level high memory pressure.  Thanks Yu!

Acked-by: Steven Barrett <steven@liquorix.net>
Brian Geffon Jan. 19, 2022, 10:25 p.m. UTC | #29
On Tue, Jan 11, 2022 at 3:41 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
> > On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
> > > TLDR
> > > ====
> > > The current page reclaim is too expensive in terms of CPU usage and it
> > > often makes poor choices about what to evict. This patchset offers an
> > > alternative solution that is performant, versatile and
> > > straightforward.
> >
> > <snipped>
> >
> > > Summery
> > > =======
> > > The facts are:
> > > 1. The independent lab results and the real-world applications
> > >    indicate substantial improvements; there are no known regressions.
> > > 2. Thrashing prevention, working set estimation and proactive reclaim
> > >    work out of the box; there are no equivalent solutions.
> > > 3. There is a lot of new code; nobody has demonstrated smaller changes
> > >    with similar effects.
> > >
> > > Our options, accordingly, are:
> > > 1. Given the amount of evidence, the reported improvements will likely
> > >    materialize for a wide range of workloads.
> > > 2. Gauging the interest from the past discussions [14][15][16], the
> > >    new features will likely be put to use for both personal computers
> > >    and data centers.
> > > 3. Based on Google's track record, the new code will likely be well
> > >    maintained in the long term. It'd be more difficult if not
> > >    impossible to achieve similar effects on top of the existing
> > >    design.
> >
> > Hi Andrew, Linus,
> >
> > Can you please take a look at this patchset and let me know if it's
> > 5.17 material?
> >
> > My goal is to get it merged asap so that users can reap the benefits
> > and I can push the sequels. Please examine the data provided -- I
> > think the unprecedented coverage and the magnitude of the improvements
> > warrant a green light.
>
> Downstream kernel maintainers who have been carrying MGLRU for more than
> 3 versions, can you please provide your Acked-by tags?
>
> Having this patchset in the mainline will make your job easier :)
>
>    Alexandre - the XanMod Kernel maintainer
>                https://xanmod.org
>
>    Brian     - the Chrome OS kernel memory maintainer
>                https://www.chromium.org

MGLRU has been maturing in ChromeOS for quite some time, we've
maintained it in a number of different kernels between 4.14 and 5.15,
and it's become the default
for tens of millions of users. We've seen substantial improvements in
terms of CPU utilization and memory pressure resulting in fewer OOM
kills and reduced UI latency. I would love to see this make it
upstream so more desktop users can benefit.

Acked-by: Brian Geffon <bgeffon@google.com>


>
>    Jan       - the Arch Linux Zen kernel maintainer
>                https://archlinux.org
>
>    Steven    - the Liquorix kernel maintainer
>                https://liquorix.net
>
>    Suleiman  - the ARCVM (Android downstream) kernel memory maintainer
>                https://chromium.googlesource.com/chromiumos/third_party/kernel
>
> Also my gratitude to those who have helped test MGLRU:
>
>    Daniel - researcher at Michigan Tech
>             benchmarked memcached
>
>    Holger - who has been testing/patching/contributing to various
>             subsystems since ~2008
>
>    Shuang - researcher at University of Rochester
>             benchmarked fio and provided a report
>
>    Sofia  - EDI https://www.edi.works
>             benchmarked the top eight memory hogs and provided reports
>
> Can you please provide your Tested-by tags? This will ensure the credit
> for your contributions.
>
> Thanks!
Barry Song Jan. 23, 2022, 5:43 a.m. UTC | #30
On Wed, Jan 5, 2022 at 7:17 PM Yu Zhao <yuzhao@google.com> wrote:
>
> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and it
> often makes poor choices about what to evict. This patchset offers an
> alternative solution that is performant, versatile and
> straightforward.
>
> Design objectives
> =================
> The design objectives are:
> 1. Better representation of access recency
> 2. Try to profit from spatial locality
> 3. Clear fast path making obvious choices
> 4. Simple self-correcting heuristics
>
> The representation of access recency is at the core of all LRU
> approximations. The multigenerational LRU (MGLRU) divides pages into
> multiple lists (generations), each having bounded access recency (a
> time interval). Generations establish a common frame of reference and
> help make better choices, e.g., between different memcgs on a computer
> or different computers in a data center (for cluster job scheduling).
>
> Exploiting spatial locality improves the efficiency when gathering the
> accessed bit. A rmap walk targets a single page and doesn't try to
> profit from discovering an accessed PTE. A page table walk can sweep
> all hotspots in an address space, but its search space can be too
> large to make a profit. The key is to optimize both methods and use
> them in combination. (PMU is another option for further exploration.)
>
> Fast path reduces code complexity and runtime overhead. Unmapped pages
> don't require TLB flushes; clean pages don't require writeback. These
> facts are only helpful when other conditions, e.g., access recency,
> are similar. With generations as a common frame of reference,
> additional factors stand out. But obvious choices might not be good
> choices; thus self-correction is required (the next objective).
>
> The benefits of simple self-correcting heuristics are self-evident.
> Again with generations as a common frame of reference, this becomes
> attainable. Specifically, pages in the same generation are categorized
> based on additional factors, and a closed-loop control statistically
> compares the refault percentages across all categories and throttles
> the eviction of those that have higher percentages.
>
> Patchset overview
> =================
> 1. mm: x86, arm64: add arch_has_hw_pte_young()
> 2. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
> Materializing hardware optimizations when trying to clear the accessed
> bit in many PTEs. If hardware automatically sets the accessed bit in
> PTEs, there is no need to worry about bursty page faults (emulating
> the accessed bit). If it also sets the accessed bit in non-leaf PMD
> entries, there is no need to search the PTE table pointed to by a PMD
> entry that doesn't have the accessed bit set.
>
> 3. mm/vmscan.c: refactor shrink_node()
> A minor refactor.
>
> 4. mm: multigenerational lru: groundwork
> Adding the basic data structure and the functions to initialize it and
> insert/remove pages.
>
> 5. mm: multigenerational lru: mm_struct list
> An infra keeps track of mm_struct's for page table walkers and
> provides them with optimizations, i.e., switch_mm() tracking and Bloom
> filters.
>
> 6. mm: multigenerational lru: aging
> 7. mm: multigenerational lru: eviction
> "The page reclaim" is a producer/consumer model. "The aging" produces
> cold pages, whereas "the eviction " consumes them. Cold pages flow
> through generations. The aging uses the mm_struct list infra to sweep
> dense hotspots in page tables. During a page table walk, the aging
> clears the accessed bit and tags accessed pages with the youngest
> generation number. The eviction sorts those pages when it encounters
> them. For pages in the oldest generation, eviction walks the rmap to
> check the accessed bit one more time before evicting them. During an
> rmap walk, the eviction feeds dense hotspots back to the aging. Dense
> hotspots flow through the Bloom filters. For pages not mapped in page
> tables, the eviction uses the PID controller to statistically
> determine whether they have higher refaults. If so, the eviction
> throttles their eviction by moving them to the next generation (the
> second oldest).
>
> 8. mm: multigenerational lru: user interface
> The knobs to turn on/off MGLRU and provide the userspace with
> thrashing prevention, working set estimation (the aging) and proactive
> reclaim (the eviction).
>
> 9. mm: multigenerational lru: Kconfig
> The Kconfig options.
>
> Benchmark results
> =================
> Independent lab results
> -----------------------
> Based on the popularity of searches [01] and the memory usage in
> Google's public cloud, the most popular open-source memory-hungry
> applications, in alphabetical order, are:
>       Apache Cassandra      Memcached
>       Apache Hadoop         MongoDB
>       Apache Spark          PostgreSQL
>       MariaDB (MySQL)       Redis
>
> An independent lab evaluated MGLRU with the most widely used benchmark
> suites for the above applications. They posted 960 data points along
> with kernel metrics and perf profiles collected over more than 500
> hours of total benchmark time. Their final reports show that, with 95%
> confidence intervals (CIs), the above applications all performed
> significantly better for at least part of their benchmark matrices.
>
> On 5.14:
> 1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
>    less wall time to sort three billion random integers, respectively,
>    under the medium- and the high-concurrency conditions, when
>    overcommitting memory. There were no statistically significant
>    changes in wall time for the rest of the benchmark matrix.
> 2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
>    more transactions per minute (TPM), respectively, under the medium-
>    and the high-concurrency conditions, when overcommitting memory.
>    There were no statistically significant changes in TPM for the rest
>    of the benchmark matrix.
> 3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
>    and [21.59, 30.02]% more operations per second (OPS), respectively,
>    for sequential access, random access and Gaussian (distribution)
>    access, when THP=always; 95% CIs [13.85, 15.97]% and
>    [23.94, 29.92]% more OPS, respectively, for random access and
>    Gaussian access, when THP=never. There were no statistically
>    significant changes in OPS for the rest of the benchmark matrix.
> 4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
>    [2.16, 3.55]% more operations per second (OPS), respectively, for
>    exponential (distribution) access, random access and Zipfian
>    (distribution) access, when underutilizing memory; 95% CIs
>    [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
>    respectively, for exponential access, random access and Zipfian
>    access, when overcommitting memory.
>
> On 5.15:
> 5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
>    and [4.11, 7.50]% more operations per second (OPS), respectively,
>    for exponential (distribution) access, random access and Zipfian
>    (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
>    [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
>    exponential access, random access and Zipfian access, when swap was
>    on.
> 6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
>    less average wall time to finish twelve parallel TeraSort jobs,
>    respectively, under the medium- and the high-concurrency
>    conditions, when swap was on. There were no statistically
>    significant changes in average wall time for the rest of the
>    benchmark matrix.
> 7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
>    minute (TPM) under the high-concurrency condition, when swap was
>    off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
>    respectively, under the medium- and the high-concurrency
>    conditions, when swap was on. There were no statistically
>    significant changes in TPM for the rest of the benchmark matrix.
> 8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
>    [11.47, 19.36]% more total operations per second (OPS),
>    respectively, for sequential access, random access and Gaussian
>    (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
>    [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
>    for sequential access, random access and Gaussian access, when
>    THP=never.
>
> Our lab results
> ---------------
> To supplement the above results, we ran the following benchmark suites
> on 5.16-rc7 and found no regressions [10]. (These synthetic benchmarks
> are popular among MM developers, but we prefer large-scale A/B
> experiments to validate improvements.)
>       fs_fio_bench_hdd_mq      pft
>       fs_lmbench               pgsql-hammerdb
>       fs_parallelio            redis
>       fs_postmark              stream
>       hackbench                sysbenchthread
>       kernbench                tpcc_spark
>       memcached                unixbench
>       multichase               vm-scalability
>       mutilate                 will-it-scale
>       nginx
>
> [01] https://trends.google.com
> [02] https://lore.kernel.org/linux-mm/20211102002002.92051-1-bot@edi.works/
> [03] https://lore.kernel.org/linux-mm/20211009054315.47073-1-bot@edi.works/
> [04] https://lore.kernel.org/linux-mm/20211021194103.65648-1-bot@edi.works/
> [05] https://lore.kernel.org/linux-mm/20211109021346.50266-1-bot@edi.works/
> [06] https://lore.kernel.org/linux-mm/20211202062806.80365-1-bot@edi.works/
> [07] https://lore.kernel.org/linux-mm/20211209072416.33606-1-bot@edi.works/
> [08] https://lore.kernel.org/linux-mm/20211218071041.24077-1-bot@edi.works/
> [09] https://lore.kernel.org/linux-mm/20211122053248.57311-1-bot@edi.works/
> [10] https://lore.kernel.org/linux-mm/20220104202247.2903702-1-yuzhao@google.com/
>
> Read-world applications
> =======================
> Third-party testimonials
> ------------------------
> Konstantin wrote [11]:
>    I have Archlinux with 8G RAM + zswap + swap. While developing, I
>    have lots of apps opened such as multiple LSP-servers for different
>    langs, chats, two browsers, etc... Usually, my system gets quickly
>    to a point of SWAP-storms, where I have to kill LSP-servers,
>    restart browsers to free memory, etc, otherwise the system lags
>    heavily and is barely usable.
>
>    1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
>    patchset, and I started up by opening lots of apps to create memory
>    pressure, and worked for a day like this. Till now I had *not a
>    single SWAP-storm*, and mind you I got 3.4G in SWAP. I was never
>    getting to the point of 3G in SWAP before without a single
>    SWAP-storm.
>
> The Arch Linux Zen kernel [12] has been using MGLRU since 5.12. Many
> of its users reported their positive experiences to me, e.g., Shivodit
> wrote:
>    I've tried the latest Zen kernel (5.14.13-zen1-1-zen in the
>    archlinux testing repos), everything's been smooth so far. I also
>    decided to copy a large volume of files to check performance under
>    I/O load, and everything went smoothly - no stuttering was present,
>    everything was responsive.
>
> Large-scale deployments
> -----------------------
> We've rolled out MGLRU to tens of millions of Chrome OS users and
> about a million Android users. Google's fleetwide profiling [13] shows
> an overall 40% decrease in kswapd CPU usage, in addition to

Hi Yu,

Was the overall 40% decrease of kswap CPU usgae seen on x86 or arm64?
And I am curious how much we are taking advantage of NONLEAF_PMD_YOUNG.
Does it help a lot in decreasing the cpu usage? If so, this might be
a good proof that arm64 also needs this hardware feature?
In short, I am curious how much the improvement in this patchset depends
on the hardware ability of NONLEAF_PMD_YOUNG.

> improvements in other UX metrics, e.g., an 85% decrease in the number
> of low-memory kills at the 75th percentile and an 18% decrease in
> rendering latency at the 50th percentile.
>
> [11] https://lore.kernel.org/linux-mm/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
> [12] https://github.com/zen-kernel/zen-kernel/
> [13] https://research.google/pubs/pub44271/
>

Thanks
Barry
Yu Zhao Jan. 25, 2022, 6:48 a.m. UTC | #31
On Sun, Jan 23, 2022 at 06:43:06PM +1300, Barry Song wrote:
> On Wed, Jan 5, 2022 at 7:17 PM Yu Zhao <yuzhao@google.com> wrote:

<snipped>

> > Large-scale deployments
> > -----------------------
> > We've rolled out MGLRU to tens of millions of Chrome OS users and
> > about a million Android users. Google's fleetwide profiling [13] shows
> > an overall 40% decrease in kswapd CPU usage, in addition to
> 
> Hi Yu,
> 
> Was the overall 40% decrease of kswap CPU usgae seen on x86 or arm64?
> And I am curious how much we are taking advantage of NONLEAF_PMD_YOUNG.
> Does it help a lot in decreasing the cpu usage?

Hi Barry,

The fleet-wide profiling data I shared was from x86. For arm64, I only
have data from synthetic benchmarks at the moment, and it also shows
similar improvements.

For Chrome OS (individual users), walk_pte_range(), the function that
would benefit from ARCH_HAS_NONLEAF_PMD_YOUNG, only uses a small
portion (<4%) of kswapd CPU time. So ARCH_HAS_NONLEAF_PMD_YOUNG isn't
that helpful.

> If so, this might be
> a good proof that arm64 also needs this hardware feature?
> In short, I am curious how much the improvement in this patchset depends
> on the hardware ability of NONLEAF_PMD_YOUNG.

For data centers, I do think ARCH_HAS_NONLEAF_PMD_YOUNG has some value.
In addition to cold/hot memory scanning, there are other use cases like
dirty tracking, which can benefit from the accessed bit on non-leaf
entries. I know some proprietary software uses this capability on x86
for different purposes than this patchset does. And AFAIK, x86 is the
only arch that supports this capability, e.g., risc-v and ppc can only
set the accessed bit in PTEs.

In fact, I've discussed this with one of the arm maintainers Will. So
please check with him too if you are interested in moving forward with
the idea. I might be able to provide with additional data if you need
it to make a decision.

Thanks.
Barry Song Jan. 28, 2022, 8:54 a.m. UTC | #32
On Tue, Jan 25, 2022 at 7:48 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Sun, Jan 23, 2022 at 06:43:06PM +1300, Barry Song wrote:
> > On Wed, Jan 5, 2022 at 7:17 PM Yu Zhao <yuzhao@google.com> wrote:
>
> <snipped>
>
> > > Large-scale deployments
> > > -----------------------
> > > We've rolled out MGLRU to tens of millions of Chrome OS users and
> > > about a million Android users. Google's fleetwide profiling [13] shows
> > > an overall 40% decrease in kswapd CPU usage, in addition to
> >
> > Hi Yu,
> >
> > Was the overall 40% decrease of kswap CPU usgae seen on x86 or arm64?
> > And I am curious how much we are taking advantage of NONLEAF_PMD_YOUNG.
> > Does it help a lot in decreasing the cpu usage?
>
> Hi Barry,
>
> The fleet-wide profiling data I shared was from x86. For arm64, I only
> have data from synthetic benchmarks at the moment, and it also shows
> similar improvements.
>
> For Chrome OS (individual users), walk_pte_range(), the function that
> would benefit from ARCH_HAS_NONLEAF_PMD_YOUNG, only uses a small
> portion (<4%) of kswapd CPU time. So ARCH_HAS_NONLEAF_PMD_YOUNG isn't
> that helpful.

Hi Yu,
Thanks!

In the current kernel, depending on reverse mapping, while memory is
under pressure,
the cpu usage of kswapd can be very very high especially while a lot of pages
have large mapcount, thus a huge reverse mapping cost.

Regarding  <4%, I guess the figure came from machines with NONLEAF_PMD_YOUNG?
In this case, we can skip many PTE scans while PMD has no accessed bit
set. But for
a machine without NONLEAF, will the figure of cpu usage be much larger?

>
> > If so, this might be
> > a good proof that arm64 also needs this hardware feature?
> > In short, I am curious how much the improvement in this patchset depends
> > on the hardware ability of NONLEAF_PMD_YOUNG.
>
> For data centers, I do think ARCH_HAS_NONLEAF_PMD_YOUNG has some value.
> In addition to cold/hot memory scanning, there are other use cases like
> dirty tracking, which can benefit from the accessed bit on non-leaf
> entries. I know some proprietary software uses this capability on x86
> for different purposes than this patchset does. And AFAIK, x86 is the
> only arch that supports this capability, e.g., risc-v and ppc can only
> set the accessed bit in PTEs.

Yep. NONLEAF is a nice feature.

btw, page table should have a separate DIRTY bit, right? wouldn't dirty page
tracking depend on the DIRTY bit rather than the accessed bit? so x86 also has
NONLEAF dirty bit? Or they are scanning accessed bit of PMD before
scanning DIRTY bits of PTEs?

>
> In fact, I've discussed this with one of the arm maintainers Will. So
> please check with him too if you are interested in moving forward with
> the idea. I might be able to provide with additional data if you need
> it to make a decision.

I am interested in running it and have some data without NONLEAF
especially while free memory is very limited and the system has memory
thrashing.

>
> Thanks.

Thanks
Barry
Yu Zhao Feb. 8, 2022, 9:16 a.m. UTC | #33
On Fri, Jan 28, 2022 at 09:54:09PM +1300, Barry Song wrote:
> On Tue, Jan 25, 2022 at 7:48 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Sun, Jan 23, 2022 at 06:43:06PM +1300, Barry Song wrote:
> > > On Wed, Jan 5, 2022 at 7:17 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > <snipped>
> >
> > > > Large-scale deployments
> > > > -----------------------
> > > > We've rolled out MGLRU to tens of millions of Chrome OS users and
> > > > about a million Android users. Google's fleetwide profiling [13] shows
> > > > an overall 40% decrease in kswapd CPU usage, in addition to
> > >
> > > Hi Yu,
> > >
> > > Was the overall 40% decrease of kswap CPU usgae seen on x86 or arm64?
> > > And I am curious how much we are taking advantage of NONLEAF_PMD_YOUNG.
> > > Does it help a lot in decreasing the cpu usage?
> >
> > Hi Barry,
> >
> > The fleet-wide profiling data I shared was from x86. For arm64, I only
> > have data from synthetic benchmarks at the moment, and it also shows
> > similar improvements.
> >
> > For Chrome OS (individual users), walk_pte_range(), the function that
> > would benefit from ARCH_HAS_NONLEAF_PMD_YOUNG, only uses a small
> > portion (<4%) of kswapd CPU time. So ARCH_HAS_NONLEAF_PMD_YOUNG isn't
> > that helpful.
> 
> Hi Yu,
> Thanks!
> 
> In the current kernel, depending on reverse mapping, while memory is
> under pressure,
> the cpu usage of kswapd can be very very high especially while a lot of pages
> have large mapcount, thus a huge reverse mapping cost.

Agreed. I've posted v7 which includes kswapd profiles collected from an
arm64 v8.2 laptop under memory pressure.

> Regarding  <4%, I guess the figure came from machines with NONLEAF_PMD_YOUNG?

No, it's from Snapdragon 7c. Please see the kswapd profiles in v7.

> In this case, we can skip many PTE scans while PMD has no accessed bit
> set. But for
> a machine without NONLEAF, will the figure of cpu usage be much larger?

So NONLEAF_PMD_YOUNG at most can save 4% CPU usage from kswapd. But
this definitely can vary, depending on the workloads.

> > > If so, this might be
> > > a good proof that arm64 also needs this hardware feature?
> > > In short, I am curious how much the improvement in this patchset depends
> > > on the hardware ability of NONLEAF_PMD_YOUNG.
> >
> > For data centers, I do think ARCH_HAS_NONLEAF_PMD_YOUNG has some value.
> > In addition to cold/hot memory scanning, there are other use cases like
> > dirty tracking, which can benefit from the accessed bit on non-leaf
> > entries. I know some proprietary software uses this capability on x86
> > for different purposes than this patchset does. And AFAIK, x86 is the
> > only arch that supports this capability, e.g., risc-v and ppc can only
> > set the accessed bit in PTEs.
> 
> Yep. NONLEAF is a nice feature.
> 
> btw, page table should have a separate DIRTY bit, right?

Yes.

> wouldn't dirty page
> tracking depend on the DIRTY bit rather than the accessed bit?

It depends on the goal.

> so x86 also has
> NONLEAF dirty bit?

No.

> Or they are scanning accessed bit of PMD before
> scanning DIRTY bits of PTEs?

A mandatory sync to disk must use the dirty bit to ensure data
integrity. But for a voluntary sync to disk, it can use the accessed
bit to narrow the search of dirty pages.

A mandatory sync is used to free specific dirty pages. A voluntary sync
is used to keep the number of dirty pages low in general and it doesn't
target any specific dirty pages.

> > In fact, I've discussed this with one of the arm maintainers Will. So
> > please check with him too if you are interested in moving forward with
> > the idea. I might be able to provide with additional data if you need
> > it to make a decision.
> 
> I am interested in running it and have some data without NONLEAF
> especially while free memory is very limited and the system has memory
> thrashing.

The v7 has a switch to disable this feature on x86. If you can run your
workloads on x86, then it might be able to help you measure the difference.