[v4,00/11] Multigenerational LRU Framework

Message ID 20210818063107.2696454-1-yuzhao@google.com

Yu Zhao Aug. 18, 2021, 6:30 a.m. UTC
TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and
straightforward.

Repo
====
git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/81/1281/1
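
After the fetch, the series is available at FETCH_HEAD and can be
checked out as usual (the branch name below is arbitrary):

  git checkout -b mglru FETCH_HEAD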

Problems
========
Active/inactive
---------------
Data centers need to predict whether a job can successfully land on a
machine without actually impacting the existing jobs. The granularity
of the active/inactive lists is too coarse for job schedulers to make
such decisions. In addition, data centers need to monitor their memory
utilization for horizontal scaling. Active and inactive are relative
terms and therefore give no insight into a pool of machines, e.g.,
aggregating the active/inactive sizes across multiple machines without
a common frame of reference yields no meaningful results.

Phones and laptops need to make good choices about what to evict
because they are more sensitive to major faults and power consumption.
Major faults can cause janks, i.e., slow UI renderings, and negatively
impact user experience. The selection between anon and file types has
been suboptimal because it is difficult to compare the access patterns
of the two types. On phones and laptops, executable pages are
frequently evicted even though many anon pages are used less
frequently. Conversely, on workstations building large projects, anon
pages are sometimes swapped out even though many file pages are used
less recently.

Fundamentally, the notion of active/inactive has very limited ability
to measure temporal locality.

Rmap walk
---------
Traversing a list of pages and searching the rmap for PTEs mapping
each page can be very expensive because those pages are likely to be
unrelated. For workloads using a high percentage of anon memory, the
rmap becomes a bottleneck in page reclaim. For example, kswapd can
easily spend more CPU time in the rmap than in anything else on
laptops running Chrome. And the kernel can spend more CPU time in the
rmap than in any other function on servers that heavily overcommit
anon memory.

Simply put, using the rmap to test the accessed bit over a large
number of pages does not take advantage of spatial locality.

Solutions
=========
Generations
-----------
This solution introduces a temporal dimension. Each generation is a
dot on the timeline and its population includes all mapped pages that
have been accessed since the birth of this generation.

All eviction choices are made based on generation numbers, which are
simple and yet effective. A large number of pages can be spread out
across many generations. Since each generation is timestamped at
birth, its population is aggregatable across different machines. This
is especially useful for data centers that require working set
estimation and proactive reclaim.

Page table walk
---------------
Each walk traverses an mm_struct list to scan PTEs for accessed pages
only. Processes that have been sleeping since the last walk are
skipped. The cost of this solution is roughly proportional to the
number of accessed pages. Since page tables usually have good spatial
locality for workloads using a high percentage of anon memory, the end
result is generally a significant reduction in kswapd CPU usage.

Note that page table walks are conditional and therefore do not
replace the rmap. For workloads that have sparse mappings, this
solution falls back to the rmap.

Use cases
=========
Page cache overcommit
---------------------
Tiers within each generation are specifically designed to improve the
performance of the page cache under memory pressure. The fio/io_uring
benchmark shows a 14% increase in IOPS for buffered I/O.

Without this patchset, the profile of fio/io_uring looks like:
  12.03%  __page_cache_alloc
   6.53%  shrink_active_list
   2.53%  mark_page_accessed

With this patchset, it looks like:
   9.45%  __page_cache_alloc
   0.52%  mark_page_accessed
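
The fio job file itself was not posted here; a buffered random-read
job along the following lines (every parameter below is an
illustrative assumption) exercises the page cache path in question:

  # buffered (page cache) random reads through io_uring
  fio --name=buffered-read --ioengine=io_uring --rw=randread \
      --direct=0 --bs=4k --size=8g --numjobs=8 \
      --time_based --runtime=300 --group_reporting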

Essentially, the idea of tiers is a feedback loop based on trial and
error. Instead of unconditionally moving file pages to the active list
upon the second access, this solution monitors refaults and
conditionally protects file pages with outlying refaults.

Anon memory overcommit
----------------------
Our real-world benchmark that browses popular websites in multiple
Chrome tabs demonstrates 51% less CPU usage from kswapd and 52% less
PSI (full).

Without this patchset, the profile of kswapd looks like:
  31.03%  page_vma_mapped_walk
  25.59%  lzo1x_1_do_compress
   4.63%  do_raw_spin_lock
   3.89%  vma_interval_tree_iter_next
   3.33%  vma_interval_tree_subtree_search

With this patchset, it looks like:
  49.36%  lzo1x_1_do_compress
   4.54%  page_vma_mapped_walk
   4.45%  memset_erms
   3.47%  walk_pte_range
   2.88%  zram_bvec_rw

In addition, direct reclaim latency is reduced by 22% at the 99th
percentile, and the number of refaults is reduced by 7%. Both metrics
are important to phones and laptops as they are highly correlated to
user experience.

Working set estimation
----------------------
Userspace can invoke the aging by writing "+ memcg_id node_id max_gen
[swappiness]" to /sys/kernel/debug/lru_gen. Reading this debugfs
interface returns the birth time and the population of each
generation.
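
For example, a minimal sketch (the memcg_id, node_id and max_gen
values below are hypothetical):

  # invoke the aging for memcg 1 on node 0, with max_gen 3
  echo '+ 1 0 3' > /sys/kernel/debug/lru_gen
  # read back the birth time and population of each generation
  cat /sys/kernel/debug/lru_gen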

Given a pool of machines, a job scheduler can periodically invoke the
aging, rank these machines by the sizes of their working sets, and in
turn select the most suitable ones to land new jobs.

Proactive reclaim
-----------------
Userspace can invoke the eviction by writing "- memcg_id node_id
min_gen [swappiness] [nr_to_reclaim]" to /sys/kernel/debug/lru_gen.
Multiple command lines are supported, as is concatenation with
delimiters "," and ";".
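
For example (again a sketch with hypothetical values), two commands
concatenated with ";":

  # evict generations older than min_gen 2 in memcg 1 and memcg 2 on
  # node 0, with swappiness 50 and nr_to_reclaim 1024
  echo '- 1 0 2 50 1024; - 2 0 2 50 1024' > /sys/kernel/debug/lru_gen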

A typical use case is that a job scheduler invokes the eviction in
anticipation of new jobs. The savings from proactive reclaim can help
these new jobs land within their SLAs.

Yu Zhao (11):
  mm: x86, arm64: add arch_has_hw_pte_young()
  mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
  mm/vmscan.c: refactor shrink_node()
  mm: multigenerational lru: groundwork
  mm: multigenerational lru: protection
  mm: multigenerational lru: mm_struct list
  mm: multigenerational lru: aging
  mm: multigenerational lru: eviction
  mm: multigenerational lru: user interface
  mm: multigenerational lru: Kconfig
  mm: multigenerational lru: documentation

 Documentation/vm/index.rst          |    1 +
 Documentation/vm/multigen_lru.rst   |  134 ++
 arch/Kconfig                        |    9 +
 arch/arm64/include/asm/cpufeature.h |   19 +-
 arch/arm64/include/asm/pgtable.h    |   10 +-
 arch/arm64/kernel/cpufeature.c      |   19 +
 arch/arm64/mm/proc.S                |   12 -
 arch/arm64/tools/cpucaps            |    1 +
 arch/x86/Kconfig                    |    1 +
 arch/x86/include/asm/pgtable.h      |    9 +-
 arch/x86/mm/pgtable.c               |    5 +-
 fs/exec.c                           |    2 +
 fs/fuse/dev.c                       |    3 +-
 include/linux/cgroup.h              |   15 +-
 include/linux/memcontrol.h          |    9 +
 include/linux/mm.h                  |   34 +
 include/linux/mm_inline.h           |  201 ++
 include/linux/mm_types.h            |  107 ++
 include/linux/mmzone.h              |  103 ++
 include/linux/nodemask.h            |    1 +
 include/linux/oom.h                 |   16 +
 include/linux/page-flags-layout.h   |   19 +-
 include/linux/page-flags.h          |    4 +-
 include/linux/pgtable.h             |   16 +-
 include/linux/sched.h               |    3 +
 include/linux/swap.h                |    1 +
 kernel/bounds.c                     |    3 +
 kernel/cgroup/cgroup-internal.h     |    1 -
 kernel/exit.c                       |    1 +
 kernel/fork.c                       |   10 +
 kernel/kthread.c                    |    1 +
 kernel/sched/core.c                 |    2 +
 mm/Kconfig                          |   59 +
 mm/huge_memory.c                    |    3 +-
 mm/memcontrol.c                     |   28 +
 mm/memory.c                         |   21 +-
 mm/mm_init.c                        |    6 +-
 mm/mmzone.c                         |    2 +
 mm/oom_kill.c                       |    4 +-
 mm/rmap.c                           |    7 +
 mm/swap.c                           |   55 +-
 mm/swapfile.c                       |    2 +
 mm/vmscan.c                         | 2674 ++++++++++++++++++++++++++-
 mm/workingset.c                     |  119 +-
 44 files changed, 3591 insertions(+), 161 deletions(-)
 create mode 100644 Documentation/vm/multigen_lru.rst

Comments

bot@edi.works Oct. 9, 2021, 5:43 a.m. UTC | #1
Kernel / MariaDB benchmark with MGLRU

TLDR
====
With the MGLRU, MariaDB achieved 95% CIs [5.24, 10.71]% and [20.22,
25.97]% more transactions per minute (TPM), respectively, under the
medium- and high-concurrency conditions when slightly overcommitting
memory. There were no statistically significant changes in TPM under
other conditions.

Rationale
=========
Memory overcommit can improve utilization and, if not overdone, can
also increase throughput. The challenges are estimating working sets
and optimizing page reclaim. The risks are performance degradations
and OOM kills. Short of overcoming the challenges, the only way to
reduce the risks is to overprovision memory.

MariaDB is one of the most popular open-source RDBMSs. HammerDB is
the leading open-source benchmarking software derived from the TPC
specifications. OLTP is the most important use case for RDBMSs.

Matrix
======
Kernels: version [+ patchset]
* Baseline: 5.14
* Patched: 5.14 + MGLRU

Memory conditions: % of memory size
* Underutilizing: ~10% on inactive file list
* Overcommitting: ~10% swapped out

Concurrency conditions: average # of users per CPU
* Low: ~3
* Medium: ~13
* High: ~19

Total configurations: 12
Data points per configuration: 10
Total run duration (minutes) per data point: ~45

Procedure
=========
The latest MGLRU patchset for the 5.14 kernel is available at
git fetch https://linux-mm.googlesource.com/page-reclaim \
  refs/changes/30/1430/1

Baseline and patched 5.14 kernel images are available at
https://drive.google.com/drive/folders/1eMkQleAFGkP2vzM_JyRA21oKE0ESHBqp

<install and configure OS>
hammerdbcli auto prep_tpcc.tcl
systemctl stop mariadb
e2image <backup /mnt/data>

<for each kernel>
    grub2-set-default <baseline / patched>
    <for each memory condition>
        <update /etc/my.cnf>
        <for each concurrency condition>
            <update run_tpcc.tcl>
            <for each data point>
                systemctl stop mariadb
                e2image <restore /mnt/data>
                reboot
                hammerdbcli auto run_tpcc.tcl
                <collect TPM>

Hardware
========
Memory (GB): 64
CPU (total #): 32
NVMe SSD (GB): 1024

OS
==
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)

$ cat /proc/swaps
Filename          Type          Size          Used     Priority
/dev/nvme0n1p3    partition     32970748      0          -2

$ mount | grep data
/dev/nvme0n1p4 on /mnt/data type ext4 (rw,relatime,seclabel)

$ cat /proc/cmdline
<existing parameters> systemd.unified_cgroup_hierarchy=1

$ cat /sys/fs/cgroup/user.slice/memory.min
4294967296

$ cat /proc/sys/vm/overcommit_memory
1

MariaDB
=======
$ mysql --version
mysql  Ver 15.1 Distrib 10.3.28-MariaDB, for Linux (x86_64) using
readline 5.1

$ cat /etc/my.cnf
<existing parameters>

[mysqld]
innodb_buffer_pool_size=<50G, 60G>
innodb_doublewrite=0
innodb_flush_log_at_trx_commit=0
innodb_flush_method=O_DIRECT_NO_FSYNC
innodb_flush_neighbors=0
innodb_io_capacity=4000
innodb_io_capacity_max=20000
innodb_log_buffer_size=1G
innodb_log_file_size=20G
innodb_max_dirty_pages_pct=90
innodb_max_dirty_pages_pct_lwm=10
max_connections=1000
datadir=/mnt/data

HammerDB
========
$ hammerdbcli -h
HammerDB CLI v4.2
Copyright (C) 2003-2021 Steve Shaw
Type "help" for a list of commands
Usage: hammerdbcli [ auto [ script_to_autoload.tcl  ] ]

$ cat prep_tpcc.tcl
dbset db maria
diset connection maria_socket /var/lib/mysql/mysql.sock
diset tpcc maria_count_ware 1200
diset tpcc maria_num_vu 32
diset tpcc maria_partition true
buildschema
waittocomplete
quit

$ cat run_tpcc.tcl
dbset db maria
diset connection maria_socket /var/lib/mysql/mysql.sock
diset tpcc maria_total_iterations 20000000
diset tpcc maria_driver timed
diset tpcc maria_rampup 10
diset tpcc maria_duration 30
diset tpcc maria_allwarehouse true
vuset logtotemp 1
vuset unique 1
loadscript
vuset vu <100, 400, 600>
vucreate
vurun
runtimer 3000
vudestroy

Results
=======
Comparing the patched with the baseline kernel, MariaDB achieved 95%
CIs [5.24, 10.71]% and [20.22, 25.97]% more TPM, respectively, under
the medium- and high-concurrency conditions when slightly
overcommitting memory. There were no statistically significant
changes in TPM under other conditions.

+--------------------+-----------------------+-----------------------+
| Mean TPM [95% CI]  | Underutilizing memory | Overcommitting memory |
+--------------------+-----------------------+-----------------------+
| Low concurrency    | 270811.6 / 271522.7   | 447933.4 / 447283.3   |
|                    | [-40.97, 1463.17]     | [-1330.61, 30.41]     |
+--------------------+-----------------------+-----------------------+
| Medium concurrency | 240212.9 / 242846.7   | 327276.6 / 353372.7   |
|                    | [-2611.38, 7878.98]   | [17149.01, 35043.19]  |
+--------------------+-----------------------+-----------------------+
| High concurrency   | 283897.8 / 283668.1   | 274069.7 / 337366.8   |
|                    | [-11538.08, 11078.68] | [55417.42, 71176.78]  |
+--------------------+-----------------------+-----------------------+
Table 1. Comparison between the baseline and patched kernels
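
As a sanity check, the percent CIs quoted above follow from dividing
the absolute CIs by the baseline mean, as the R script in the appendix
does; for the medium-concurrency/overcommitting cell:

  $ awk 'BEGIN { printf "[%.2f, %.2f]%%\n",
         17149.01 * 100 / 327276.6, 35043.19 * 100 / 327276.6 }'
  [5.24, 10.71]%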

Comparing overcommitting with underutilizing memory, MariaDB achieved
95% CIs [65.12, 65.68]% and [32.45, 40.04]% more TPM, respectively,
under the low- and medium-concurrency conditions when using the
baseline kernel; 95% CIs [64.48, 64.98]%, [43.53, 47.50]% and [16.48,
21.38]% more TPM, respectively, under the low-, medium- and
high-concurrency conditions when using the patched kernel. There were
no statistically significant changes in TPM under other conditions.

+--------------------+------------------------+----------------------+
| Mean TPM [95% CI]  | Baseline kernel        | Patched kernel       |
+--------------------+------------------------+----------------------+
| Low concurrency    | 270811.6 / 447933.4    | 271522.7 / 447283.3  |
|                    | [176362.0, 177881.6]   | [175089.3, 176431.9] |
+--------------------+------------------------+----------------------+
| Medium concurrency | 240212.9 / 327276.6    | 242846.7 / 353372.7  |
|                    | [77946.4, 96181.0]     | [105707.7, 115344.3] |
+--------------------+------------------------+----------------------+
| High concurrency   | 283897.8 / 274069.7    | 283668.1 / 337366.8  |
|                    | [-21605.703, 1949.503] | [46758.85, 60638.55] |
+--------------------+------------------------+----------------------+
Table 2. Comparison between underutilizing and overcommitting memory

Metrics collected during each run are available at
https://github.com/ediworks/KernelPerf/tree/master/mglru/mariadb/5.14

References
==========
HammerDB v4.2 New Features:
https://www.hammerdb.com/blog/uncategorized/hammerdb-v4-2-new-features-pt1-mariadb-build-and-test-example-with-the-cli/

Appendix
========
$ cat raw_data.r
v <- c(
# baseline 50g 100vu
269531,270113,270256,270367,270393,270630,270707,271373,272291,272455,
# baseline 50g 400vu
231856,234985,235144,235552,238551,239994,244413,245255,247997,248382,
# baseline 50g 600vu
256365,271733,275966,280623,281014,283764,293327,296750,298728,300708,
# baseline 60g 100vu
446973,447383,447412,447489,447874,448046,448123,448531,448739,448764,
# baseline 60g 400vu
312427,312936,313780,321503,329554,330551,332377,333584,337105,348949,
# baseline 60g 600vu
262338,262971,266242,266489,268036,272494,279045,281472,289942,291668,
# patched 50g 100vu
270621,270913,271026,271137,271517,271616,271699,272117,272218,272363,
# patched 50g 400vu
233314,238265,238722,240540,241676,245204,245688,247440,248417,249201,
# patched 50g 600vu
271114,271928,277562,279455,282074,285515,287836,288508,289451,303238,
# patched 60g 100vu
445923,446178,446837,446889,447331,447480,447823,447999,448145,448228,
# patched 60g 400vu
345705,349373,350832,351229,351758,352520,355130,355247,357762,364171,
# patched 60g 600vu
330860,334705,336001,337291,338326,338361,338970,339163,339784,340207
)

a <- array(v, dim = c(10, 3, 2, 2))

mems <- c("50g", "60g")
cons <- c("100vu", "400vu", "600vu")
kers <- c("baseline", "patched")

# baseline vs patched
for (m in 1:2) {
    for (c in 1:3) {
        r <- t.test(a[, c, m, 1], a[, c, m, 2])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
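        # a CI straddling zero means no statistically significant
        # difference at the 95% level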
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("m%d c%d: no significance", m, c)
        } else {
            s <- sprintf("m%d c%d: [%.2f, %.2f]%%", m, c, -p[2],
-p[1])
        }
        print(s)
    }
}

# 50g vs 60g
for (k in 1:2) {
    for (c in 1:3) {
        r <- t.test(a[, c, 1, k], a[, c, 2, k])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("k%d c%d: no significance", k, c)
        } else {
            s <- sprintf("k%d c%d: [%.2f, %.2f]%%", k, c, -p[2], -p[1])
        }
        print(s)
    }
}

$ R -q -s -f raw_data.r

        Welch Two Sample t-test

data:  a[, c, m, 1] and a[, c, m, 2]
t = -2.0139, df = 15.122, p-value = 0.06217
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1463.17673    40.97673
sample estimates:
mean of x mean of y
 270811.6  271522.7

[1] "50g 100vu: no significance"

        Welch Two Sample t-test

data:  a[, c, m, 1] and a[, c, m, 2]
t = -1.0564, df = 17.673, p-value = 0.305
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7878.98  2611.38
sample estimates:
mean of x mean of y
 240212.9  242846.7

[1] "50g 400vu: no significance"

        Welch Two Sample t-test

data:  a[, c, m, 1] and a[, c, m, 2]
t = 0.043083, df = 15.895, p-value = 0.9662
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11078.68  11538.08
sample estimates:
mean of x mean of y
 283897.8  283668.1

[1] "50g 600vu: no significance"

        Welch Two Sample t-test

data:  a[, c, m, 1] and a[, c, m, 2]
t = 2.0171, df = 16.831, p-value = 0.05993
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -30.41577 1330.61577
sample estimates:
mean of x mean of y
 447933.4  447283.3

[1] "60g 100vu: no significance"

        Welch Two Sample t-test

data:  a[, c, m, 1] and a[, c, m, 2]
t = -6.3473, df = 12.132, p-value = 3.499e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -35043.19 -17149.01
sample estimates:
mean of x mean of y
 327276.6  353372.7

[1] "60g 400vu: [5.24, 10.71]%"

        Welch Two Sample t-test

data:  a[, c, m, 1] and a[, c, m, 2]
t = -17.844, df = 10.233, p-value = 4.822e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -71176.78 -55417.42
sample estimates:
mean of x mean of y
 274069.7  337366.8

[1] "60g 600vu: [20.22, 25.97]%"

        Welch Two Sample t-test

data:  a[, c, 1, k] and a[, c, 2, k]
t = -495.48, df = 15.503, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -177881.6 -176362.0
sample estimates:
mean of x mean of y
 270811.6  447933.4

[1] "baseline 100vu: [65.12, 65.68]%"

        Welch Two Sample t-test

data:  a[, c, 1, k] and a[, c, 2, k]
t = -20.601, df = 13.182, p-value = 2.062e-11
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -96181.0 -77946.4
sample estimates:
mean of x mean of y
 240212.9  327276.6

[1] "baseline 400vu: [32.45, 40.04]%"

        Welch Two Sample t-test

data:  a[, c, 1, k] and a[, c, 2, k]
t = 1.7607, df = 16.986, p-value = 0.09628
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1949.503 21605.703
sample estimates:
mean of x mean of y
 283897.8  274069.7

[1] "baseline 600vu: no significance"

        Welch Two Sample t-test

data:  a[, c, 1, k] and a[, c, 2, k]
t = -553.68, df = 16.491, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -176431.9 -175089.3
sample estimates:
mean of x mean of y
 271522.7  447283.3

[1] "patched 100vu: [64.48, 64.98]%"

        Welch Two Sample t-test

data:  a[, c, 1, k] and a[, c, 2, k]
t = -48.194, df = 17.992, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -115344.3 -105707.7
sample estimates:
mean of x mean of y
 242846.7  353372.7

[1] "patched 400vu: [43.53, 47.50]%"

        Welch Two Sample t-test

data:  a[, c, 1, k] and a[, c, 2, k]
t = -17.109, df = 10.6, p-value = 4.629e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -60638.55 -46758.85
sample estimates:
mean of x mean of y
 283668.1  337366.8

[1] "patched 600vu: [16.48, 21.38]%"

bot@edi.works Oct. 21, 2021, 7:41 p.m. UTC | #2
Kernel / Memcached benchmark with MGLRU

TLDR
====
With the MGLRU, Memcached achieved 95% CIs [23.54, 32.25]%, [20.76,
41.61]%, [13.85, 15.97]%, [21.59, 30.02]% and [23.94, 29.92]% more
operations per second (OPS), respectively, for sequential access w/
THP=always, random access w/ THP=always, random access w/ THP=never,
Gaussian access w/ THP=always and Gaussian access w/ THP=never. There
were no statistically significant changes in OPS for sequential
access w/ THP=never.

Background
==========
Memory overcommit can increase utilization and, if carried out
properly, can also increase throughput. The challenges are to improve
working set estimation and to optimize page reclaim. The risks are
performance degradations and OOM kills. Short of overcoming the
challenges, the only way to reduce the risks is to underutilize
memory.

Memcached is one of the most popular open-source in-memory KV stores.
memtier_benchmark is the leading open-source KV store benchmarking
software that supports multiple access patterns. THP can have a
negative effect under memory pressure, due to internal and/or
external fragmentation.

Matrix
======
Kernels: version [+ patchset]
* Baseline: 5.14
* Patched: 5.14 + MGLRU

Memory conditions: % of memory size
* Underutilizing: N/A
* Overcommitting: ~10% swapped out (zram)

THP (2MB Transparent Huge Pages):
* Always
* Never

Read patterns (2kB objects):
* Parallel sequential
* Uniform random
* Gaussian (SD = 1/6 of key range)

Total configurations: 12
Data points per configuration: 10
Total run duration (minutes) per data point: ~20

Note that the goal of this benchmark is to compare the performance
for the same key range, object size, and hit ratio. Since Memcached
does not support backing storage, underutilizing memory would require
fewer in-memory objects, which would lower the hit ratio; that
condition is therefore not applicable in this case.

Procedure
=========
The latest MGLRU patchset for the 5.14 kernel is available at
git fetch https://linux-mm.googlesource.com/page-reclaim \
    refs/changes/30/1430/1

Baseline and patched 5.14 kernel images are available at
https://drive.google.com/drive/folders/1eMkQleAFGkP2vzM_JyRA21oKE0ESHBqp

<install and configure OS>

<for each kernel>
    grub2-set-default <baseline, patched>
    <for each THP setting>
        echo <always, never> > \
            /sys/kernel/mm/transparent_hugepage/enabled
        <update /etc/sysconfig/memcached>
        <for each access pattern>
            <update run_memtier.sh>
            <for each data point>
                reboot
                run_memtier.sh
                <collect OPS>

Hardware
========
Memory (GB): 64
CPU (total #): 32
NVMe SSD (GB): 1024

OS
==
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)

$ cat /proc/swaps
Filename          Type          Size          Used     Priority
/dev/zram0        partition     8388604       0        -2

$ cat /proc/cmdline
<existing parameters> systemd.unified_cgroup_hierarchy=1

$ cat /sys/fs/cgroup/user.slice/memory.min
4294967296

$ cat /proc/sys/vm/overcommit_memory
1

Memcached
=========
$ memcached -V
memcached 1.5.22

$ cat /etc/sysconfig/memcached
USER="memcached"
MAXCONN="10000"
CACHESIZE="65536"
OPTIONS="-s /tmp/memcached.sock -a 0766 -t 16 -b 10000 -B binary <-L>"

memtier_benchmark
=================
$ memtier_benchmark -v
memtier_benchmark 1.3.0
Copyright (C) 2011-2020 Redis Labs Ltd.
This is free software.  You may redistribute copies of it under the
terms of the GNU General Public License
<http://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.

$ cat run_memtier.sh
# load objects
memtier_benchmark -S /tmp/memcached.sock -P memcache_binary \
    -n allkeys -c 1 -t 16 --ratio 1:0 --pipeline 1 -d 2000 \
    --key-minimum=1 --key-maximum=30000000 --key-pattern=P:P

# run benchmark
memtier_benchmark -S /tmp/memcached.sock -P memcache_binary \
    -n 30000000 -c 1 -t 16 --ratio 0:1 --pipeline 1 --randomize \
    --distinct-client-seed --key-minimum=1 --key-maximum=30000000 \
    --key-pattern=<P:P, R:R, G:G>

Results
=======
Comparing the patched with the baseline kernel, Memcached achieved
95% CIs [23.54, 32.25]%, [20.76, 41.61]%, [13.85, 15.97]%, [21.59,
30.02]% and [23.94, 29.92]% more OPS, respectively, for sequential
access w/ THP=always, random access w/ THP=always, random access w/
THP=never, Gaussian access w/ THP=always and Gaussian access w/
THP=never. There were no statistically significant changes in OPS for
sequential access w/ THP=never.

+-------------------+-----------------------+------------------------+
| Mean OPS [95% CI] | THP=always            | THP=never              |
+-------------------+-----------------------+------------------------+
| Sequential access | 519599.7 / 664543.2   | 525394.8 / 527170.6    |
|                   | [122297.9, 167589.0]  | [-15138.63, 18690.31]  |
+-------------------+-----------------------+------------------------+
| Random access     | 450033.2 / 590360.7   | 509237.3 / 585142.4    |
|                   | [93415.59, 187239.37] | [70504.51, 81305.60]   |
+-------------------+-----------------------+------------------------+
| Gaussian access   | 481182.4 / 605358.7   | 531270.8 / 674341.4    |
|                   | [103892.6, 144460.0]  | [127199.8, 158941.2]   |
+-------------------+-----------------------+------------------------+
Table 1. Comparison between the baseline and patched kernels

Comparing THP=never with THP=always, Memcached achieved 95% CIs
[2.73, 23.58]% and [5.45, 15.37]% more OPS, respectively, for random
access and Gaussian access when using the baseline kernel; 95% CIs
[-22.65, -18.69]% and [10.67, 12.12]% more OPS, respectively, for
sequential access and Gaussian access when using the patched kernel.
There were no statistically significant changes in OPS under other
conditions.

+-------------------+-----------------------+------------------------+
| Mean OPS [95% CI] | Baseline kernel       |  Patched kernel        |
+-------------------+-----------------------+------------------------+
| Sequential access | 519599.7 / 525394.8   | 664543.2 / 527170.6    |
|                   | [-18739.71, 30329.80] | [-150551.0, -124194.1] |
+-------------------+-----------------------+------------------------+
| Random access     | 450033.2 / 509237.3   | 590360.7 / 585142.4    |
|                   | [12303.49, 106104.69] | [-10816.15, 379.48]    |
+-------------------+-----------------------+------------------------+
| Gaussian access   | 481182.4 / 531270.8   | 605358.7 / 674341.4    |
|                   | [26229.02, 73947.84]  | [64570.58, 73394.70]   |
+-------------------+-----------------------+------------------------+
Table 2. Comparison between THP=always and THP=never

Metrics collected during each run are available at
https://github.com/ediworks/KernelPerf/tree/master/mglru/memcached/5.14

References
==========
memtier_benchmark: A High-Throughput Benchmarking Tool for Redis &
Memcached
https://redis.com/blog/memtier_benchmark-a-high-throughput-benchmarking-tool-for-redis-memcached/

Appendix
========
$ cat raw_data.r
v <- c(
    # baseline THP=always sequential
    460266.29, 466497.70, 516145.38, 523474.39, 528507.72, 529481.86, 533867.92, 537028.56, 546027.45, 554699.89,
    # baseline THP=always random
    371470.66, 378967.63, 381137.01, 385205.60, 449100.72, 474670.76, 490470.46, 513341.53, 525159.49, 530808.55,
    # baseline THP=always Gaussian
    455674.14, 457089.50, 460001.46, 463269.94, 468283.00, 474169.61, 477684.67, 506331.96, 507875.30, 541444.54,
    # baseline THP=never sequential
    501887.04, 507303.10, 509573.54, 515222.79, 517429.04, 530805.74, 536490.44, 538088.45, 540459.92, 556687.57,
    # baseline THP=never random
    496489.97, 506444.42, 508002.80, 508707.39, 509746.28, 511157.58, 511897.57, 511926.06, 512652.28, 515348.95,
    # baseline THP=never Gaussian
    493199.15, 504207.48, 518781.40, 520536.21, 528619.45, 540677.91, 544365.57, 551698.32, 554046.80, 556576.14,
    # patched THP=always sequential
    660711.43, 660936.88, 661275.57, 662540.65, 663417.25, 665546.99, 665680.49, 667564.03, 668555.96, 669202.36,
    # patched THP=always random
    582574.69, 583714.04, 587102.54, 587375.85, 588997.85, 589052.96, 593922.17, 594722.98, 596178.28, 599965.83,
    # patched THP=always Gaussian
    601707.98, 602055.03, 603020.28, 603335.93, 604519.55, 605086.48, 607405.59, 607570.79, 609009.54, 609875.98,
    # patched THP=never sequential
    507753.56, 509462.65, 509964.30, 510369.66, 515001.36, 531685.00, 543709.22, 545142.98, 548392.56, 550224.74,
    # patched THP=never random
    571017.21, 579705.57, 582801.51, 584475.82, 586247.73, 587209.97, 587354.87, 588661.14, 591237.23, 592712.76,
    # patched THP=never Gaussian
    666403.77, 669691.68, 670248.43, 672190.97, 672466.43, 674320.42, 674897.72, 677282.76, 678886.51, 687024.85
)

a <- array(v, dim = c(10, 3, 2, 2))

# baseline vs patched
for (thp in 1:2) {
    for (pattern in 1:3) {
        r <- t.test(a[, pattern, thp, 1], a[, pattern, thp, 2])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("thp%d pattern%d: no significance", thp, pattern)
        } else {
            s <- sprintf("thp%d pattern%d: [%.2f, %.2f]%%", thp, pattern, -p[2], -p[1])
        }
        print(s)
    }
}

# THP=always vs THP=never
for (kernel in 1:2) {
    for (pattern in 1:3) {
        r <- t.test(a[, pattern, 1, kernel], a[, pattern, 2, kernel])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("kernel%d pattern%d: no significance", kernel, pattern)
        } else {
            s <- sprintf("kernel%d pattern%d: [%.2f, %.2f]%%", kernel, pattern, -p[2], -p[1])
        }
        print(s)
    }
}

$ R -q -s -f raw_data.r

        Welch Two Sample t-test

data:  a[, pattern, thp, 1] and a[, pattern, thp, 2]
t = -14.434, df = 9.1861, p-value = 1.269e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -167589.0 -122297.9
sample estimates:
mean of x mean of y
 519599.7  664543.2

[1] "thp1 pattern1: [23.54, 32.25]%"

        Welch Two Sample t-test

data:  a[, pattern, thp, 1] and a[, pattern, thp, 2]
t = -6.7518, df = 9.1333, p-value = 7.785e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -187239.37  -93415.59
sample estimates:
mean of x mean of y
 450033.2  590360.7

[1] "thp1 pattern2: [20.76, 41.61]%"

        Welch Two Sample t-test

data:  a[, pattern, thp, 1] and a[, pattern, thp, 2]
t = -13.805, df = 9.1933, p-value = 1.866e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -144460.0 -103892.6
sample estimates:
mean of x mean of y
 481182.4  605358.7

[1] "thp1 pattern3: [21.59, 30.02]%"

        Welch Two Sample t-test

data:  a[, pattern, thp, 1] and a[, pattern, thp, 2]
t = -0.22059, df = 17.979, p-value = 0.8279
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -18690.31  15138.63
sample estimates:
mean of x mean of y
 525394.8  527170.6

[1] "thp2 pattern1: no significance"

        Welch Two Sample t-test

data:  a[, pattern, thp, 1] and a[, pattern, thp, 2]
t = -29.606, df = 17.368, p-value = 2.611e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -81305.60 -70504.51
sample estimates:
mean of x mean of y
 509237.3  585142.4

[1] "thp2 pattern2: [13.85, 15.97]%"

        Welch Two Sample t-test

data:  a[, pattern, thp, 1] and a[, pattern, thp, 2]
t = -20.02, df = 10.251, p-value = 1.492e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -158941.2 -127199.8
sample estimates:
mean of x mean of y
 531270.8  674341.4

[1] "thp2 pattern3: [23.94, 29.92]%"

        Welch Two Sample t-test

data:  a[, pattern, 1, kernel] and a[, pattern, 2, kernel]
t = -0.50612, df = 14.14, p-value = 0.6206
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -30329.80  18739.71
sample estimates:
mean of x mean of y
 519599.7  525394.8

[1] "kernel1 pattern1: no significance"

        Welch Two Sample t-test

data:  a[, pattern, 1, kernel] and a[, pattern, 2, kernel]
t = -2.8503, df = 9.1116, p-value = 0.01885
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -106104.69  -12303.49
sample estimates:
mean of x mean of y
 450033.2  509237.3

[1] "kernel1 pattern2: [2.73, 23.58]%"

        Welch Two Sample t-test

data:  a[, pattern, 1, kernel] and a[, pattern, 2, kernel]
t = -4.4308, df = 16.918, p-value = 0.0003701
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -73947.84 -26229.02
sample estimates:
mean of x mean of y
 481182.4  531270.8

[1] "kernel1 pattern3: [5.45, 15.37]%"

        Welch Two Sample t-test

data:  a[, pattern, 1, kernel] and a[, pattern, 2, kernel]
t = 23.374, df = 9.5538, p-value = 9.402e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 124194.1 150551.0
sample estimates:
mean of x mean of y
 664543.2  527170.6

[1] "kernel2 pattern1: [-22.65, -18.69]%"

        Welch Two Sample t-test

data:  a[, pattern, 1, kernel] and a[, pattern, 2, kernel]
t = 1.96, df = 17.806, p-value = 0.06583
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -379.4756 10816.1516
sample estimates:
mean of x mean of y
 590360.7  585142.4

[1] "kernel2 pattern2: no significance"

        Welch Two Sample t-test

data:  a[, pattern, 1, kernel] and a[, pattern, 2, kernel]
t = -33.687, df = 13.354, p-value = 2.614e-14
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -73394.70 -64570.58
sample estimates:
mean of x mean of y
 605358.7  674341.4

[1] "kernel2 pattern3: [10.67, 12.12]%"

bot@edi.works Nov. 2, 2021, 12:20 a.m. UTC | #3
Kernel / Apache Spark benchmark with MGLRU

TLDR
====
With the MGLRU, Apache Spark took 95% CIs [9.28, 11.19]% and [12.20,
14.93]% less wall time to sort 3 billion random integers,
respectively, under the medium- and high-concurrency conditions when
slightly overcommitting memory. There were no statistically
significant changes in wall time when sorting the same dataset under
other conditions.

Background
==========
Memory overcommit can increase utilization and, if carried out
properly, can also increase throughput. The challenges are to improve
working set estimation and to optimize page reclaim. The risks are
performance degradations and OOM kills. Short of overcoming the
challenges, the only way to reduce the risks is to underutilize
memory.

Apache Spark is one of the most popular open-source big-data
frameworks. Dataset sorting is the most widely used benchmark for
such frameworks.

Matrix
======
Kernels: version [+ patchset]
* Baseline: 5.14
* Patched: 5.14 + MGLRU

Memory conditions: % of memory size
* Underutilizing: ~10% on inactive file list
* Overcommitting: ~10% swapped out

Concurrency conditions: average # of workers per CPU
* Low: 1
* Medium: 2
* High: 3

Cluster mode: local
Dataset size: 3 billion random integers (57GB text)

Total configurations: 12
Data points per configuration: 10
Total run duration (minutes) per data point: ~20

Procedure
=========
The latest MGLRU patchset for the 5.14 kernel is available at
git fetch https://linux-mm.googlesource.com/page-reclaim \
    refs/changes/30/1430/1

Baseline and patched 5.14 kernel images are available at
https://drive.google.com/drive/folders/1eMkQleAFGkP2vzM_JyRA21oKE0ESHBqp

<install and configure OS>
spark-shell < gen.scala

<for each kernel>
    grub2-set-default <baseline, patched>
    <for each memory condition>
        <update run_spark.sh>
        <for each concurrency condition>
            <update run_spark.sh>
            <for each data point>
                reboot
                run_spark.sh
                <collect wall time>

Hardware
========
Memory (GB): 64
CPU (total #): 32
NVMe SSD (GB): 1024

OS
==
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)

$ cat /proc/swaps
Filename          Type          Size          Used     Priority
/dev/nvme0n1p3    partition     32970748      0        -2

$ cat /proc/cmdline
<existing parameters> systemd.unified_cgroup_hierarchy=1

$ cat /sys/fs/cgroup/user.slice/memory.min
4294967296

$ cat /proc/sys/vm/overcommit_memory
1

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

Apache Spark
============
$ spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 11.0.12
Branch HEAD
Compiled by user centos on 2021-05-24T04:27:48Z
Revision de351e30a90dd988b133b3d00fa6218bfcaba8b8
Url https://github.com/apache/spark
Type --help for more information.

$ cat gen.scala
import java.io._
import scala.collection.mutable.ArrayBuffer

object GenData {
    def main(args: Array[String]): Unit = {
        val file = new File("dataset.txt")
        val writer = new BufferedWriter(new FileWriter(file))
        val buf = ArrayBuffer(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)
        for(_ <- 0 until 300000000) {
            for (i <- 0 until 10) {
                buf.update(i, scala.util.Random.nextLong())
            }
            writer.write(s"${buf.mkString(",")}\n")
        }
        writer.close()
    }
}
GenData.main(Array())

$ cat sort.scala
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.SparkSession

object SparkSort {
    def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().getOrCreate()
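        // sc is the SparkContext that spark-shell predefines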
        val file = sc.textFile("dataset.txt", 32)
        val start = java.time.Instant.now()
        val results = file.flatMap(_.split(",")).map(x => (x, 1)).sortByKey().takeOrdered(10)
        val finish = java.time.Instant.now()
        println(s"wall time: ${ChronoUnit.SECONDS.between(start, finish)}")
        results.foreach(println)
        spark.stop()
    }
}
SparkSort.main(Array())

$ cat run_spark.sh
spark-shell --master local\[<32, 64, 96>\] --driver-memory <52G, 62G> < sort.scala

Results
=======
Comparing the patched with the baseline kernel, Apache Spark took 95%
CIs [9.28, 11.19]% and [12.20, 14.93]% less wall time to sort the
dataset, respectively, under the medium- and high-concurrency
conditions when slightly overcommitting memory. There were no
statistically significant changes in wall time under other conditions.

+--------------------+-----------------------+-----------------------+
| Mean wall time (s) | Underutilizing memory | Overcommitting memory |
| [95% CI]           |                       |                       |
+--------------------+-----------------------+-----------------------+
| Low concurrency    | 1037.1 / 1037.0       | 1038.2 / 1036.6       |
|                    | [-1.41, 1.21]         | [-3.67, 0.47]         |
+--------------------+-----------------------+-----------------------+
| Medium concurrency | 1141.8 / 1142.6       | 1297.9 / 1165.1       |
|                    | [-1.35, 2.95]         | [-145.21, -120.38]    |
+--------------------+-----------------------+-----------------------+
| High concurrency   | 1239.3 / 1236.4       | 1456.8 / 1259.2       |
|                    | [-7.81, 2.01]         | [-217.53, -177.66]    |
+--------------------+-----------------------+-----------------------+
Table 1. Comparison between the baseline and patched kernels

Comparing overcommitting with underutilizing memory, Apache Spark
took 95% CIs [12.58, 14.76]% and [15.95, 19.15]% more wall time to
sort the dataset, respectively, under the low- and medium-concurrency
conditions when using the baseline kernel; 95% CIs [1.78, 2.16]% and
[1.42, 2.27]% more wall time, respectively, under the medium- and
high-concurrency conditions when using the patched kernel. There were
no statistically significant changes in wall time under other
conditions.

+--------------------+------------------------+----------------------+
| Mean wall time (s) | Baseline kernel        | Patched kernel       |
| [95% CI]           |                        |                      |
+--------------------+------------------------+----------------------+
| Low concurrency    | 1037.1 / 1038.2        | 1037.0 / 1036.6      |
|                    | [-0.31, 2.51]          | [-2.43, 1.63]        |
+--------------------+------------------------+----------------------+
| Medium concurrency | 1141.8 / 1297.9        | 1142.6 / 1165.1      |
|                    | [143.68, 168.51]       | [20.33, 24.66]       |
+--------------------+------------------------+----------------------+
| High concurrency   | 1239.3 / 1456.8        | 1236.4 / 1259.2      |
|                    | [197.62, 237.37]       | [17.55, 28.04]       |
+--------------------+------------------------+----------------------+
Table 2. Comparison between underutilizing and overcommitting memory

Metrics collected during each run are available at
https://github.com/ediworks/KernelPerf/tree/master/mglru/spark/5.14

Appendix
========
$ cat raw_data_spark.r
v <- c(
    # baseline 52g 32t
    1034, 1036, 1036, 1037, 1037, 1037, 1038, 1038, 1038, 1040,
    # baseline 52g 64t
    1139, 1139, 1140, 1140, 1142, 1143, 1143, 1144, 1144, 1144,
    # baseline 52g 96t
    1236, 1237, 1238, 1238, 1238, 1239, 1240, 1241, 1243, 1243,
    # baseline 62g 32t
    1036, 1036, 1038, 1038, 1038, 1038, 1039, 1039, 1040, 1040,
    # baseline 62g 64t
    1266, 1277, 1284, 1296, 1299, 1302, 1311, 1313, 1314, 1317,
    # baseline 62g 96t
    1403, 1431, 1440, 1447, 1460, 1461, 1467, 1475, 1487, 1497,
    # patched 52g 32t
    1035, 1036, 1036, 1037, 1037, 1037, 1037, 1038, 1038, 1039,
    # patched 52g 64t
    1138, 1140, 1140, 1143, 1143, 1143, 1144, 1145, 1145, 1145,
    # patched 52g 96t
    1228, 1228, 1233, 1234, 1235, 1236, 1236, 1240, 1246, 1248,
    # patched 62g 32t
    1032, 1035, 1035, 1035, 1036, 1036, 1037, 1039, 1040, 1041,
    # patched 62g 64t
    1162, 1164, 1164, 1164, 1164, 1164, 1166, 1166, 1168, 1169,
    # patched 62g 96t
    1252, 1256, 1256, 1258, 1260, 1260, 1260, 1260, 1265, 1265
)

a <- array(v, dim = c(10, 3, 2, 2))

# baseline vs patched
for (mem in 1:2) {
    for (con in 1:3) {
        r <- t.test(a[, con, mem, 1], a[, con, mem, 2])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("mem%d con%d: no significance", mem, con)
        } else {
            s <- sprintf("mem%d con%d: [%.2f, %.2f]%%", mem, con, -p[2], -p[1])
        }
        print(s)
    }
}

# 52g vs 62g
for (ker in 1:2) {
    for (con in 1:3) {
        r <- t.test(a[, con, 1, ker], a[, con, 2, ker])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("ker%d con%d: no significance", ker, con)
        } else {
            s <- sprintf("ker%d con%d: [%.2f, %.2f]%%", ker, con, -p[2], -p[1])
        }
        print(s)
    }
}

$ R -q -s -f raw_data_spark.r

        Welch Two Sample t-test

data:  a[, con, mem, 1] and a[, con, mem, 2]
t = 0.16059, df = 16.4, p-value = 0.8744
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.21749  1.41749
sample estimates:
mean of x mean of y
   1037.1    1037.0

[1] "mem1 con1: no significance"

        Welch Two Sample t-test

data:  a[, con, mem, 1] and a[, con, mem, 2]
t = -0.78279, df = 17.565, p-value = 0.4442
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.950923  1.350923
sample estimates:
mean of x mean of y
   1141.8    1142.6

[1] "mem1 con2: no significance"

        Welch Two Sample t-test

data:  a[, con, mem, 1] and a[, con, mem, 2]
t = 1.2933, df = 11.303, p-value = 0.2217
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.019103  7.819103
sample estimates:
mean of x mean of y
   1239.3    1236.4

[1] "mem1 con3: no significance"

        Welch Two Sample t-test

data:  a[, con, mem, 1] and a[, con, mem, 2]
t = 1.6562, df = 13.458, p-value = 0.1208
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.4799188  3.6799188
sample estimates:
mean of x mean of y
   1038.2    1036.6

[1] "mem2 con1: no significance"

        Welch Two Sample t-test

data:  a[, con, mem, 1] and a[, con, mem, 2]
t = 24.096, df = 9.2733, p-value = 1.115e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 120.3881 145.2119
sample estimates:
mean of x mean of y
   1297.9    1165.1

[1] "mem2 con2: [-11.19, -9.28]%"

        Welch Two Sample t-test

data:  a[, con, mem, 1] and a[, con, mem, 2]
t = 22.289, df = 9.3728, p-value = 1.944e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 177.6666 217.5334
sample estimates:
mean of x mean of y
   1456.8    1259.2

[1] "mem2 con3: [-14.93, -12.20]%"

        Welch Two Sample t-test

data:  a[, con, 1, ker] and a[, con, 2, ker]
t = -1.6398, df = 17.697, p-value = 0.1187
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.5110734  0.3110734
sample estimates:
mean of x mean of y
   1037.1    1038.2

[1] "ker1 con1: no significance"

        Welch Two Sample t-test

data:  a[, con, 1, ker] and a[, con, 2, ker]
t = -28.33, df = 9.2646, p-value = 2.57e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -168.5106 -143.6894
sample estimates:
mean of x mean of y
   1141.8    1297.9

[1] "ker1 con2: [12.58, 14.76]%"

        Welch Two Sample t-test

data:  a[, con, 1, ker] and a[, con, 2, ker]
t = -24.694, df = 9.1353, p-value = 1.12e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -237.3794 -197.6206
sample estimates:
mean of x mean of y
   1239.3    1456.8

[1] "ker1 con3: [15.95, 19.15]%"

        Welch Two Sample t-test

data:  a[, con, 1, ker] and a[, con, 2, ker]
t = 0.42857, df = 12.15, p-value = 0.6757
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.630775  2.430775
sample estimates:
mean of x mean of y
   1037.0    1036.6

[1] "ker2 con1: no significance"

        Welch Two Sample t-test

data:  a[, con, 1, ker] and a[, con, 2, ker]
t = -21.865, df = 17.646, p-value = 3.151e-14
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -24.66501 -20.33499
sample estimates:
mean of x mean of y
   1142.6    1165.1

[1] "ker2 con2: [1.78, 2.16]%"

        Welch Two Sample t-test

data:  a[, con, 1, ker] and a[, con, 2, ker]
t = -9.2738, df = 14.72, p-value = 1.561e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -28.04897 -17.55103
sample estimates:
mean of x mean of y
   1236.4    1259.2

[1] "ker2 con3: [1.42, 2.27]%"

bot@edi.works Nov. 9, 2021, 2:13 a.m. UTC | #4
Kernel / MongoDB benchmark with MGLRU

TLDR
====
With the MGLRU, MongoDB achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]%
and [2.16, 3.55]% more operations per second (OPS), respectively, for
exponential (distribution) access, random access and Zipfian access
when underutilizing memory; 95% CIs [8.83, 10.03]%, [21.12, 23.14]%
and [5.53, 6.46]% more OPS, respectively, for exponential access,
random access and Zipfian access when slightly overcommitting memory.

Background
==========
Memory overcommit can increase utilization and, if carried out
properly, can also increase throughput. The challenges are to improve
working set estimation and to optimize page reclaim. The risks are
performance degradation and OOM kills. Short of overcoming the
challenges, the only way to reduce the risks is to underutilize
memory.

MongoDB is one of the most popular open-source NoSQL databases. YCSB
is the leading open-source NoSQL database benchmarking software that
supports multiple access distributions.

Matrix
======
Kernels: version [+ patchset]
* Baseline: 5.14
* Patched: 5.14 + MGLRU

Memory utilization: % of memory size
* Underutilizing: ~15% on inactive file list
* Overcommitting: ~5% swapped out

Concurrency: average # of users per CPU
* Medium: 2

Access distributions (1kB objects, 20% update):
* Exponential
* Uniform random
* Zipfian

Total configurations: 12
Data points per configuration: 10
Total run duration (minutes) per data point: ~20

Note that MongoDB reached its peak performance at this concurrency
level, i.e., its performance degraded with either fewer or more users
in this benchmark.

Procedure
=========
The latest MGLRU patchset for the 5.14 kernel is available at
git fetch https://linux-mm.googlesource.com/page-reclaim \
    refs/changes/30/1430/1

Baseline and patched 5.14 kernel images are available at
https://drive.google.com/drive/folders/1eMkQleAFGkP2vzM_JyRA21oKE0ESHBqp

<install and configure OS>
ycsb_load.sh
systemctl stop mongod
e2image <backup /mnt/data>

<for each kernel>
    grub2-set-default <baseline, patched>
    <for each memory utilization>
        <update /etc/mongod.conf>
        <for each access distribution>
            <update ycsb_run.sh>
            <for each data point>
                systemctl stop mongod
                e2image <restore /mnt/data>
                reboot
                ycsb_run.sh
                <collect OPS>

Hardware
========
Memory (GB): 64
CPU (total #): 32
NVMe SSD (GB): 1024

OS
==
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)

$ cat /proc/swaps
Filename          Type          Size          Used     Priority
/dev/nvme0n1p3    partition     32970748      0        -2

$ cat /proc/cmdline
<existing parameters> systemd.unified_cgroup_hierarchy=1

$ cat /sys/fs/cgroup/user.slice/memory.min
4294967296

$ cat /proc/sys/vm/overcommit_memory
1

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

MongoDB
=======
$ mongod --version
db version v5.0.3
Build Info: {
    "version": "5.0.3",
    "gitVersion": "657fea5a61a74d7a79df7aff8e4bcf0bc742b748",
    "openSSLVersion": "OpenSSL 1.1.1g FIPS  21 Apr 2020",
    "modules": [],
    "allocator": "tcmalloc",
    "environment": {
        "distmod": "rhel80",
        "distarch": "x86_64",
        "target_arch": "x86_64"
    }
}

$ cat /etc/mongod.conf
# mongod.conf
<existing parameters>

# Where and how to store data.
storage:
  dbPath: /mnt/data
  journal:
    enabled: true
  wiredTiger:
    engineConfig:
      cacheSizeGB: <50, 60>

<existing parameters>

YCSB
====
$ git log
commit ce3eb9ce51c84ee9e236998cdd2cefaeb96798a8 (HEAD -> master,
origin/master, origin/HEAD)
Author: Ivan <john.koepi@gmail.com>
Date:   Tue Feb 16 17:38:00 2021 +0200

    [scylla] enable token aware LB by default, improve the docs (#1507)

$ cat ycsb_load.sh
# load objects
ycsb load mongodb -s -threads 16 \
    -p mongodb.url=mongodb://%2Ftmp%2Fmongodb-27017.sock \
    -p workload=site.ycsb.workloads.CoreWorkload \
    -p recordcount=80000000

$ cat ycsb_run.sh
# run benchmark
ycsb run mongodb -s -threads 64 \
    -p mongodb.url=mongodb://%2Ftmp%2Fmongodb-27017.sock \
    -p workload=site.ycsb.workloads.CoreWorkload \
    -p recordcount=80000000 -p operationcount=80000000 \
    -p readproportion=0.8 -p updateproportion=0.2 \
    -p requestdistribution=<exponential, uniform, zipfian>

Results
=======
Comparing the patched with the baseline kernel, MongoDB achieved 95%
CIs [2.23, 3.44]%, [6.97, 9.73]% and [2.16, 3.55]% more OPS,
respectively, for exponential access, random access and Zipfian access
when underutilizing memory; 95% CIs [8.83, 10.03]%, [21.12, 23.14]%
and [5.53, 6.46]% more OPS, respectively, for exponential access,
random access and Zipfian access when slightly overcommitting memory.

+--------------------+-----------------------+-----------------------+
| Mean OPS [95% CI]  | Underutilizing memory | Overcommitting memory |
+--------------------+-----------------------+-----------------------+
| Exponential access | 76615.56 / 78788.76   | 73984.90 / 80961.66   |
|                    | [1708.76, 2637.62]    | [6533.94, 7419.58]    |
+--------------------+-----------------------+-----------------------+
| Random access      | 62093.40 / 67276.01   | 55990.56 / 68379.91   |
|                    | [4324.96, 6040.25]    | [11824.09, 12954.62]  |
+--------------------+-----------------------+-----------------------+
| Zipfian access     | 92532.25 / 95174.43   | 93545.62 / 99151.12   |
|                    | [1997.20, 3287.17]    | [5171.27, 6039.72]    |
+--------------------+-----------------------+-----------------------+
Table 1. Comparison between the baseline and patched kernels

Comparing overcommitting with underutilizing memory, MongoDB achieved
95% CIs [-4.10, -2.77]%, [-11.20, -8.46]% and [0.36, 1.83]% more OPS,
respectively, for exponential access, random access and Zipfian access
when using the baseline kernel; 95% CIs [2.27, 3.25]%, [0.78, 2.50]%
and [3.81, 4.54]% more OPS, respectively, for exponential access,
random access and Zipfian access when using the patched kernel.

+--------------------+-----------------------+-----------------------+
| Mean OPS [95% CI]  | Baseline kernel       |  Patched kernel       |
+--------------------+-----------------------+-----------------------+
| Exponential access | 76615.56 / 73984.90   | 78788.76 / 80961.66   |
|                    | [-3139.12, -2122.20]  | [1786.70, 2559.09]    |
+--------------------+-----------------------+-----------------------+
| Random access      | 62093.40 / 55990.56   | 67276.01 / 68379.91   |
|                    | [-6953.44, -5252.23]  | [525.42, 1682.38]     |
+--------------------+-----------------------+-----------------------+
| Zipfian access     | 92532.25 / 93545.62   | 95174.43 / 99151.12   |
|                    | [330.99, 1695.75]     | [3628.31, 4325.06]    |
+--------------------+-----------------------+-----------------------+
Table 2. Comparison between underutilizing and overcommitting memory

Metrics collected during each run are available at
https://github.com/ediworks/KernelPerf/tree/master/mglru/mongodb/5.14

Appendix
========
$ cat raw_data_mongodb.r
v <- c(
    # baseline 50g exp
    75814.86, 75884.91, 76052.71, 76621.01, 76641.19, 76661.24, 76870.15, 77017.79, 77289.08, 77302.67,
    # baseline 50g uni
    60638.17, 60968.91, 61128.61, 61548.40, 61779.30, 61917.58, 62152.28, 63440.15, 63625.47, 63735.11,
    # baseline 50g zip
    91271.16, 91482.41, 91524.17, 92467.16, 92585.62, 92843.29, 92885.65, 93229.98, 93408.94, 93624.08,
    # baseline 60g exp
    73183.67, 73191.30, 73527.58, 73831.79, 74047.95, 74056.24, 74401.23, 74418.53, 74547.58, 74643.08,
    # baseline 60g uni
    55175.76, 55477.42, 55605.52, 55680.21, 55903.39, 56171.05, 56375.06, 56380.43, 56509.94, 56626.78,
    # baseline 60g zip
    92653.82, 92775.02, 93100.44, 93290.21, 93593.74, 93775.64, 93868.72, 93915.12, 94194.77, 94288.69,
    # patched 50g exp
    78349.95, 78385.64, 78392.33, 78419.91, 78726.59, 78738.68, 78930.72, 78948.25, 79404.38, 79591.14,
    # patched 50g uni
    66622.91, 66667.33, 66951.43, 67104.80, 67117.30, 67196.90, 67389.75, 67406.62, 68131.43, 68171.61,
    # patched 50g zip
    94261.14, 94822.34, 94914.70, 95114.89, 95156.75, 95205.90, 95383.78, 95612.00, 95624.00, 95648.81,
    # patched 60g exp
    80272.04, 80612.33, 80679.23, 80717.74, 81011.18, 81029.64, 81146.68, 81371.84, 81379.13, 81396.76,
    # patched 60g uni
    67559.52, 67600.11, 67718.90, 68062.57, 68278.78, 68446.56, 68452.82, 68853.86, 69278.34, 69547.67,
    # patched 60g zip
    98706.81, 98864.41, 98903.77, 99044.10, 99155.68, 99162.94, 99165.64, 99482.31, 99484.91, 99540.62
)

a <- array(v, dim = c(10, 3, 2, 2))

# baseline vs patched
for (mem in 1:2) {
    for (dist in 1:3) {
        r <- t.test(a[, dist, mem, 1], a[, dist, mem, 2])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("mem%d dist%d: no significance", mem, dist)
        } else {
            s <- sprintf("mem%d dist%d: [%.2f, %.2f]%%", mem, dist, -p[2], -p[1])
        }
        print(s)
    }
}

# 50g vs 60g
for (kern in 1:2) {
    for (dist in 1:3) {
        r <- t.test(a[, dist, 1, kern], a[, dist, 2, kern])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("kern%d dist%d: no significance", kern, dist)
        } else {
            s <- sprintf("kern%d dist%d: [%.2f, %.2f]%%", kern, dist, -p[2], -p[1])
        }
        print(s)
    }
}

$ R -q -s -f raw_data_mongodb.r

        Welch Two Sample t-test

data:  a[, dist, mem, 1] and a[, dist, mem, 2]
t = -9.8624, df = 17.23, p-value = 1.671e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2637.627 -1708.769
sample estimates:
mean of x mean of y
 76615.56  78788.76

[1] "mem1 dist1: [2.23, 3.44]%"

        Welch Two Sample t-test

data:  a[, dist, mem, 1] and a[, dist, mem, 2]
t = -13.081, df = 12.744, p-value = 9.287e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6040.256 -4324.964
sample estimates:
mean of x mean of y
 62093.40  67276.01

[1] "mem1 dist2: [6.97, 9.73]%"

        Welch Two Sample t-test

data:  a[, dist, mem, 1] and a[, dist, mem, 2]
t = -8.8194, df = 13.459, p-value = 5.833e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3287.17 -1997.20
sample estimates:
mean of x mean of y
 92532.25  95174.43

[1] "mem1 dist3: [2.16, 3.55]%"

        Welch Two Sample t-test

data:  a[, dist, mem, 1] and a[, dist, mem, 2]
t = -33.368, df = 16.192, p-value = 2.329e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7419.582 -6533.942
sample estimates:
mean of x mean of y
 73984.90  80961.66

[1] "mem2 dist1: [8.83, 10.03]%"

        Welch Two Sample t-test

data:  a[, dist, mem, 1] and a[, dist, mem, 2]
t = -46.386, df = 16.338, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -12954.62 -11824.09
sample estimates:
mean of x mean of y
 55990.56  68379.91

[1] "mem2 dist2: [21.12, 23.14]%"

        Welch Two Sample t-test

data:  a[, dist, mem, 1] and a[, dist, mem, 2]
t = -27.844, df = 13.209, p-value = 4.049e-13
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6039.729 -5171.275
sample estimates:
mean of x mean of y
 93545.62  99151.12

[1] "mem2 dist3: [5.53, 6.46]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 10.87, df = 18, p-value = 2.439e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 2122.207 3139.125
sample estimates:
mean of x mean of y
 76615.56  73984.90

[1] "kern1 dist1: [-4.10, -2.77]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 15.593, df = 12.276, p-value = 1.847e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 5252.237 6953.447
sample estimates:
mean of x mean of y
 62093.40  55990.56

[1] "kern1 dist2: [-11.20, -8.46]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = -3.1512, df = 15.811, p-value = 0.006252
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1695.7509  -330.9911
sample estimates:
mean of x mean of y
 92532.25  93545.62

[1] "kern1 dist3: [0.36, 1.83]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = -11.836, df = 17.672, p-value = 7.84e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2559.092 -1786.704
sample estimates:
mean of x mean of y
 78788.76  80961.66

[1] "kern2 dist1: [2.27, 3.25]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = -4.0276, df = 16.921, p-value = 0.0008807
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1682.3864  -525.4236
sample estimates:
mean of x mean of y
 67276.01  68379.91

[1] "kern2 dist2: [0.78, 2.50]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = -24.26, df = 15.517, p-value = 9.257e-14
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4325.062 -3628.314
sample estimates:
mean of x mean of y
 95174.43  99151.12

[1] "kern2 dist3: [3.81, 4.54]%"