diff mbox series

[v10,13/14] mm: multi-gen LRU: admin guide

Message ID 20220407031525.2368067-14-yuzhao@google.com (mailing list archive)
State New
Headers show
Series Multi-Gen LRU Framework | expand

Commit Message

Yu Zhao April 7, 2022, 3:15 a.m. UTC
Add an admin guide.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 Documentation/admin-guide/mm/index.rst        |   1 +
 Documentation/admin-guide/mm/multigen_lru.rst | 152 ++++++++++++++++++
 mm/Kconfig                                    |   3 +-
 3 files changed, 155 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst

Comments

Bagas Sanjaya April 7, 2022, 12:41 p.m. UTC | #1
On 07/04/22 10.15, Yu Zhao wrote:
> Add an admin guide.
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

Why this documentation be added to admin-guide?
Jonathan Corbet April 7, 2022, 12:51 p.m. UTC | #2
Bagas Sanjaya <bagasdotme@gmail.com> writes:

> On 07/04/22 10.15, Yu Zhao wrote:
>> Add an admin guide.
>
> Why this documentation be added to admin-guide?

It appears to describe various parameters that a system administrator
can tweak to affect how MGLRU works on their system.  So the admin guide
seems to be the right place.

Bagas, why are you questioning this?

Thanks,

jon
Andrew Morton April 12, 2022, 2:16 a.m. UTC | #3
On Wed,  6 Apr 2022 21:15:25 -0600 Yu Zhao <yuzhao@google.com> wrote:

> Add an admin guide.
> 
>
> ...
>
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/multigen_lru.rst
> @@ -0,0 +1,152 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============
> +Multi-Gen LRU
> +=============
> +The multi-gen LRU is an alternative LRU implementation that optimizes
> +page reclaim and improves performance under memory pressure. Page
> +reclaim decides the kernel's caching policy and ability to overcommit
> +memory. It directly impacts the kswapd CPU usage and RAM efficiency.
> +
> +Quick start
> +===========
> +Build the kernel with the following configurations.
> +
> +* ``CONFIG_LRU_GEN=y``
> +* ``CONFIG_LRU_GEN_ENABLED=y``
> +
> +All set!
> +
> +Runtime options
> +===============
> +``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
> +following subsections.
> +
> +Kill switch
> +-----------
> +``enable`` accepts different values to enable or disable the following

It's actually called "enabled".  And I suggest that the file name be
included right there in the title.  ie.

"enabled": Kill Switch
======================

> +components. Its default value depends on ``CONFIG_LRU_GEN_ENABLED``.
> +All the components should be enabled unless some of them have
> +unforeseen side effects. Writing to ``enable`` has no effect when a
> +component is not supported by the hardware, and valid values will be
> +accepted even when the main switch is off.
> +
> +====== ===============================================================
> +Values Components
> +====== ===============================================================
> +0x0001 The main switch for the multi-gen LRU.
> +0x0002 Clearing the accessed bit in leaf page table entries in large
> +       batches, when MMU sets it (e.g., on x86). This behavior can
> +       theoretically worsen lock contention (mmap_lock). If it is
> +       disabled, the multi-gen LRU will suffer a minor performance
> +       degradation.
> +0x0004 Clearing the accessed bit in non-leaf page table entries as
> +       well, when MMU sets it (e.g., on x86). This behavior was not
> +       verified on x86 varieties other than Intel and AMD. If it is
> +       disabled, the multi-gen LRU will suffer a negligible
> +       performance degradation.
> +[yYnN] Apply to all the components above.
> +====== ===============================================================
> +
> +E.g.,
> +::
> +
> +    echo y >/sys/kernel/mm/lru_gen/enabled
> +    cat /sys/kernel/mm/lru_gen/enabled
> +    0x0007
> +    echo 5 >/sys/kernel/mm/lru_gen/enabled
> +    cat /sys/kernel/mm/lru_gen/enabled
> +    0x0005
> +
> +Thrashing prevention
> +--------------------
> +Personal computers are more sensitive to thrashing because it can
> +cause janks (lags when rendering UI) and negatively impact user
> +experience. The multi-gen LRU offers thrashing prevention to the
> +majority of laptop and desktop users who do not have ``oomd``.
> +
> +Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
> +``N`` milliseconds from getting evicted. The OOM killer is triggered
> +if this working set cannot be kept in memory. In other words, this
> +option works as an adjustable pressure relief valve, and when open, it
> +terminates applications that are hopefully not being used.
> +
> +Based on the average human detectable lag (~100ms), ``N=1000`` usually
> +eliminates intolerable janks due to thrashing. Larger values like
> +``N=3000`` make janks less noticeable at the risk of premature OOM
> +kills.

This is a quite useful user guide.

> +The default value ``0`` means disabled.
> +
> +Experimental features
> +=====================
> +``/sys/kernel/debug/lru_gen`` accepts commands described in the
> +following subsections. Multiple command lines are supported, so does
> +concatenation with delimiters ``,`` and ``;``.
> +
> +``/sys/kernel/debug/lru_gen_full`` provides additional stats for
> +debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
> +evicted generations in this file.
> +
> +Working set estimation
> +----------------------
> +Working set estimation measures how much memory an application
> +requires in a given time interval, and it is usually done with little
> +impact on the performance of the application. E.g., data centers want
> +to optimize job scheduling (bin packing) to improve memory
> +utilizations. When a new job comes in, the job scheduler needs to find
> +out whether each server it manages can allocate a certain amount of
> +memory for this new job before it can pick a candidate. To do so, this
> +job scheduler needs to estimate the working sets of the existing jobs.

These various sysfs interfaces are a big deal.  Because they are so
hard to change once released.

So I think it would be good to present them to reviewers in a more
detailed fashion.  In the changelog for the patch which adds the
interface, provide full sample output and a description of all the
fields which are shown.

> +When it is read, ``lru_gen`` returns a histogram of numbers of pages
> +accessed over different time intervals for each memcg and node.
> +``MAX_NR_GENS`` decides the number of bins for each histogram.
> +::
> +
> +    memcg  memcg_id  memcg_path
> +       node  node_id
> +           min_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
> +           ...
> +           max_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
> +
> +Each generation contains an estimated number of pages that have been
> +accessed within ``age_in_ms`` non-cumulatively. E.g., ``min_gen_nr``
> +contains the coldest pages and ``max_gen_nr`` contains the hottest
> +pages, since ``age_in_ms`` of the former is the largest and that of
> +the latter is the smallest.
> +
> +Users can write ``+ memcg_id node_id max_gen_nr
> +[can_swap[full_scan]]`` to ``lru_gen`` to create a new generation
> +``max_gen_nr+1``. ``can_swap`` defaults to the swap setting and, if it
> +is set to ``1``, it forces the scan of anon pages when swap is off.
> +``full_scan`` defaults to ``1`` and, if it is set to ``0``, it reduces
> +the overhead as well as the coverage when scanning page tables.
> +
> +A typical use case is that a job scheduler writes to ``lru_gen`` at a
> +certain time interval to create new generations, and it ranks the
> +servers it manages based on the sizes of their cold memory defined by
> +this time interval.

Are you really confident that this interface will be in this same exact
form 10 years from now?  That we will never have cause to change,
extend or remove it?

If "no" then perhaps we shouldn't add it.

btw, what is this "job scheduler" of which you speak?  Is there an open
source implementation upon which we hope the world will converge?

> +Proactive reclaim
> +-----------------
> +Proactive reclaim induces memory reclaim when there is no memory
> +pressure and usually targets cold memory only. E.g., when a new job
> +comes in, the job scheduler wants to proactively reclaim memory on the
> +server it has selected to improve the chance of successfully landing
> +this new job.
> +
> +Users can write ``- memcg_id node_id min_gen_nr [swappiness
> +[nr_to_reclaim]]`` to ``lru_gen`` to evict generations less than or
> +equal to ``min_gen_nr``. Note that ``min_gen_nr`` should be less than
> +``max_gen_nr-1`` as ``max_gen_nr`` and ``max_gen_nr-1`` are not fully
> +aged and therefore cannot be evicted. ``swappiness`` overrides the
> +default value in ``/proc/sys/vm/swappiness``. ``nr_to_reclaim`` limits
> +the number of pages to evict.
> +
> +A typical use case is that a job scheduler writes to ``lru_gen``
> +before it tries to land a new job on a server, and if it fails to
> +materialize the cold memory without impacting the existing jobs on
> +this server, it retries on the next server according to the ranking
> +result obtained from the working set estimation step described
> +earlier.

It sounds to me that these interfaces were developed in response to
ongoing development and use of a particular job scheduler.

This is a very good thing, but has thought been given to the potential
needs of other job schedulers?
Yu Zhao April 16, 2022, 2:22 a.m. UTC | #4
On Mon, Apr 11, 2022 at 8:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed,  6 Apr 2022 21:15:25 -0600 Yu Zhao <yuzhao@google.com> wrote:
>
> > +Kill switch
> > +-----------
> > +``enable`` accepts different values to enable or disable the following
>
> It's actually called "enabled".

Good catch. Thanks!

> And I suggest that the file name be
> included right there in the title.  ie.
>
> "enabled": Kill Switch
> ======================

Will do.

> > +Experimental features
> > +=====================
> > +``/sys/kernel/debug/lru_gen`` accepts commands described in the
> > +following subsections. Multiple command lines are supported, so does
> > +concatenation with delimiters ``,`` and ``;``.
> > +
> > +``/sys/kernel/debug/lru_gen_full`` provides additional stats for
> > +debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
> > +evicted generations in this file.
> > +
> > +Working set estimation
> > +----------------------
> > +Working set estimation measures how much memory an application
> > +requires in a given time interval, and it is usually done with little
> > +impact on the performance of the application. E.g., data centers want
> > +to optimize job scheduling (bin packing) to improve memory
> > +utilizations. When a new job comes in, the job scheduler needs to find
> > +out whether each server it manages can allocate a certain amount of
> > +memory for this new job before it can pick a candidate. To do so, this
> > +job scheduler needs to estimate the working sets of the existing jobs.
>
> These various sysfs interfaces are a big deal.  Because they are so
> hard to change once released.

Debugfs, not sysfs. The title is "Experimental features" :)

> btw, what is this "job scheduler" of which you speak?

Basically it's part of cluster management software. Many jobs
(programs + data) can run concurrently in the same cluster and the job
scheduler of this cluster does the bin packing. To improve resource
utilization, the job scheduler needs to know the (memory) size of each
job it packs, hence the working set estimation (how much memory a job
uses within a given time interval). The job scheduler also takes
memory from some jobs so that those jobs can better fit into a single
machine (proactive reclaim).

> Is there an open
> source implementation upon which we hope the world will converge?

There are many [1], e.g., Kubernetes (k8s). Personally, I don't think
they'll ever converge.

At the moment, all open source implementations I know of rely on users
manually specifying the size of each job (job spec), e.g., [2]. Users
overprovision memory to avoid OOM kills. The average memory
utilization generally is surprisingly low. What we can hope for is
that eventually some of the open source implementations will use the
working set estimation and proactive reclaim features provided here.

[1] https://en.wikipedia.org/wiki/List_of_cluster_management_software
[2] https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

> > +Proactive reclaim
> > +-----------------
> > +Proactive reclaim induces memory reclaim when there is no memory
> > +pressure and usually targets cold memory only. E.g., when a new job
> > +comes in, the job scheduler wants to proactively reclaim memory on the
> > +server it has selected to improve the chance of successfully landing
> > +this new job.
> > +
> > +Users can write ``- memcg_id node_id min_gen_nr [swappiness
> > +[nr_to_reclaim]]`` to ``lru_gen`` to evict generations less than or
> > +equal to ``min_gen_nr``. Note that ``min_gen_nr`` should be less than
> > +``max_gen_nr-1`` as ``max_gen_nr`` and ``max_gen_nr-1`` are not fully
> > +aged and therefore cannot be evicted. ``swappiness`` overrides the
> > +default value in ``/proc/sys/vm/swappiness``. ``nr_to_reclaim`` limits
> > +the number of pages to evict.
> > +
> > +A typical use case is that a job scheduler writes to ``lru_gen``
> > +before it tries to land a new job on a server, and if it fails to
> > +materialize the cold memory without impacting the existing jobs on
> > +this server, it retries on the next server according to the ranking
> > +result obtained from the working set estimation step described
> > +earlier.
>
> It sounds to me that these interfaces were developed in response to
> ongoing development and use of a particular job scheduler.

I did borrow some of my previous experience with Google's data
centers. But I'm a Chrome OS developer now, so I designed them to be
job scheduler agnostic :)

> This is a very good thing, but has thought been given to the potential
> needs of other job schedulers?

Yes, basically I'm trying to help everybody replicate the success
stories at Google and Meta [3][4].

[3] https://dl.acm.org/doi/10.1145/3297858.3304053
[4] https://dl.acm.org/doi/10.1145/3503222.3507731
diff mbox series

Patch

diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index c21b5823f126..2cf5bae62036 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -32,6 +32,7 @@  the Linux memory management.
    idle_page_tracking
    ksm
    memory-hotplug
+   multigen_lru
    nommu-mmap
    numa_memory_policy
    numaperf
diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
new file mode 100644
index 000000000000..3d9a6ef84229
--- /dev/null
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -0,0 +1,152 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Multi-Gen LRU
+=============
+The multi-gen LRU is an alternative LRU implementation that optimizes
+page reclaim and improves performance under memory pressure. Page
+reclaim decides the kernel's caching policy and ability to overcommit
+memory. It directly impacts the kswapd CPU usage and RAM efficiency.
+
+Quick start
+===========
+Build the kernel with the following configurations.
+
+* ``CONFIG_LRU_GEN=y``
+* ``CONFIG_LRU_GEN_ENABLED=y``
+
+All set!
+
+Runtime options
+===============
+``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
+following subsections.
+
+Kill switch
+-----------
+``enable`` accepts different values to enable or disable the following
+components. Its default value depends on ``CONFIG_LRU_GEN_ENABLED``.
+All the components should be enabled unless some of them have
+unforeseen side effects. Writing to ``enable`` has no effect when a
+component is not supported by the hardware, and valid values will be
+accepted even when the main switch is off.
+
+====== ===============================================================
+Values Components
+====== ===============================================================
+0x0001 The main switch for the multi-gen LRU.
+0x0002 Clearing the accessed bit in leaf page table entries in large
+       batches, when MMU sets it (e.g., on x86). This behavior can
+       theoretically worsen lock contention (mmap_lock). If it is
+       disabled, the multi-gen LRU will suffer a minor performance
+       degradation.
+0x0004 Clearing the accessed bit in non-leaf page table entries as
+       well, when MMU sets it (e.g., on x86). This behavior was not
+       verified on x86 varieties other than Intel and AMD. If it is
+       disabled, the multi-gen LRU will suffer a negligible
+       performance degradation.
+[yYnN] Apply to all the components above.
+====== ===============================================================
+
+E.g.,
+::
+
+    echo y >/sys/kernel/mm/lru_gen/enabled
+    cat /sys/kernel/mm/lru_gen/enabled
+    0x0007
+    echo 5 >/sys/kernel/mm/lru_gen/enabled
+    cat /sys/kernel/mm/lru_gen/enabled
+    0x0005
+
+Thrashing prevention
+--------------------
+Personal computers are more sensitive to thrashing because it can
+cause janks (lags when rendering UI) and negatively impact user
+experience. The multi-gen LRU offers thrashing prevention to the
+majority of laptop and desktop users who do not have ``oomd``.
+
+Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
+``N`` milliseconds from getting evicted. The OOM killer is triggered
+if this working set cannot be kept in memory. In other words, this
+option works as an adjustable pressure relief valve, and when open, it
+terminates applications that are hopefully not being used.
+
+Based on the average human detectable lag (~100ms), ``N=1000`` usually
+eliminates intolerable janks due to thrashing. Larger values like
+``N=3000`` make janks less noticeable at the risk of premature OOM
+kills.
+
+The default value ``0`` means disabled.
+
+Experimental features
+=====================
+``/sys/kernel/debug/lru_gen`` accepts commands described in the
+following subsections. Multiple command lines are supported, so does
+concatenation with delimiters ``,`` and ``;``.
+
+``/sys/kernel/debug/lru_gen_full`` provides additional stats for
+debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
+evicted generations in this file.
+
+Working set estimation
+----------------------
+Working set estimation measures how much memory an application
+requires in a given time interval, and it is usually done with little
+impact on the performance of the application. E.g., data centers want
+to optimize job scheduling (bin packing) to improve memory
+utilizations. When a new job comes in, the job scheduler needs to find
+out whether each server it manages can allocate a certain amount of
+memory for this new job before it can pick a candidate. To do so, this
+job scheduler needs to estimate the working sets of the existing jobs.
+
+When it is read, ``lru_gen`` returns a histogram of numbers of pages
+accessed over different time intervals for each memcg and node.
+``MAX_NR_GENS`` decides the number of bins for each histogram.
+::
+
+    memcg  memcg_id  memcg_path
+       node  node_id
+           min_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
+           ...
+           max_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
+
+Each generation contains an estimated number of pages that have been
+accessed within ``age_in_ms`` non-cumulatively. E.g., ``min_gen_nr``
+contains the coldest pages and ``max_gen_nr`` contains the hottest
+pages, since ``age_in_ms`` of the former is the largest and that of
+the latter is the smallest.
+
+Users can write ``+ memcg_id node_id max_gen_nr
+[can_swap[full_scan]]`` to ``lru_gen`` to create a new generation
+``max_gen_nr+1``. ``can_swap`` defaults to the swap setting and, if it
+is set to ``1``, it forces the scan of anon pages when swap is off.
+``full_scan`` defaults to ``1`` and, if it is set to ``0``, it reduces
+the overhead as well as the coverage when scanning page tables.
+
+A typical use case is that a job scheduler writes to ``lru_gen`` at a
+certain time interval to create new generations, and it ranks the
+servers it manages based on the sizes of their cold memory defined by
+this time interval.
+
+Proactive reclaim
+-----------------
+Proactive reclaim induces memory reclaim when there is no memory
+pressure and usually targets cold memory only. E.g., when a new job
+comes in, the job scheduler wants to proactively reclaim memory on the
+server it has selected to improve the chance of successfully landing
+this new job.
+
+Users can write ``- memcg_id node_id min_gen_nr [swappiness
+[nr_to_reclaim]]`` to ``lru_gen`` to evict generations less than or
+equal to ``min_gen_nr``. Note that ``min_gen_nr`` should be less than
+``max_gen_nr-1`` as ``max_gen_nr`` and ``max_gen_nr-1`` are not fully
+aged and therefore cannot be evicted. ``swappiness`` overrides the
+default value in ``/proc/sys/vm/swappiness``. ``nr_to_reclaim`` limits
+the number of pages to evict.
+
+A typical use case is that a job scheduler writes to ``lru_gen``
+before it tries to land a new job on a server, and if it fails to
+materialize the cold memory without impacting the existing jobs on
+this server, it retries on the next server according to the ranking
+result obtained from the working set estimation step described
+earlier.
diff --git a/mm/Kconfig b/mm/Kconfig
index b55546191369..2fb9d5efcf89 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -916,7 +916,8 @@  config LRU_GEN
 	# the following options can use up the spare bits in page flags
 	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
 	help
-	  A high performance LRU implementation to overcommit memory.
+	  A high performance LRU implementation to overcommit memory. See
+	  Documentation/admin-guide/mm/multigen_lru.rst for details.
 
 config LRU_GEN_ENABLED
 	bool "Enable by default"