[v7,12/12] mm: multigenerational LRU: documentation

Message ID	20220208081902.3550911-13-yuzhao@google.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> Date: Tue, 8 Feb 2022 01:19:02 -0700 In-Reply-To: <20220208081902.3550911-1-yuzhao@google.com> Message-Id: <20220208081902.3550911-13-yuzhao@google.com> Mime-Version: 1.0 References: <20220208081902.3550911-1-yuzhao@google.com> Subject: [PATCH v7 12/12] mm: multigenerational LRU: documentation From: Yu Zhao <yuzhao@google.com> To: Andrew Morton <akpm@linux-foundation.org>, Johannes Weiner <hannes@cmpxchg.org>, Mel Gorman <mgorman@suse.de>, Michal Hocko <mhocko@kernel.org> Cc: Andi Kleen <ak@linux.intel.com>, Aneesh Kumar <aneesh.kumar@linux.ibm.com>, Barry Song <21cnbao@gmail.com>, Catalin Marinas <catalin.marinas@arm.com>, Dave Hansen <dave.hansen@linux.intel.com>, Hillf Danton <hdanton@sina.com>, Jens Axboe <axboe@kernel.dk>, Jesse Barnes <jsbarnes@google.com>, Jonathan Corbet <corbet@lwn.net>, Linus Torvalds <torvalds@linux-foundation.org>, Matthew Wilcox <willy@infradead.org>, Michael Larabel <Michael@michaellarabel.com>, Mike Rapoport <rppt@kernel.org>, Rik van Riel <riel@surriel.com>, Vlastimil Babka <vbabka@suse.cz>, Will Deacon <will@kernel.org>, Ying Huang <ying.huang@intel.com>, linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, page-reclaim@google.com, x86@kernel.org, Yu Zhao <yuzhao@google.com>, Brian Geffon <bgeffon@google.com>, Jan Alexander Steffens <heftig@archlinux.org>, Oleksandr Natalenko <oleksandr@natalenko.name>, Steven Barrett <steven@liquorix.net>, Suleiman Souhlal <suleiman@google.com>, Daniel Byrne <djbyrne@mtu.edu>, Donald Carr <d@chaos-reins.com>, " =?utf-8?q?Holger_Hoffst=C3=A4tte?= " <holger@applied-asynchrony.com>, Konstantin Kharlamov <Hi-Angel@yandex.ru>, Shuang Zhai <szhai2@cs.rochester.edu>, Sofia Trinh <sofia.trinh@edi.works> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	Multigenerational LRU Framework
  [v7,00/12] Multigenerational LRU Framework
  [v7,01/12] mm: x86, arm64: add arch_has_hw_pte_young()
  [v7,02/12] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
  [v7,03/12] mm/vmscan.c: refactor shrink_node()
  [v7,04/12] mm: multigenerational LRU: groundwork
  [v7,05/12] mm: multigenerational LRU: minimal implementation
  [v7,06/12] mm: multigenerational LRU: exploit locality in rmap
  [v7,07/12] mm: multigenerational LRU: support page table walks
  [v7,08/12] mm: multigenerational LRU: optimize multiple memcgs
  [v7,09/12] mm: multigenerational LRU: runtime switch
  [v7,10/12] mm: multigenerational LRU: thrashing prevention
  [v7,11/12] mm: multigenerational LRU: debugfs interface
  [v7,12/12] mm: multigenerational LRU: documentation

Yu Zhao Feb. 8, 2022, 8:19 a.m. UTC

Add a design doc and an admin guide.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
---
 Documentation/admin-guide/mm/index.rst        |   1 +
 Documentation/admin-guide/mm/multigen_lru.rst | 121 ++++++++++++++
 Documentation/vm/index.rst                    |   1 +
 Documentation/vm/multigen_lru.rst             | 152 ++++++++++++++++++
 4 files changed, 275 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
 create mode 100644 Documentation/vm/multigen_lru.rst

Yu Zhao Feb. 8, 2022, 8:44 a.m. UTC | #1

On Tue, Feb 08, 2022 at 01:19:02AM -0700, Yu Zhao wrote:
> Add a design doc and an admin guide.
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> ---
>  Documentation/admin-guide/mm/index.rst        |   1 +
>  Documentation/admin-guide/mm/multigen_lru.rst | 121 ++++++++++++++
>  Documentation/vm/index.rst                    |   1 +
>  Documentation/vm/multigen_lru.rst             | 152 ++++++++++++++++++
>  4 files changed, 275 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
>  create mode 100644 Documentation/vm/multigen_lru.rst

Refactored the doc into a separate patch as requested here:
https://lore.kernel.org/linux-mm/Yd73pDkMOMVHhXzu@kernel.org/

Reworked the doc as requested here:
https://lore.kernel.org/linux-mm/YdwKB3SfF7hkB9Xv@kernel.org/

<snipped>

Mike Rapoport Feb. 14, 2022, 10:28 a.m. UTC | #2

Hi,

On Tue, Feb 08, 2022 at 01:19:02AM -0700, Yu Zhao wrote:
> Add a design doc and an admin guide.
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> ---
>  Documentation/admin-guide/mm/index.rst        |   1 +
>  Documentation/admin-guide/mm/multigen_lru.rst | 121 ++++++++++++++
>  Documentation/vm/index.rst                    |   1 +
>  Documentation/vm/multigen_lru.rst             | 152 ++++++++++++++++++

Please consider splitting this patch into Documentation/admin-guide and
Documentation/vm parts.

For now I only had time to review the admin-guide part.

>  4 files changed, 275 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
>  create mode 100644 Documentation/vm/multigen_lru.rst
> 
> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
> index c21b5823f126..2cf5bae62036 100644
> --- a/Documentation/admin-guide/mm/index.rst
> +++ b/Documentation/admin-guide/mm/index.rst
> @@ -32,6 +32,7 @@ the Linux memory management.
>     idle_page_tracking
>     ksm
>     memory-hotplug
> +   multigen_lru
>     nommu-mmap
>     numa_memory_policy
>     numaperf
> diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
> new file mode 100644
> index 000000000000..16a543c8b886
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/multigen_lru.rst
> @@ -0,0 +1,121 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Multigenerational LRU
> +=====================
> +
> +Quick start
> +===========

There is no explanation why one would want to use multigenerational LRU
until the next section.

I think there should be an overview that explains why users would want to
enable multigenerational LRU. 

> +Build configurations
> +--------------------
> +:Required: Set ``CONFIG_LRU_GEN=y``.

Maybe 

	Set ``CONFIG_LRU_GEN=y`` to build kernel with multigenerational LRU

> +
> +:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to enable the
> + multigenerational LRU by default.
> +
> +Runtime configurations
> +----------------------
> +:Required: Write ``y`` to ``/sys/kernel/mm/lru_gen/enabled`` if
> + ``CONFIG_LRU_GEN_ENABLED=n``.
> +
> +This file accepts different values to enable or disable the
> +following features:

Maybe

  After multigenerational LRU is enabled, this file accepts different
  values to enable or disable the following features:

> +====== ========
> +Values Features
> +====== ========
> +0x0001 the multigenerational LRU

The multigenerational LRU what?

What will happen if I write 0x2 to this file?
Please consider splitting "enable" and "features" attributes.

> +0x0002 clear the accessed bit in leaf page table entries **in large
> +       batches**, when MMU sets it (e.g., on x86)

Is extra markup really needed here...

> +0x0004 clear the accessed bit in non-leaf page table entries **as
> +       well**, when MMU sets it (e.g., on x86)

... and here?

As for the descriptions, what is the user-visible effect of these features?
How different modes of clearing the access bit are reflected in, say, GUI
responsiveness, database TPS, or probability of OOM?

> +[yYnN] apply to all the features above
> +====== ========
> +
> +E.g.,
> +::
> +
> +    echo y >/sys/kernel/mm/lru_gen/enabled
> +    cat /sys/kernel/mm/lru_gen/enabled
> +    0x0007
> +    echo 5 >/sys/kernel/mm/lru_gen/enabled
> +    cat /sys/kernel/mm/lru_gen/enabled
> +    0x0005
> +
> +Most users should enable or disable all the features unless some of
> +them have unforeseen side effects.
> +
> +Recipes
> +=======
> +Personal computers
> +------------------
> +Personal computers are more sensitive to thrashing because it can
> +cause janks (lags when rendering UI) and negatively impact user
> +experience. The multigenerational LRU offers thrashing prevention to
> +the majority of laptop and desktop users who don't have oomd.

I'd expect something like this paragraph in overview.

> +
> +:Thrashing prevention: Write ``N`` to
> + ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> + ``N`` milliseconds from getting evicted. The OOM killer is triggered
> + if this working set can't be kept in memory. Based on the average
> + human detectable lag (~100ms), ``N=1000`` usually eliminates
> + intolerable janks due to thrashing. Larger values like ``N=3000``
> + make janks less noticeable at the risk of premature OOM kills.

> +
> +Data centers
> +------------
> +Data centers want to optimize job scheduling (bin packing) to improve
> +memory utilizations. Job schedulers need to estimate whether a server
> +can allocate a certain amount of memory for a new job, and this step
> +is known as working set estimation, which doesn't impact the existing
> +jobs running on this server. They also want to attempt freeing some
> +cold memory from the existing jobs, and this step is known as proactive
> +reclaim, which improves the chance of landing a new job successfully.

This paragraph also fits overview.

> +
> +:Optional: Increase ``CONFIG_NR_LRU_GENS`` to support more generations
> + for working set estimation and proactive reclaim.

Please add a note that this is build time option.

> +
> +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following

Is debugfs interface relevant only for datacenters? 

> + format:
> + ::
> +
> +   memcg  memcg_id  memcg_path
> +     node  node_id
> +       min_gen  birth_time  anon_size  file_size
> +       ...
> +       max_gen  birth_time  anon_size  file_size
> +
> + ``min_gen`` is the oldest generation number and ``max_gen`` is the
> + youngest generation number. ``birth_time`` is in milliseconds.

It's unclear what the birth_time reference point is. Is it milliseconds
since system start, or is it measured some other way?

> + ``anon_size`` and ``file_size`` are in pages. The youngest generation
> + represents the group of the MRU pages and the oldest generation
> + represents the group of the LRU pages. For working set estimation, a

Please spell out MRU and LRU fully.

> + job scheduler writes to this file at a certain time interval to
> + create new generations, and it ranks available servers based on the
> + sizes of their cold memory defined by this time interval. For
> + proactive reclaim, a job scheduler writes to this file before it
> + tries to land a new job, and if it fails to materialize the cold
> + memory without impacting the existing jobs, it retries on the next
> + server according to the ranking result.

Is this knob only relevant for a job scheduler? Or can it be used in other
use-cases as well?
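
To make the quoted format concrete for any kind of reader of this file, here
is a minimal Python sketch, not part of the patch. It assumes that ``memcg``
and ``node`` appear as literal keywords at the start of their lines and that
every generation line holds exactly four integers, as the format block above
suggests. The names ``Generation``, ``parse_lru_gen`` and ``pages_up_to`` and
the sample text are hypothetical. The sketch deliberately makes no assumption
about the reference point of ``birth_time``.

```python
from dataclasses import dataclass


@dataclass
class Generation:
    gen: int            # generation number (min_gen is the oldest)
    birth_time_ms: int  # in milliseconds; reference point not assumed here
    anon_pages: int     # anon_size, in pages
    file_pages: int     # file_size, in pages


def parse_lru_gen(text):
    """Return {(memcg_id, node_id): [Generation, ...]} from lru_gen-style text."""
    result = {}
    memcg_id = None
    key = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "memcg":
            # "memcg  memcg_id  memcg_path": only the id is used here.
            memcg_id = int(fields[1])
            key = None
        elif fields[0] == "node":
            key = (memcg_id, int(fields[1]))
            result[key] = []
        elif key is not None and len(fields) == 4:
            result[key].append(Generation(*(int(f) for f in fields)))
    return result


def pages_up_to(generations, gen):
    """Total anon + file pages in generations numbered <= gen."""
    return sum(g.anon_pages + g.file_pages for g in generations if g.gen <= gen)


# Made-up sample for illustration, not real kernel output.
SAMPLE = """\
memcg  1  /user.slice
  node  0
    3  12000  100  200
    4   8000   50   60
    5   3000   10   20
    6    100    1    2
"""

stats = parse_lru_gen(SAMPLE)
print(pages_up_to(stats[(1, 0)], 4))
```

Summing by generation number rather than by ``birth_time`` keeps the sketch
independent of how the timestamp is measured.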

> +
> + This file accepts commands in the following subsections. Multiple

                              ^ described

> + command lines are supported, so does concatenation with delimiters
> + ``,`` and ``;``.
> +
> + ``/sys/kernel/debug/lru_gen_full`` contains additional stats for
> + debugging.
> +
> +:Working set estimation: Write ``+ memcg_id node_id max_gen
> + [can_swap [full_scan]]`` to ``/sys/kernel/debug/lru_gen`` to invoke
> + the aging. It scans PTEs for hot pages and promotes them to the
> + youngest generation ``max_gen``. Then it creates a new generation
> + ``max_gen+1``. Set ``can_swap`` to ``1`` to scan for hot anon pages
> + when swap is off. Set ``full_scan`` to ``0`` to reduce the overhead
> + as well as the coverage when scanning PTEs.
> +
> +:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness
> + [nr_to_reclaim]]`` to ``/sys/kernel/debug/lru_gen`` to invoke the
> + eviction. It evicts generations less than or equal to ``min_gen``.
> + ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and
> + ``max_gen-1`` aren't fully aged and therefore can't be evicted. Use
> + ``nr_to_reclaim`` to limit the number of pages to evict.

I feel that /sys/kernel/debug/lru_gen is too overloaded.
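
To illustrate how much is packed into this one file, here is a minimal Python
sketch, not part of the patch, that only builds the two command strings
described above. The function names are hypothetical, and the sketch assumes
that a later optional argument can be given only if the earlier one is present,
which is how I read the nested brackets. Nothing is written to debugfs.

```python
def _optional_args(names, values):
    """Render trailing optional arguments; a later one needs the earlier ones."""
    out = []
    missing = None
    for name, value in zip(names, values):
        if value is None:
            missing = missing or name
        elif missing is not None:
            raise ValueError(f"{name} requires {missing}")
        else:
            out.append(str(int(value)))
    return out


def aging_command(memcg_id, node_id, max_gen, can_swap=None, full_scan=None):
    """Build '+ memcg_id node_id max_gen [can_swap [full_scan]]'."""
    head = ["+", str(memcg_id), str(node_id), str(max_gen)]
    tail = _optional_args(("can_swap", "full_scan"), (can_swap, full_scan))
    return " ".join(head + tail)


def eviction_command(memcg_id, node_id, min_gen, swappiness=None,
                     nr_to_reclaim=None, *, max_gen=None):
    """Build '- memcg_id node_id min_gen [swappiness [nr_to_reclaim]]'."""
    # The doc says min_gen should be less than max_gen-1; check only if known.
    if max_gen is not None and not min_gen < max_gen - 1:
        raise ValueError("min_gen should be less than max_gen-1")
    head = ["-", str(memcg_id), str(node_id), str(min_gen)]
    tail = _optional_args(("swappiness", "nr_to_reclaim"),
                          (swappiness, nr_to_reclaim))
    return " ".join(head + tail)


def join_commands(commands, delimiter=";"):
    """Concatenate commands with one of the documented delimiters."""
    if delimiter not in (",", ";"):
        raise ValueError("delimiter must be ',' or ';'")
    return delimiter.join(commands)


print(join_commands([aging_command(1, 0, 6), eviction_command(1, 0, 4, max_gen=6)]))
```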

> diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
> index 44365c4574a3..b48434300226 100644
> --- a/Documentation/vm/index.rst
> +++ b/Documentation/vm/index.rst
> @@ -25,6 +25,7 @@ algorithms.  If you are looking for advice on simply allocating memory, see the
>     ksm
>     memory-model
>     mmu_notifier
> +   multigen_lru
>     numa
>     overcommit-accounting
>     page_migration

Yu Zhao Feb. 16, 2022, 3:22 a.m. UTC | #3

On Mon, Feb 14, 2022 at 12:28:56PM +0200, Mike Rapoport wrote:

Thanks for reviewing.

> >  Documentation/admin-guide/mm/index.rst        |   1 +
> >  Documentation/admin-guide/mm/multigen_lru.rst | 121 ++++++++++++++
> >  Documentation/vm/index.rst                    |   1 +
> >  Documentation/vm/multigen_lru.rst             | 152 ++++++++++++++++++
> 
> Please consider splitting this patch into Documentation/admin-guide and
> Documentation/vm parts.

Will do.

> > +=====================
> > +Multigenerational LRU
> > +=====================
> > +
> > +Quick start
> > +===========
> 
> There is no explanation why one would want to use multigenerational LRU
> until the next section.
> 
> I think there should be an overview that explains why users would want to
> enable multigenerational LRU. 

Will do.

> > +Build configurations
> > +--------------------
> > +:Required: Set ``CONFIG_LRU_GEN=y``.
> 
> Maybe 
> 
> 	Set ``CONFIG_LRU_GEN=y`` to build kernel with multigenerational LRU

Will do.

> > +:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to enable the
> > + multigenerational LRU by default.
> > +
> > +Runtime configurations
> > +----------------------
> > +:Required: Write ``y`` to ``/sys/kernel/mm/lru_gen/enabled`` if
> > + ``CONFIG_LRU_GEN_ENABLED=n``.
> > +
> > +This file accepts different values to enable or disable the
> > +following features:
> 
> Maybe
> 
>   After multigenerational LRU is enabled, this file accepts different
>   values to enable or disable the following features:

Will do.

> > +====== ========
> > +Values Features
> > +====== ========
> > +0x0001 the multigenerational LRU
> 
> The multigenerational LRU what?

Itself? This depends on the POV, and I'm trying to determine what would
be the natural way to present it.

MGLRU itself could be seen as an add-on atop the existing page reclaim
or an alternative in parallel. The latter would be similar to sl[aou]b,
and that's how I personally see it.

But here I presented it more like the former because I feel this way is
more natural to users because they are like switches on a single panel.

> What will happen if I write 0x2 to this file?

Just like turning on a branch breaker while leaving the main breaker
off in a circuit breaker box. This is how I see it, and I'm totally
fine with changing it to whatever you'd recommend.

> Please consider splitting "enable" and "features" attributes.

How about s/Features/Components/?

> > +0x0002 clear the accessed bit in leaf page table entries **in large
> > +       batches**, when MMU sets it (e.g., on x86)
> 
> Is extra markup really needed here...
> 
> > +0x0004 clear the accessed bit in non-leaf page table entries **as
> > +       well**, when MMU sets it (e.g., on x86)
> 
> ... and here?

Will do.

> As for the descriptions, what is the user-visible effect of these features?
> How different modes of clearing the access bit are reflected in, say, GUI
> responsiveness, database TPS, or probability of OOM?

These remain to be seen :) I just added these switches in v7, per Mel's
request from the meeting we had. These were never tested in the field.

> > +[yYnN] apply to all the features above
> > +====== ========
> > +
> > +E.g.,
> > +::
> > +
> > +    echo y >/sys/kernel/mm/lru_gen/enabled
> > +    cat /sys/kernel/mm/lru_gen/enabled
> > +    0x0007
> > +    echo 5 >/sys/kernel/mm/lru_gen/enabled
> > +    cat /sys/kernel/mm/lru_gen/enabled
> > +    0x0005
> > +
> > +Most users should enable or disable all the features unless some of
> > +them have unforeseen side effects.
> > +
> > +Recipes
> > +=======
> > +Personal computers
> > +------------------
> > +Personal computers are more sensitive to thrashing because it can
> > +cause janks (lags when rendering UI) and negatively impact user
> > +experience. The multigenerational LRU offers thrashing prevention to
> > +the majority of laptop and desktop users who don't have oomd.
> 
> I'd expect something like this paragraph in overview.
> 
> > +
> > +:Thrashing prevention: Write ``N`` to
> > + ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> > + ``N`` milliseconds from getting evicted. The OOM killer is triggered
> > + if this working set can't be kept in memory. Based on the average
> > + human detectable lag (~100ms), ``N=1000`` usually eliminates
> > + intolerable janks due to thrashing. Larger values like ``N=3000``
> > + make janks less noticeable at the risk of premature OOM kills.
> 
> > +
> > +Data centers
> > +------------
> > +Data centers want to optimize job scheduling (bin packing) to improve
> > +memory utilizations. Job schedulers need to estimate whether a server
> > +can allocate a certain amount of memory for a new job, and this step
> > +is known as working set estimation, which doesn't impact the existing
> > +jobs running on this server. They also want to attempt freeing some
> > +cold memory from the existing jobs, and this step is known as proactive
> > +reclaim, which improves the chance of landing a new job successfully.
> 
> This paragraph also fits overview.

Will do.

> > +:Optional: Increase ``CONFIG_NR_LRU_GENS`` to support more generations
> > + for working set estimation and proactive reclaim.
> 
> Please add a note that this is build time option.

Will do.

> > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> 
> Is debugfs interface relevant only for datacenters? 

For the moment, yes.

> > + format:
> > + ::
> > +
> > +   memcg  memcg_id  memcg_path
> > +     node  node_id
> > +       min_gen  birth_time  anon_size  file_size
> > +       ...
> > +       max_gen  birth_time  anon_size  file_size
> > +
> > + ``min_gen`` is the oldest generation number and ``max_gen`` is the
> > + youngest generation number. ``birth_time`` is in milliseconds.
> 
> It's unclear what the birth_time reference point is. Is it milliseconds
> since system start, or is it measured some other way?

Good point. Will clarify.

> > + ``anon_size`` and ``file_size`` are in pages. The youngest generation
> > + represents the group of the MRU pages and the oldest generation
> > + represents the group of the LRU pages. For working set estimation, a
> 
> Please spell out MRU and LRU fully.

Will do.

> > + job scheduler writes to this file at a certain time interval to
> > + create new generations, and it ranks available servers based on the
> > + sizes of their cold memory defined by this time interval. For
> > + proactive reclaim, a job scheduler writes to this file before it
> > + tries to land a new job, and if it fails to materialize the cold
> > + memory without impacting the existing jobs, it retries on the next
> > + server according to the ranking result.
> 
> Is this knob only relevant for a job scheduler? Or can it be used in other
> use-cases as well?

There are other concrete use cases but I'm not ready to discuss them
yet.

> > + This file accepts commands in the following subsections. Multiple
> 
>                               ^ described

Will do.

Mike Rapoport Feb. 21, 2022, 9:01 a.m. UTC | #4

On Tue, Feb 15, 2022 at 08:22:10PM -0700, Yu Zhao wrote:
> On Mon, Feb 14, 2022 at 12:28:56PM +0200, Mike Rapoport wrote:
> 
> > > +====== ========
> > > +Values Features
> > > +====== ========
> > > +0x0001 the multigenerational LRU
> > 
> > The multigenerational LRU what?
> 
> Itself? This depends on the POV, and I'm trying to determine what would
> be the natural way to present it.
> 
> MGLRU itself could be seen as an add-on atop the existing page reclaim
> or an alternative in parallel. The latter would be similar to sl[aou]b,
> and that's how I personally see it.
> 
> But here I presented it more like the former because I feel this way is
> more natural to users because they are like switches on a single panel.

Then I think it should be described as "enable multigenerational LRU" or
something like this.
 
> > What will happen if I write 0x2 to this file?
> 
> Just like turning on a branch breaker while leaving the main breaker
> off in a circuit breaker box. This is how I see it, and I'm totally
> fine with changing it to whatever you'd recommend.

My guess was that when bit 0 is clear, the rest do not matter :)
What's important, IMO, is that it is stated explicitly in the description.
 
> > Please consider splitting "enable" and "features" attributes.
> 
> How about s/Features/Components/?

I meant to use two attributes:

/sys/kernel/mm/lru_gen/enable for the main breaker, and
/sys/kernel/mm/lru_gen/features (or components) for the branch breakers
 
> > > +0x0002 clear the accessed bit in leaf page table entries **in large
> > > +       batches**, when MMU sets it (e.g., on x86)
> > 
> > Is extra markup really needed here...
> > 
> > > +0x0004 clear the accessed bit in non-leaf page table entries **as
> > > +       well**, when MMU sets it (e.g., on x86)
> > 
> > ... and here?
> 
> Will do.
> 
> > As for the descriptions, what is the user-visible effect of these features?
> > How different modes of clearing the access bit are reflected in, say, GUI
> > responsiveness, database TPS, or probability of OOM?
> 
> These remain to be seen :) I just added these switches in v7, per Mel's
> request from the meeting we had. These were never tested in the field.

I see :)

It would be nice to have a description and/or examples of user-visible
effects once there is some insight into what these features do.

> > > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> > 
> > Is debugfs interface relevant only for datacenters? 
> 
> For the moment, yes.

And what will happen if somebody uses these interfaces outside
datacenters? As soon as there is a sysfs interface, somebody will surely
play with it.

I think the job schedulers might be the most important user of that
interface, but the documentation should not presume it is the only user.
 
> > > + job scheduler writes to this file at a certain time interval to
> > > + create new generations, and it ranks available servers based on the
> > > + sizes of their cold memory defined by this time interval. For
> > > + proactive reclaim, a job scheduler writes to this file before it
> > > + tries to land a new job, and if it fails to materialize the cold
> > > + memory without impacting the existing jobs, it retries on the next
> > > + server according to the ranking result.
> > 
> > > Is this knob only relevant for a job scheduler? Or can it be used in other
> > > use-cases as well?
> 
> There are other concrete use cases but I'm not ready to discuss them
> yet.
 
Here as well, as soon as there is an interface it's not necessarily "job
scheduler" that will "write to this file", anybody can write to that file.
Please adjust the documentation to be more neutral regarding the use-cases.

Yu Zhao Feb. 22, 2022, 1:47 a.m. UTC | #5

On Mon, Feb 21, 2022 at 2:02 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Tue, Feb 15, 2022 at 08:22:10PM -0700, Yu Zhao wrote:
> > On Mon, Feb 14, 2022 at 12:28:56PM +0200, Mike Rapoport wrote:
> >
> > > > +====== ========
> > > > +Values Features
> > > > +====== ========
> > > > +0x0001 the multigenerational LRU
> > >
> > > The multigenerational LRU what?
> >
> > Itself? This depends on the POV, and I'm trying to determine what would
> > be the natural way to present it.
> >
> > MGLRU itself could be seen as an add-on atop the existing page reclaim
> > or an alternative in parallel. The latter would be similar to sl[aou]b,
> > and that's how I personally see it.
> >
> > But here I presented it more like the former because I feel this way is
> > more natural to users because they are like switches on a single panel.
>
> Then I think it should be described as "enable multigenerational LRU" or
> something like this.

Will do.

> > > What will happen if I write 0x2 to this file?
> >
> > Just like turning on a branch breaker while leaving the main breaker
> > off in a circuit breaker box. This is how I see it, and I'm totally
> > fine with changing it to whatever you'd recommend.
>
> My guess was that when bit 0 is clear, the rest do not matter :)
> What's important, IMO, is that it is stated explicitly in the description.

Will do.

> > > Please consider splitting "enable" and "features" attributes.
> >
> > How about s/Features/Components/?
>
> I meant to use two attributes:
>
> /sys/kernel/mm/lru_gen/enable for the main breaker, and
> /sys/kernel/mm/lru_gen/features (or components) for the branch breakers

It's a bit superfluous for my taste. I generally consider multiple
items to fall into the same category if they can be expressed by a
type of array, and I usually pack an array into a single file.

From your last review, I gauged this would be too overloaded for your
taste. So I'd be happy to make the change if you think two files look
more intuitive from a user's perspective.

> > > > +0x0002 clear the accessed bit in leaf page table entries **in large
> > > > +       batches**, when MMU sets it (e.g., on x86)
> > >
> > > Is extra markup really needed here...
> > >
> > > > +0x0004 clear the accessed bit in non-leaf page table entries **as
> > > > +       well**, when MMU sets it (e.g., on x86)
> > >
> > > ... and here?
> >
> > Will do.
> >
> > > As for the descriptions, what is the user-visible effect of these features?
> > > How different modes of clearing the access bit are reflected in, say, GUI
> > > responsiveness, database TPS, or probability of OOM?
> >
> > These remain to be seen :) I just added these switches in v7, per Mel's
> > request from the meeting we had. These were never tested in the field.
>
> I see :)
>
> It would be nice to have a description and/or examples of user-visible
> effects once there is some insight into what these features do.

How does the following sound?

Clearing the accessed bit in large batches can theoretically cause
lock contention (mmap_lock). If that happens, the 0x0002 switch can
disable this feature, in which case the multigenerational LRU suffers
a minor performance degradation.

Clearing the accessed bit in non-leaf page table entries was only
verified on Intel and AMD. If it causes problems on other x86
varieties, the 0x0004 switch can disable this feature, in which case
the multigenerational LRU suffers a negligible performance
degradation.
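
To spell out the switch semantics discussed in this thread, here is a minimal
Python sketch of how a value written to or read from the ``enabled`` file could
be interpreted. It is only an illustration: the flag names ``CORE``,
``MM_WALK`` and ``NONLEAF_YOUNG`` and the helper functions are hypothetical,
and the rule that the other bits have no effect while bit 0 is clear follows
the main breaker analogy above rather than anything verified against the code.

```python
from enum import IntFlag


class LruGenCaps(IntFlag):
    # Hypothetical names for the three documented values.
    CORE = 0x0001           # the multigenerational LRU itself (main breaker)
    MM_WALK = 0x0002        # clear accessed bits in leaf entries in large batches
    NONLEAF_YOUNG = 0x0004  # also clear accessed bits in non-leaf entries


ALL_CAPS = LruGenCaps.CORE | LruGenCaps.MM_WALK | LruGenCaps.NONLEAF_YOUNG


def parse_enabled(text):
    """Interpret [yYnN] or a number such as '5' or '0x0007'."""
    text = text.strip()
    if text in ("y", "Y"):
        return ALL_CAPS
    if text in ("n", "N"):
        return LruGenCaps(0)
    return LruGenCaps(int(text, 0))


def effective(caps):
    """Branch breakers do nothing while the main breaker (bit 0) is off."""
    if not caps & LruGenCaps.CORE:
        return LruGenCaps(0)
    return caps


def format_enabled(caps):
    """Render in the same style as the documented example output."""
    return f"0x{int(caps):04x}"


print(format_enabled(parse_enabled("5")))             # value from the doc example
print(format_enabled(effective(parse_enabled("0x2"))))  # bit 0 clear
```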

> > > > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> > >
> > > Is debugfs interface relevant only for datacenters?
> >
> > For the moment, yes.
>
> And what will happen if somebody uses these interfaces outside
> datacenters? As soon as there is a sysfs interface, somebody will surely
> play with it.
>
> I think the job schedulers might be the most important user of that
> interface, but the documentation should not presume it is the only user.

Other ideas are more like brainstorming than concrete use cases, e.g.,
for desktop users, these interfaces can in theory speed up hibernation
(suspend to disk); for VM users, they can again in theory support auto
ballooning. These niches are really minor and less explored compared
with the data center use cases which have been dominant.

I was hoping we could focus on the essentials and take one step at a
time. Later on, if there is additional demand and resources, we can
expand to cover more use cases.

> > > > + job scheduler writes to this file at a certain time interval to
> > > > + create new generations, and it ranks available servers based on the
> > > > + sizes of their cold memory defined by this time interval. For
> > > > + proactive reclaim, a job scheduler writes to this file before it
> > > > + tries to land a new job, and if it fails to materialize the cold
> > > > + memory without impacting the existing jobs, it retries on the next
> > > > + server according to the ranking result.
> > >
> > > Is this knob only relevant for a job scheduler? Or can it be used in other
> > > use-cases as well?
> >
> > There are other concrete use cases but I'm not ready to discuss them
> > yet.
>
> Here as well, as soon as there is an interface it's not necessarily "job
> scheduler" that will "write to this file", anybody can write to that file.
> Please adjust the documentation to be more neutral regarding the use-cases.

Will do.

Mike Rapoport Feb. 23, 2022, 10:58 a.m. UTC | #6

On Mon, Feb 21, 2022 at 06:47:25PM -0700, Yu Zhao wrote:
> On Mon, Feb 21, 2022 at 2:02 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Tue, Feb 15, 2022 at 08:22:10PM -0700, Yu Zhao wrote:
> > > > Please consider splitting "enable" and "features" attributes.
> > >
> > > How about s/Features/Components/?
> >
> > I meant to use two attributes:
> >
> > /sys/kernel/mm/lru_gen/enable for the main breaker, and
> > /sys/kernel/mm/lru_gen/features (or components) for the branch breakers
> 
> It's a bit superfluous for my taste. I generally consider multiple
> items to fall into the same category if they can be expressed by a
> type of array, and I usually pack an array into a single file.
> 
> From your last review, I gauged this would be too overloaded for your
> taste. So I'd be happy to make the change if you think two files look
> more intuitive from a user's perspective.
 
I do think that two attributes are more user-friendly, but I don't feel
strongly about it.

> > > > As for the descriptions, what is the user-visible effect of these features?
> > > > How different modes of clearing the access bit are reflected in, say, GUI
> > > > responsiveness, database TPS, or probability of OOM?
> > >
> > > These remain to be seen :) I just added these switches in v7, per Mel's
> > > request from the meeting we had. These were never tested in the field.
> >
> > I see :)
> >
> > It would be nice to have a description and/or examples of user-visible
> > effects once there is some insight into what these features do.
> 
> How does the following sound?
> 
> Clearing the accessed bit in large batches can theoretically cause
> lock contention (mmap_lock), and if it happens the 0x0002 switch can
> disable this feature. In this case the multigenerational LRU suffers a
> minor performance degradation.
> Clearing the accessed bit in non-leaf page table entries was only
> verified on Intel and AMD, and if it causes problems on other x86
> varieties the 0x0004 switch can disable this feature. In this case the
> multigenerational LRU suffers a negligible performance degradation.
 
LGTM
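
To make the switch semantics above concrete, here is a minimal sketch of
how such a bitmask could be composed before it is written to the sysfs
file. Only the 0x0002 and 0x0004 values come from this thread; the enum
names, the 0x0001 value for the main switch, and the "set bit means
enabled" convention are assumptions for illustration, not the patch's
definitive interface.

```python
from enum import IntFlag


class LruGenSwitch(IntFlag):
    # Hypothetical names; only 0x0002 and 0x0004 are taken from the thread.
    MAIN = 0x0001           # assumed value for the main switch
    BATCH_CLEAR = 0x0002    # clearing the accessed bit in large batches
    NONLEAF_CLEAR = 0x0004  # clearing the accessed bit in non-leaf entries


def disable(current: int, switch: LruGenSwitch) -> int:
    """Return the mask with one switch cleared and the others untouched."""
    return current & ~int(switch)


def format_mask(mask: int) -> str:
    """Render the mask as a four-digit hex string, e.g. 0x0007."""
    return f"0x{mask:04x}"


all_on = int(
    LruGenSwitch.MAIN | LruGenSwitch.BATCH_CLEAR | LruGenSwitch.NONLEAF_CLEAR
)

# Turn off batched clearing if mmap_lock contention is observed.
no_batch = disable(all_on, LruGenSwitch.BATCH_CLEAR)
print(format_mask(no_batch))
```

The sketch only computes the value; writing it to the attribute is left
out because the file name and layout are still under discussion here.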

> > > > > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> > > >
> > > > Is debugfs interface relevant only for datacenters?
> > >
> > > For the moment, yes.
> >
> > And what will happen if somebody uses these interfaces outside
> > datacenters? As soon as there is a sysfs interface, somebody will surely
> > play with it.
> >
> > I think the job schedulers might be the most important user of that
> > interface, but the documentation should not presume it is the only user.
> 
> Other ideas are more like brainstorming than concrete use cases, e.g.,
> for desktop users, these interfaces can in theory speed up hibernation
> (suspend to disk); for VM users, they can again in theory support auto
> ballooning. These niches are really minor and less explored compared
> with the data center use cases which have been dominant.
> 
> I was hoping we could focus on the essential and take one step at a
> time. Later on, if there is additional demand and resource, then we
> expand to cover more use cases.

Apparently I was not clear :)

I didn't mean that you should describe other use-cases, I rather suggested
to make the documentation more neutral, e.g. using "a user writes to this
file ..." instead of "job scheduler writes to a file ...". Or maybe add a
sentence in the beginning of the "Data centers" section, for instance:

Data centers
------------

+ A representative example of multigenerational LRU users are job
schedulers.

Data centers want to optimize job scheduling (bin packing) to improve
memory utilizations. Job schedulers need to estimate whether a server

Yu Zhao Feb. 23, 2022, 9:20 p.m. UTC | #7

On Wed, Feb 23, 2022 at 3:58 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Mon, Feb 21, 2022 at 06:47:25PM -0700, Yu Zhao wrote:
> > On Mon, Feb 21, 2022 at 2:02 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > On Tue, Feb 15, 2022 at 08:22:10PM -0700, Yu Zhao wrote:
> > > > > Please consider splitting "enable" and "features" attributes.
> > > >
> > > > How about s/Features/Components/?
> > >
> > > I meant to use two attributes:
> > >
> > > /sys/kernel/mm/lru_gen/enable for the main breaker, and
> > > /sys/kernel/mm/lru_gen/features (or components) for the branch breakers
> >
> > It's a bit superfluous for my taste. I generally consider multiple
> > items to fall into the same category if they can be expressed by a
> > type of array, and I usually pack an array into a single file.
> >
> > From your last review, I gauged this would be too overloaded for your
> > taste. So I'd be happy to make the change if you think two files look
> > more intuitive from user's perspective.
>
> I do think that two attributes are more user-friendly, but I don't feel
> strongly about it.
>
> > > > > As for the descriptions, what is the user-visible effect of these features?
> > > > > How different modes of clearing the access bit are reflected in, say, GUI
> > > > > responsiveness, database TPS, or probability of OOM?
> > > >
> > > > These remain to be seen :) I just added these switches in v7, per Mel's
> > > > request from the meeting we had. These were never tested in the field.
> > >
> > > I see :)
> > >
> > > It would be nice to have a description and/or examples of user-visible
> > > effects once there is some insight into what these features do.
> >
> > How does the following sound?
> >
> > Clearing the accessed bit in large batches can theoretically cause
> > lock contention (mmap_lock), and if it happens the 0x0002 switch can
> > disable this feature. In this case the multigenerational LRU suffers a
> > minor performance degradation.
> > Clearing the accessed bit in non-leaf page table entries was only
> > verified on Intel and AMD, and if it causes problems on other x86
> > varieties the 0x0004 switch can disable this feature. In this case the
> > multigenerational LRU suffers a negligible performance degradation.
>
> LGTM
>
> > > > > > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> > > > >
> > > > > Is debugfs interface relevant only for datacenters?
> > > >
> > > > For the moment, yes.
> > >
> > > And what will happen if somebody uses these interfaces outside
> > > datacenters? As soon as there is a sysfs interface, somebody will surely
> > > play with it.
> > >
> > > I think the job schedulers might be the most important user of that
> > > interface, but the documentation should not presume it is the only user.
> >
> > Other ideas are more like brainstorming than concrete use cases, e.g.,
> > for desktop users, these interfaces can in theory speed up hibernation
> > (suspend to disk); for VM users, they can again in theory support auto
> > ballooning. These niches are really minor and less explored compared
> > with the data center use cases which have been dominant.
> >
> > I was hoping we could focus on the essential and take one step at a
> > time. Later on, if there is additional demand and resource, then we
> > expand to cover more use cases.
>
> Apparently I was not clear :)
>
> I didn't mean that you should describe other use-cases, I rather suggested
> to make the documentation more neutral, e.g. using "a user writes to this
> file ..." instead of "job scheduler writes to a file ...". Or maybe add a
> sentence in the beginning of the "Data centers" section, for instance:
>
> Data centers
> ------------
>
> + A representative example of multigenerational LRU users are job
> schedulers.
>
> Data centers want to optimize job scheduling (bin packing) to improve
> memory utilizations. Job schedulers need to estimate whether a server

Yes, that makes sense. Will do. Thanks.
