[v7,12/12] mm: multigenerational LRU: documentation

Message ID 20220208081902.3550911-13-yuzhao@google.com (mailing list archive)
State New, archived
Series Multigenerational LRU Framework

Commit Message

Yu Zhao Feb. 8, 2022, 8:19 a.m. UTC
Add a design doc and an admin guide.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
---
 Documentation/admin-guide/mm/index.rst        |   1 +
 Documentation/admin-guide/mm/multigen_lru.rst | 121 ++++++++++++++
 Documentation/vm/index.rst                    |   1 +
 Documentation/vm/multigen_lru.rst             | 152 ++++++++++++++++++
 4 files changed, 275 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
 create mode 100644 Documentation/vm/multigen_lru.rst

Comments

Yu Zhao Feb. 8, 2022, 8:44 a.m. UTC | #1
On Tue, Feb 08, 2022 at 01:19:02AM -0700, Yu Zhao wrote:
> Add a design doc and an admin guide.
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> ---
>  Documentation/admin-guide/mm/index.rst        |   1 +
>  Documentation/admin-guide/mm/multigen_lru.rst | 121 ++++++++++++++
>  Documentation/vm/index.rst                    |   1 +
>  Documentation/vm/multigen_lru.rst             | 152 ++++++++++++++++++
>  4 files changed, 275 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
>  create mode 100644 Documentation/vm/multigen_lru.rst

Refactored the doc into a separate patch as requested here:
https://lore.kernel.org/linux-mm/Yd73pDkMOMVHhXzu@kernel.org/

Reworked the doc as requested here:
https://lore.kernel.org/linux-mm/YdwKB3SfF7hkB9Xv@kernel.org/

<snipped>
Mike Rapoport Feb. 14, 2022, 10:28 a.m. UTC | #2
Hi,

On Tue, Feb 08, 2022 at 01:19:02AM -0700, Yu Zhao wrote:
> Add a design doc and an admin guide.
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> ---
>  Documentation/admin-guide/mm/index.rst        |   1 +
>  Documentation/admin-guide/mm/multigen_lru.rst | 121 ++++++++++++++
>  Documentation/vm/index.rst                    |   1 +
>  Documentation/vm/multigen_lru.rst             | 152 ++++++++++++++++++

Please consider splitting this patch into Documentation/admin-guide and
Documentation/vm parts.

For now I only had time to review the admin-guide part.

>  4 files changed, 275 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
>  create mode 100644 Documentation/vm/multigen_lru.rst
> 
> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
> index c21b5823f126..2cf5bae62036 100644
> --- a/Documentation/admin-guide/mm/index.rst
> +++ b/Documentation/admin-guide/mm/index.rst
> @@ -32,6 +32,7 @@ the Linux memory management.
>     idle_page_tracking
>     ksm
>     memory-hotplug
> +   multigen_lru
>     nommu-mmap
>     numa_memory_policy
>     numaperf
> diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
> new file mode 100644
> index 000000000000..16a543c8b886
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/multigen_lru.rst
> @@ -0,0 +1,121 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Multigenerational LRU
> +=====================
> +
> +Quick start
> +===========

There is no explanation why one would want to use multigenerational LRU
until the next section.

I think there should be an overview that explains why users would want to
enable multigenerational LRU. 

> +Build configurations
> +--------------------
> +:Required: Set ``CONFIG_LRU_GEN=y``.

Maybe 

	Set ``CONFIG_LRU_GEN=y`` to build kernel with multigenerational LRU

> +
> +:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to enable the
> + multigenerational LRU by default.
> +
> +Runtime configurations
> +----------------------
> +:Required: Write ``y`` to ``/sys/kernel/mm/lru_gen/enable`` if
> + ``CONFIG_LRU_GEN_ENABLED=n``.
> +
> +This file accepts different values to enabled or disabled the
> +following features:

Maybe

  After multigenerational LRU is enabled, this file accepts different
  values to enable or disable the following features:

> +====== ========
> +Values Features
> +====== ========
> +0x0001 the multigenerational LRU

The multigenerational LRU what?

What will happen if I write 0x2 to this file?
Please consider splitting "enable" and "features" attributes.

> +0x0002 clear the accessed bit in leaf page table entries **in large
> +       batches**, when MMU sets it (e.g., on x86)

Is extra markup really needed here...

> +0x0004 clear the accessed bit in non-leaf page table entries **as
> +       well**, when MMU sets it (e.g., on x86)

... and here?

As for the descriptions, what is the user-visible effect of these features?
How different modes of clearing the access bit are reflected in, say, GUI
responsiveness, database TPS, or probability of OOM?

> +[yYnN] apply to all the features above
> +====== ========
> +
> +E.g.,
> +::
> +
> +    echo y >/sys/kernel/mm/lru_gen/enabled
> +    cat /sys/kernel/mm/lru_gen/enabled
> +    0x0007
> +    echo 5 >/sys/kernel/mm/lru_gen/enabled
> +    cat /sys/kernel/mm/lru_gen/enabled
> +    0x0005
> +
> +Most users should enable or disable all the features unless some of
> +them have unforeseen side effects.
> +
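
For illustration only, the value read back from the file could be decoded
along these lines. This is a minimal sketch: the function name and the short
labels (mglru, leaf_batch, nonleaf) are made up here and are not part of the
patch; only the bit values come from the table above.

```shell
# Hypothetical helper (not part of the patch): decode a value read from
# /sys/kernel/mm/lru_gen/enabled into labels for the three documented bits.
lru_gen_decode() {
    mask=$(( $1 ))   # accepts decimal or 0x-prefixed hex
    out=""
    if [ $(( mask & 0x0001 )) -ne 0 ]; then out="$out mglru"; fi
    if [ $(( mask & 0x0002 )) -ne 0 ]; then out="$out leaf_batch"; fi
    if [ $(( mask & 0x0004 )) -ne 0 ]; then out="$out nonleaf"; fi
    # Trim the leading space; print "none" when no bit is set.
    out=${out# }
    echo "${out:-none}"
}
```

With the example above, `lru_gen_decode 0x0005` would list the first and the
third rows of the table.
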
> +Recipes
> +=======
> +Personal computers
> +------------------
> +Personal computers are more sensitive to thrashing because it can
> +cause janks (lags when rendering UI) and negatively impact user
> +experience. The multigenerational LRU offers thrashing prevention to
> +the majority of laptop and desktop users who don't have oomd.

I'd expect something like this paragraph in overview.

> +
> +:Thrashing prevention: Write ``N`` to
> + ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> + ``N`` milliseconds from getting evicted. The OOM killer is triggered
> + if this working set can't be kept in memory. Based on the average
> + human detectable lag (~100ms), ``N=1000`` usually eliminates
> + intolerable janks due to thrashing. Larger values like ``N=3000``
> + make janks less noticeable at the risk of premature OOM kills.
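
A minimal sketch of how the write could be wrapped, assuming only what the
paragraph above states (N is a number of milliseconds written to
min_ttl_ms). The helper name and the optional path argument are hypothetical;
the path argument exists only so the helper can be tried against a scratch
file.

```shell
# Hypothetical helper: validate N (milliseconds) and write it to the
# min_ttl_ms file. The second argument overrides the path for dry runs.
set_min_ttl_ms() {
    ms=$1
    path=${2:-/sys/kernel/mm/lru_gen/min_ttl_ms}
    case $ms in
        ''|*[!0-9]*)
            echo "min_ttl_ms: not a non-negative integer: $ms" >&2
            return 1
            ;;
    esac
    echo "$ms" >"$path"
}
```

E.g., `set_min_ttl_ms 1000` for the N=1000 case mentioned above.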

> +
> +Data centers
> +------------
> +Data centers want to optimize job scheduling (bin packing) to improve
> +memory utilizations. Job schedulers need to estimate whether a server
> +can allocate a certain amount of memory for a new job, and this step
> +is known as working set estimation, which doesn't impact the existing
> +jobs running on this server. They also want to attempt freeing some
> +cold memory from the existing jobs, and this step is known as proactive
> +reclaim, which improves the chance of landing a new job successfully.

This paragraph also fits overview.

> +
> +:Optional: Increase ``CONFIG_NR_LRU_GENS`` to support more generations
> + for working set estimation and proactive reclaim.

Please add a note that this is build time option.

> +
> +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following

Is debugfs interface relevant only for datacenters? 

> + format:
> + ::
> +
> +   memcg  memcg_id  memcg_path
> +     node  node_id
> +       min_gen  birth_time  anon_size  file_size
> +       ...
> +       max_gen  birth_time  anon_size  file_size
> +
> + ``min_gen`` is the oldest generation number and ``max_gen`` is the
> + youngest generation number. ``birth_time`` is in milliseconds.

It's unclear what is birth_time reference point. Is it milliseconds from
the system start or it is measured some other way?

> + ``anon_size`` and ``file_size`` are in pages. The youngest generation
> + represents the group of the MRU pages and the oldest generation
> + represents the group of the LRU pages. For working set estimation, a

Please spell out MRU and LRU fully.

> + job scheduler writes to this file at a certain time interval to
> + create new generations, and it ranks available servers based on the
> + sizes of their cold memory defined by this time interval. For
> + proactive reclaim, a job scheduler writes to this file before it
> + tries to land a new job, and if it fails to materialize the cold
> + memory without impacting the existing jobs, it retries on the next
> + server according to the ranking result.

Is this knob only relevant for a job scheduler? Or it can be used in other
use-cases as well?
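
To make the format concrete, here is a rough sketch of how a reader of this
file might flatten it into one line per generation. It assumes that the
``memcg`` and ``node`` lines start with those literal keywords and that each
generation line has exactly four fields, as the layout above suggests; the
function name is hypothetical.

```shell
# Hypothetical helper: read the listing on stdin and print one line per
# generation in the form:
#   memcg_id node_id gen birth_time anon_size file_size
lru_gen_flatten() {
    awk '
        $1 == "memcg" { memcg = $2; next }   # remember the current memcg id
        $1 == "node"  { node = $2; next }    # remember the current node id
        NF == 4       { print memcg, node, $1, $2, $3, $4 }
    '
}
```

Usage would be something like `lru_gen_flatten </sys/kernel/debug/lru_gen`.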

> +
> + This file accepts commands in the following subsections. Multiple

                              ^ described

> + command lines are supported, so does concatenation with delimiters
> + ``,`` and ``;``.
> +
> + ``/sys/kernel/debug/lru_gen_full`` contains additional stats for
> + debugging.
> +
> +:Working set estimation: Write ``+ memcg_id node_id max_gen
> + [can_swap [full_scan]]`` to ``/sys/kernel/debug/lru_gen`` to invoke
> + the aging. It scans PTEs for hot pages and promotes them to the
> + youngest generation ``max_gen``. Then it creates a new generation
> + ``max_gen+1``. Set ``can_swap`` to ``1`` to scan for hot anon pages
> + when swap is off. Set ``full_scan`` to ``0`` to reduce the overhead
> + as well as the coverage when scanning PTEs.
> +
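
As an illustration of the optional arguments, a sketch that only builds the
command string described above. The helper name is hypothetical, and the
sketch does not validate the values themselves, only the argument count
implied by ``[can_swap [full_scan]]``.

```shell
# Hypothetical helper: build the aging command
#   + memcg_id node_id max_gen [can_swap [full_scan]]
# It prints the string; writing it to the debugfs file is left to the caller.
lru_gen_aging_cmd() {
    [ $# -ge 3 ] && [ $# -le 5 ] || {
        echo "usage: lru_gen_aging_cmd memcg_id node_id max_gen [can_swap [full_scan]]" >&2
        return 1
    }
    echo "+ $*"
}
```

E.g., `lru_gen_aging_cmd 1 0 7 >/sys/kernel/debug/lru_gen`, where 1, 0 and 7
are placeholder values.
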
> +:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness
> + [nr_to_reclaim]]`` to ``/sys/kernel/debug/lru_gen`` to invoke the
> + eviction. It evicts generations less than or equal to ``min_gen``.
> + ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and
> + ``max_gen-1`` aren't fully aged and therefore can't be evicted. Use
> + ``nr_to_reclaim`` to limit the number of pages to evict.

I feel that /sys/kernel/debug/lru_gen is too overloaded.
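
For the eviction command, the constraint that ``min_gen`` should be less than
``max_gen-1`` could be checked before writing. A minimal sketch, assuming the
caller already knows the current ``max_gen`` (e.g., from reading the file);
the helper name and its extra leading argument are hypothetical.

```shell
# Hypothetical helper: build the eviction command
#   - memcg_id node_id min_gen [swappiness [nr_to_reclaim]]
# The first argument is the current max_gen and is used only for the check
# that min_gen is less than max_gen-1; it is not part of the command.
lru_gen_evict_cmd() {
    [ $# -ge 4 ] && [ $# -le 6 ] || {
        echo "usage: lru_gen_evict_cmd max_gen memcg_id node_id min_gen [swappiness [nr_to_reclaim]]" >&2
        return 1
    }
    max_gen=$1
    shift
    if [ "$3" -ge $(( max_gen - 1 )) ]; then
        echo "min_gen $3 must be less than max_gen-1 ($(( max_gen - 1 )))" >&2
        return 1
    fi
    echo "- $*"
}
```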

> diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
> index 44365c4574a3..b48434300226 100644
> --- a/Documentation/vm/index.rst
> +++ b/Documentation/vm/index.rst
> @@ -25,6 +25,7 @@ algorithms.  If you are looking for advice on simply allocating memory, see the
>     ksm
>     memory-model
>     mmu_notifier
> +   multigen_lru
>     numa
>     overcommit-accounting
>     page_migration
Yu Zhao Feb. 16, 2022, 3:22 a.m. UTC | #3
On Mon, Feb 14, 2022 at 12:28:56PM +0200, Mike Rapoport wrote:

Thanks for reviewing.

> >  Documentation/admin-guide/mm/index.rst        |   1 +
> >  Documentation/admin-guide/mm/multigen_lru.rst | 121 ++++++++++++++
> >  Documentation/vm/index.rst                    |   1 +
> >  Documentation/vm/multigen_lru.rst             | 152 ++++++++++++++++++
> 
> Please consider splitting this patch into Documentation/admin-guide and
> Documentation/vm parts.

Will do.

> > +=====================
> > +Multigenerational LRU
> > +=====================
> > +
> > +Quick start
> > +===========
> 
> There is no explanation why one would want to use multigenerational LRU
> until the next section.
> 
> I think there should be an overview that explains why users would want to
> enable multigenerational LRU. 

Will do.

> > +Build configurations
> > +--------------------
> > +:Required: Set ``CONFIG_LRU_GEN=y``.
> 
> Maybe 
> 
> 	Set ``CONFIG_LRU_GEN=y`` to build kernel with multigenerational LRU

Will do.

> > +:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to enable the
> > + multigenerational LRU by default.
> > +
> > +Runtime configurations
> > +----------------------
> > +:Required: Write ``y`` to ``/sys/kernel/mm/lru_gen/enable`` if
> > + ``CONFIG_LRU_GEN_ENABLED=n``.
> > +
> > +This file accepts different values to enabled or disabled the
> > +following features:
> 
> Maybe
> 
>   After multigenerational LRU is enabled, this file accepts different
>   values to enable or disable the following features:

Will do.

> > +====== ========
> > +Values Features
> > +====== ========
> > +0x0001 the multigenerational LRU
> 
> The multigenerational LRU what?

Itself? This depends on the POV, and I'm trying to determine what would
be the natural way to present it.

MGLRU itself could be seen as an add-on atop the existing page reclaim
or an alternative in parallel. The latter would be similar to sl[aou]b,
and that's how I personally see it.

But here I presented it more like the former because I feel this way is
more natural to users because they are like switches on a single panel.

> What will happen if I write 0x2 to this file?

Just like turning on a branch breaker while leaving the main breaker
off in a circuit breaker box. This is how I see it, and I'm totally
fine with changing it to whatever you'd recommend.

> Please consider splitting "enable" and "features" attributes.

How about s/Features/Components/?

> > +0x0002 clear the accessed bit in leaf page table entries **in large
> > +       batches**, when MMU sets it (e.g., on x86)
> 
> Is extra markup really needed here...
> 
> > +0x0004 clear the accessed bit in non-leaf page table entries **as
> > +       well**, when MMU sets it (e.g., on x86)
> 
> ... and here?

Will do.

> As for the descriptions, what is the user-visible effect of these features?
> How different modes of clearing the access bit are reflected in, say, GUI
> responsiveness, database TPS, or probability of OOM?

These remain to be seen :) I just added these switches in v7, per Mel's
request from the meeting we had. These were never tested in the field.

> > +[yYnN] apply to all the features above
> > +====== ========
> > +
> > +E.g.,
> > +::
> > +
> > +    echo y >/sys/kernel/mm/lru_gen/enabled
> > +    cat /sys/kernel/mm/lru_gen/enabled
> > +    0x0007
> > +    echo 5 >/sys/kernel/mm/lru_gen/enabled
> > +    cat /sys/kernel/mm/lru_gen/enabled
> > +    0x0005
> > +
> > +Most users should enable or disable all the features unless some of
> > +them have unforeseen side effects.
> > +
> > +Recipes
> > +=======
> > +Personal computers
> > +------------------
> > +Personal computers are more sensitive to thrashing because it can
> > +cause janks (lags when rendering UI) and negatively impact user
> > +experience. The multigenerational LRU offers thrashing prevention to
> > +the majority of laptop and desktop users who don't have oomd.
> 
> I'd expect something like this paragraph in overview.
> 
> > +
> > +:Thrashing prevention: Write ``N`` to
> > + ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> > + ``N`` milliseconds from getting evicted. The OOM killer is triggered
> > + if this working set can't be kept in memory. Based on the average
> > + human detectable lag (~100ms), ``N=1000`` usually eliminates
> > + intolerable janks due to thrashing. Larger values like ``N=3000``
> > + make janks less noticeable at the risk of premature OOM kills.
> 
> > +
> > +Data centers
> > +------------
> > +Data centers want to optimize job scheduling (bin packing) to improve
> > +memory utilizations. Job schedulers need to estimate whether a server
> > +can allocate a certain amount of memory for a new job, and this step
> > +is known as working set estimation, which doesn't impact the existing
> > +jobs running on this server. They also want to attempt freeing some
> > +cold memory from the existing jobs, and this step is known as proactive
> > +reclaim, which improves the chance of landing a new job successfully.
> 
> This paragraph also fits overview.

Will do.

> > +:Optional: Increase ``CONFIG_NR_LRU_GENS`` to support more generations
> > + for working set estimation and proactive reclaim.
> 
> Please add a note that this is build time option.

Will do.

> > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> 
> Is debugfs interface relevant only for datacenters? 

For the moment, yes.

> > + format:
> > + ::
> > +
> > +   memcg  memcg_id  memcg_path
> > +     node  node_id
> > +       min_gen  birth_time  anon_size  file_size
> > +       ...
> > +       max_gen  birth_time  anon_size  file_size
> > +
> > + ``min_gen`` is the oldest generation number and ``max_gen`` is the
> > + youngest generation number. ``birth_time`` is in milliseconds.
> 
> It's unclear what is birth_time reference point. Is it milliseconds from
> the system start or it is measured some other way?

Good point. Will clarify.

> > + ``anon_size`` and ``file_size`` are in pages. The youngest generation
> > + represents the group of the MRU pages and the oldest generation
> > + represents the group of the LRU pages. For working set estimation, a
> 
> Please spell out MRU and LRU fully.

Will do.

> > + job scheduler writes to this file at a certain time interval to
> > + create new generations, and it ranks available servers based on the
> > + sizes of their cold memory defined by this time interval. For
> > + proactive reclaim, a job scheduler writes to this file before it
> > + tries to land a new job, and if it fails to materialize the cold
> > + memory without impacting the existing jobs, it retries on the next
> > + server according to the ranking result.
> 
> Is this knob only relevant for a job scheduler? Or it can be used in other
> use-cases as well?

There are other concrete use cases but I'm not ready to discuss them
yet.

> > + This file accepts commands in the following subsections. Multiple
> 
>                               ^ described

Will do.
Mike Rapoport Feb. 21, 2022, 9:01 a.m. UTC | #4
On Tue, Feb 15, 2022 at 08:22:10PM -0700, Yu Zhao wrote:
> On Mon, Feb 14, 2022 at 12:28:56PM +0200, Mike Rapoport wrote:
> 
> > > +====== ========
> > > +Values Features
> > > +====== ========
> > > +0x0001 the multigenerational LRU
> > 
> > The multigenerational LRU what?
> 
> Itself? This depends on the POV, and I'm trying to determine what would
> be the natural way to present it.
> 
> MGLRU itself could be seen as an add-on atop the existing page reclaim
> or an alternative in parallel. The latter would be similar to sl[aou]b,
> and that's how I personally see it.
> 
> But here I presented it more like the former because I feel this way is
> more natural to users because they are like switches on a single panel.

Then I think it should be described as "enable multigenerational LRU" or
something like this.
 
> > What will happen if I write 0x2 to this file?
> 
> Just like turning on a branch breaker while leaving the main breaker
> off in a circuit breaker box. This is how I see it, and I'm totally
> fine with changing it to whatever you'd recommend.

That was my guess that when bit 0 is clear the rest do not matter :)
What's important, IMO, is that it is stated explicitly in the description.
 
> > Please consider splitting "enable" and "features" attributes.
> 
> How about s/Features/Components/?

I meant to use two attributes:

/sys/kernel/mm/lru_gen/enable for the main breaker, and
/sys/kernel/mm/lru_gen/features (or components) for the branch breakers
 
> > > +0x0002 clear the accessed bit in leaf page table entries **in large
> > > +       batches**, when MMU sets it (e.g., on x86)
> > 
> > Is extra markup really needed here...
> > 
> > > +0x0004 clear the accessed bit in non-leaf page table entries **as
> > > +       well**, when MMU sets it (e.g., on x86)
> > 
> > ... and here?
> 
> Will do.
> 
> > As for the descriptions, what is the user-visible effect of these features?
> > How different modes of clearing the access bit are reflected in, say, GUI
> > responsiveness, database TPS, or probability of OOM?
> 
> These remain to be seen :) I just added these switches in v7, per Mel's
> request from the meeting we had. These were never tested in the field.

I see :)

It would be nice to have a description or/and examples of user-visible
effects when there will be some insight on what these features do.

> > > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> > 
> > Is debugfs interface relevant only for datacenters? 
> 
> For the moment, yes.

And what will happen if somebody uses these interfaces outside
datacenters? As soon as there is a sysfs interface, somebody will surely
play with it.

I think the job schedulers might be the most important user of that
interface, but the documentation should not presume it is the only user.
 
> > > + job scheduler writes to this file at a certain time interval to
> > > + create new generations, and it ranks available servers based on the
> > > + sizes of their cold memory defined by this time interval. For
> > > + proactive reclaim, a job scheduler writes to this file before it
> > > + tries to land a new job, and if it fails to materialize the cold
> > > + memory without impacting the existing jobs, it retries on the next
> > > + server according to the ranking result.
> > 
> > Is this knob only relevant for a job scheduler? Or it can be used in other
> > use-cases as well?
> 
> There are other concrete use cases but I'm not ready to discuss them
> yet.
 
Here as well, as soon as there is an interface it's not necessarily "job
scheduler" that will "write to this file", anybody can write to that file.
Please adjust the documentation to be more neutral regarding the use-cases.
Yu Zhao Feb. 22, 2022, 1:47 a.m. UTC | #5
On Mon, Feb 21, 2022 at 2:02 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Tue, Feb 15, 2022 at 08:22:10PM -0700, Yu Zhao wrote:
> > On Mon, Feb 14, 2022 at 12:28:56PM +0200, Mike Rapoport wrote:
> >
> > > > +====== ========
> > > > +Values Features
> > > > +====== ========
> > > > +0x0001 the multigenerational LRU
> > >
> > > The multigenerational LRU what?
> >
> > Itself? This depends on the POV, and I'm trying to determine what would
> > be the natural way to present it.
> >
> > MGLRU itself could be seen as an add-on atop the existing page reclaim
> > or an alternative in parallel. The latter would be similar to sl[aou]b,
> > and that's how I personally see it.
> >
> > But here I presented it more like the former because I feel this way is
> > more natural to users because they are like switches on a single panel.
>
> Then I think it should be described as "enable multigenerational LRU" or
> something like this.

Will do.

> > > What will happen if I write 0x2 to this file?
> >
> > Just like turning on a branch breaker while leaving the main breaker
> > off in a circuit breaker box. This is how I see it, and I'm totally
> > fine with changing it to whatever you'd recommend.
>
> That was my guess that when bit 0 is clear the rest do not matter :)
> What's important, IMO, is that it is stated explicitly in the description.

Will do.

> > > Please consider splitting "enable" and "features" attributes.
> >
> > How about s/Features/Components/?
>
> I meant to use two attributes:
>
> /sys/kernel/mm/lru_gen/enable for the main breaker, and
> /sys/kernel/mm/lru_gen/features (or components) for the branch breakers

It's a bit superfluous for my taste. I generally consider multiple
items to fall into the same category if they can be expressed by a
type of array, and I usually pack an array into a single file.

From your last review, I gauged this would be too overloaded for your
taste. So I'd be happy to make the change if you think two files look
more intuitive from user's perspective.

> > > > +0x0002 clear the accessed bit in leaf page table entries **in large
> > > > +       batches**, when MMU sets it (e.g., on x86)
> > >
> > > Is extra markup really needed here...
> > >
> > > > +0x0004 clear the accessed bit in non-leaf page table entries **as
> > > > +       well**, when MMU sets it (e.g., on x86)
> > >
> > > ... and here?
> >
> > Will do.
> >
> > > As for the descriptions, what is the user-visible effect of these features?
> > > How different modes of clearing the access bit are reflected in, say, GUI
> > > responsiveness, database TPS, or probability of OOM?
> >
> > These remain to be seen :) I just added these switches in v7, per Mel's
> > request from the meeting we had. These were never tested in the field.
>
> I see :)
>
> It would be nice to have a description or/and examples of user-visible
> effects when there will be some insight on what these features do.

How does the following sound?

Clearing the accessed bit in large batches can theoretically cause
lock contention (mmap_lock), and if it happens the 0x0002 switch can
disable this feature. In this case the multigenerational LRU suffers a
minor performance degradation.

Clearing the accessed bit in non-leaf page table entries was only
verified on Intel and AMD, and if it causes problems on other x86
varieties the 0x0004 switch can disable this feature. In this case the
multigenerational LRU suffers a negligible performance degradation.
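
To illustrate turning off a single switch while keeping the others, here is
a minimal sketch. The helper name is hypothetical; it prints the new value in
decimal, as in the ``echo 5`` example from the patch.

```shell
# Hypothetical helper: given the current mask and the switch to turn off,
# print the value to write back (decimal, like the "echo 5" example).
lru_gen_mask_without() {
    echo $(( $1 & ~$2 ))
}
```

E.g., `lru_gen_mask_without "$(cat /sys/kernel/mm/lru_gen/enabled)" 0x0002`
prints the value that keeps everything except the large-batch clearing.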

> > > > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> > >
> > > Is debugfs interface relevant only for datacenters?
> >
> > For the moment, yes.
>
> And what will happen if somebody uses these interfaces outside
> datacenters? As soon as there is a sysfs interface, somebody will surely
> play with it.
>
> I think the job schedulers might be the most important user of that
> interface, but the documentation should not presume it is the only user.

Other ideas are more like brainstorming than concrete use cases, e.g.,
for desktop users, these interfaces can in theory speed up hibernation
(suspend to disk); for VM users, they can again in theory support auto
ballooning. These niches are really minor and less explored compared
with the data center use cases which have been dominant.

I was hoping we could focus on the essential and take one step at a
time. Later on, if there is additional demand and resource, then we
expand to cover more use cases.

> > > > + job scheduler writes to this file at a certain time interval to
> > > > + create new generations, and it ranks available servers based on the
> > > > + sizes of their cold memory defined by this time interval. For
> > > > + proactive reclaim, a job scheduler writes to this file before it
> > > > + tries to land a new job, and if it fails to materialize the cold
> > > > + memory without impacting the existing jobs, it retries on the next
> > > > + server according to the ranking result.
> > >
> > > Is this knob only relevant for a job scheduler? Or it can be used in other
> > > use-cases as well?
> >
> > There are other concrete use cases but I'm not ready to discuss them
> > yet.
>
> Here as well, as soon as there is an interface it's not necessarily "job
> scheduler" that will "write to this file", anybody can write to that file.
> Please adjust the documentation to be more neutral regarding the use-cases.

Will do.
Mike Rapoport Feb. 23, 2022, 10:58 a.m. UTC | #6
On Mon, Feb 21, 2022 at 06:47:25PM -0700, Yu Zhao wrote:
> On Mon, Feb 21, 2022 at 2:02 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Tue, Feb 15, 2022 at 08:22:10PM -0700, Yu Zhao wrote:
> > > > Please consider splitting "enable" and "features" attributes.
> > >
> > > How about s/Features/Components/?
> >
> > I meant to use two attributes:
> >
> > /sys/kernel/mm/lru_gen/enable for the main breaker, and
> > /sys/kernel/mm/lru_gen/features (or components) for the branch breakers
> 
> It's a bit superfluous for my taste. I generally consider multiple
> items to fall into the same category if they can be expressed by a
> type of array, and I usually pack an array into a single file.
> 
> From your last review, I gauged this would be too overloaded for your
> taste. So I'd be happy to make the change if you think two files look
> more intuitive from user's perspective.
 
I do think that two attributes are more user-friendly, but I don't feel
strongly about it.

> > > > As for the descriptions, what is the user-visible effect of these features?
> > > > How different modes of clearing the access bit are reflected in, say, GUI
> > > > responsiveness, database TPS, or probability of OOM?
> > >
> > > These remain to be seen :) I just added these switches in v7, per Mel's
> > > request from the meeting we had. These were never tested in the field.
> >
> > I see :)
> >
> > It would be nice to have a description or/and examples of user-visible
> > effects when there will be some insight on what these features do.
> 
> How does the following sound?
> 
> Clearing the accessed bit in large batches can theoretically cause
> lock contention (mmap_lock), and if it happens the 0x0002 switch can
> disable this feature. In this case the multigenerational LRU suffers a
> minor performance degradation.
> Clearing the accessed bit in non-leaf page table entries was only
> verified on Intel and AMD, and if it causes problems on other x86
> varieties the 0x0004 switch can disable this feature. In this case the
> multigenerational LRU suffers a negligible performance degradation.
 
LGTM

> > > > > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> > > >
> > > > Is debugfs interface relevant only for datacenters?
> > >
> > > For the moment, yes.
> >
> > And what will happen if somebody uses these interfaces outside
> > datacenters? As soon as there is a sysfs interface, somebody will surely
> > play with it.
> >
> > I think the job schedulers might be the most important user of that
> > interface, but the documentation should not presume it is the only user.
> 
> Other ideas are more like brainstorming than concrete use cases, e.g.,
> for desktop users, these interfaces can in theory speed up hibernation
> (suspend to disk); for VM users, they can again in theory support auto
> ballooning. These niches are really minor and less explored compared
> with the data center use cases which have been dominant.
> 
> I was hoping we could focus on the essential and take one step at a
> time. Later on, if there is additional demand and resource, then we
> expand to cover more use cases.

Apparently I was not clear :)

I didn't mean that you should describe other use-cases, I rather suggested
to make the documentation more neutral, e.g. using "a user writes to this
file ..." instead of "job scheduler writes to a file ...". Or maybe add a
sentence in the beginning of the "Data centers" section, for instance:

Data centers
------------

+ A representative example of multigenerational LRU users are job
schedulers.

Data centers want to optimize job scheduling (bin packing) to improve
memory utilizations. Job schedulers need to estimate whether a server
Yu Zhao Feb. 23, 2022, 9:20 p.m. UTC | #7
On Wed, Feb 23, 2022 at 3:58 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Mon, Feb 21, 2022 at 06:47:25PM -0700, Yu Zhao wrote:
> > On Mon, Feb 21, 2022 at 2:02 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > On Tue, Feb 15, 2022 at 08:22:10PM -0700, Yu Zhao wrote:
> > > > > Please consider splitting "enable" and "features" attributes.
> > > >
> > > > How about s/Features/Components/?
> > >
> > > I meant to use two attributes:
> > >
> > > /sys/kernel/mm/lru_gen/enable for the main breaker, and
> > > /sys/kernel/mm/lru_gen/features (or components) for the branch breakers
> >
> > It's a bit superfluous for my taste. I generally consider multiple
> > items to fall into the same category if they can be expressed by a
> > type of array, and I usually pack an array into a single file.
> >
> > From your last review, I gauged this would be too overloaded for your
> > taste. So I'd be happy to make the change if you think two files look
> > more intuitive from user's perspective.
>
> I do think that two attributes are more user-friendly, but I don't feel
> strongly about it.
>
> > > > > As for the descriptions, what is the user-visible effect of these features?
> > > > > How different modes of clearing the access bit are reflected in, say, GUI
> > > > > responsiveness, database TPS, or probability of OOM?
> > > >
> > > > These remain to be seen :) I just added these switches in v7, per Mel's
> > > > request from the meeting we had. These were never tested in the field.
> > >
> > > I see :)
> > >
> > > It would be nice to have a description or/and examples of user-visible
> > > effects when there will be some insight on what these features do.
> >
> > How does the following sound?
> >
> > Clearing the accessed bit in large batches can theoretically cause
> > lock contention (mmap_lock), and if it happens the 0x0002 switch can
> > disable this feature. In this case the multigenerational LRU suffers a
> > minor performance degradation.
> > Clearing the accessed bit in non-leaf page table entries was only
> > verified on Intel and AMD, and if it causes problems on other x86
> > varieties the 0x0004 switch can disable this feature. In this case the
> > multigenerational LRU suffers a negligible performance degradation.
>
> LGTM
>
> > > > > > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> > > > >
> > > > > Is debugfs interface relevant only for datacenters?
> > > >
> > > > For the moment, yes.
> > >
> > > And what will happen if somebody uses these interfaces outside
> > > > datacenters? As soon as there is a sysfs interface, somebody will surely
> > > play with it.
> > >
> > > I think the job schedulers might be the most important user of that
> > > interface, but the documentation should not presume it is the only user.
> >
> > Other ideas are more like brainstorming than concrete use cases, e.g.,
> > > for desktop users, these interfaces can in theory speed up hibernation
> > (suspend to disk); for VM users, they can again in theory support auto
> > ballooning. These niches are really minor and less explored compared
> > with the data center use cases which have been dominant.
> >
> > I was hoping we could focus on the essential and take one step at a
> > time. Later on, if there is additional demand and resource, then we
> > expand to cover more use cases.
>
> Apparently I was not clear :)
>
> I didn't mean that you should describe other use-cases, I rather suggested
> to make the documentation more neutral, e.g. using "a user writes to this
> file ..." instead of "job scheduler writes to a file ...". Or maybe add a
> sentence in the beginning of the "Data centers" section, for instance:
>
> Data centers
> ------------
>
> + A representative example of multigenerational LRU users are job
> schedulers.
>
> Data centers want to optimize job scheduling (bin packing) to improve
> memory utilizations. Job schedulers need to estimate whether a server

Yes, that makes sense. Will do. Thanks.

Patch

diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index c21b5823f126..2cf5bae62036 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -32,6 +32,7 @@  the Linux memory management.
    idle_page_tracking
    ksm
    memory-hotplug
+   multigen_lru
    nommu-mmap
    numa_memory_policy
    numaperf
diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
new file mode 100644
index 000000000000..16a543c8b886
--- /dev/null
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -0,0 +1,121 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigenerational LRU
+=====================
+
+Quick start
+===========
+Build configurations
+--------------------
+:Required: Set ``CONFIG_LRU_GEN=y``.
+
+:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to enable the
+ multigenerational LRU by default.
+
+Runtime configurations
+----------------------
+:Required: Write ``y`` to ``/sys/kernel/mm/lru_gen/enabled`` if
+ ``CONFIG_LRU_GEN_ENABLED=n``.
+
+This file accepts different values to enable or disable the
+following features:
+
+====== ========
+Values Features
+====== ========
+0x0001 the multigenerational LRU
+0x0002 clear the accessed bit in leaf page table entries **in large
+       batches**, when MMU sets it (e.g., on x86)
+0x0004 clear the accessed bit in non-leaf page table entries **as
+       well**, when MMU sets it (e.g., on x86)
+[yYnN] apply to all the features above
+====== ========
+
+E.g.,
+::
+
+    echo y >/sys/kernel/mm/lru_gen/enabled
+    cat /sys/kernel/mm/lru_gen/enabled
+    0x0007
+    echo 5 >/sys/kernel/mm/lru_gen/enabled
+    cat /sys/kernel/mm/lru_gen/enabled
+    0x0005
+
+Most users should enable or disable all the features unless some of
+them have unforeseen side effects.
+
+Recipes
+=======
+Personal computers
+------------------
+Personal computers are more sensitive to thrashing because it can
+cause janks (lags when rendering UI) and negatively impact user
+experience. The multigenerational LRU offers thrashing prevention to
+the majority of laptop and desktop users who don't have oomd.
+
+:Thrashing prevention: Write ``N`` to
+ ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
+ ``N`` milliseconds from getting evicted. The OOM killer is triggered
+ if this working set can't be kept in memory. Based on the average
+ human detectable lag (~100ms), ``N=1000`` usually eliminates
+ intolerable janks due to thrashing. Larger values like ``N=3000``
+ make janks less noticeable at the risk of premature OOM kills.
+
+Data centers
+------------
+Data centers want to optimize job scheduling (bin packing) to improve
+memory utilization. Job schedulers need to estimate whether a server
+can allocate a certain amount of memory for a new job, and this step
+is known as working set estimation, which doesn't impact the existing
+jobs running on this server. They also want to attempt freeing some
+cold memory from the existing jobs, and this step is known as proactive
+reclaim, which improves the chance of landing a new job successfully.
+
+:Optional: Increase ``CONFIG_NR_LRU_GENS`` to support more generations
+ for working set estimation and proactive reclaim.
+
+:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
+ format:
+ ::
+
+   memcg  memcg_id  memcg_path
+     node  node_id
+       min_gen  birth_time  anon_size  file_size
+       ...
+       max_gen  birth_time  anon_size  file_size
+
+ ``min_gen`` is the oldest generation number and ``max_gen`` is the
+ youngest generation number. ``birth_time`` is in milliseconds.
+ ``anon_size`` and ``file_size`` are in pages. The youngest generation
+ represents the group of the MRU pages and the oldest generation
+ represents the group of the LRU pages. For working set estimation, a
+ job scheduler writes to this file at a certain time interval to
+ create new generations, and it ranks available servers based on the
+ sizes of their cold memory defined by this time interval. For
+ proactive reclaim, a job scheduler writes to this file before it
+ tries to land a new job, and if it fails to materialize the cold
+ memory without impacting the existing jobs, it retries on the next
+ server according to the ranking result.
+
+ This file accepts commands in the following subsections. Multiple
+ command lines are supported, as is concatenation with delimiters
+ ``,`` and ``;``.
+
+ ``/sys/kernel/debug/lru_gen_full`` contains additional stats for
+ debugging.
+
+:Working set estimation: Write ``+ memcg_id node_id max_gen
+ [can_swap [full_scan]]`` to ``/sys/kernel/debug/lru_gen`` to invoke
+ the aging. It scans PTEs for hot pages and promotes them to the
+ youngest generation ``max_gen``. Then it creates a new generation
+ ``max_gen+1``. Set ``can_swap`` to ``1`` to scan for hot anon pages
+ when swap is off. Set ``full_scan`` to ``0`` to reduce the overhead
+ as well as the coverage when scanning PTEs.
+
+:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness
+ [nr_to_reclaim]]`` to ``/sys/kernel/debug/lru_gen`` to invoke the
+ eviction. It evicts generations less than or equal to ``min_gen``.
+ ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and
+ ``max_gen-1`` aren't fully aged and therefore can't be evicted. Use
+ ``nr_to_reclaim`` to limit the number of pages to evict.
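As a rough illustration of the admin guide above, the ``enabled`` bit mask and
the two debugfs command formats can be sketched in Python. This is only a
sketch: the function names (decode_enabled, aging_command, eviction_command)
are hypothetical and not part of the patch, and the feature descriptions are
paraphrased. Only the bit values and the command syntax come from the
documentation.

```python
# Illustrative helpers only; the function names are hypothetical and not part
# of the patch. The bit values and command syntax are taken from the doc above.

LRU_GEN_FEATURES = {
    0x0001: "multigenerational LRU",
    0x0002: "batched clearing of the accessed bit in leaf PTEs",
    0x0004: "clearing of the accessed bit in non-leaf PTEs",
}


def decode_enabled(text):
    """List the features set in a value read from the ``enabled`` file."""
    mask = int(text.strip(), 16)  # accepts "0x0007" as well as "7"
    return [name for bit, name in LRU_GEN_FEATURES.items() if mask & bit]


def _command(op, required, optional):
    """Join fields with spaces. Optional fields are positional, so a later
    one may only be given if every earlier one is given too."""
    fields = [op, *required]
    gap = False
    for value in optional:
        if value is None:
            gap = True
        elif gap:
            raise ValueError("an earlier optional field is missing")
        else:
            fields.append(value)
    return " ".join(str(field) for field in fields)


def aging_command(memcg_id, node_id, max_gen, can_swap=None, full_scan=None):
    """Build ``+ memcg_id node_id max_gen [can_swap [full_scan]]``."""
    return _command("+", (memcg_id, node_id, max_gen), (can_swap, full_scan))


def eviction_command(memcg_id, node_id, min_gen, swappiness=None,
                     nr_to_reclaim=None):
    """Build ``- memcg_id node_id min_gen [swappiness [nr_to_reclaim]]``."""
    return _command("-", (memcg_id, node_id, min_gen),
                    (swappiness, nr_to_reclaim))
```

The returned string would then be written to ``/sys/kernel/debug/lru_gen``.
For example, aging_command(1, 0, 3) yields "+ 1 0 3", where the IDs are made
up for illustration.
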
diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index 44365c4574a3..b48434300226 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -25,6 +25,7 @@  algorithms.  If you are looking for advice on simply allocating memory, see the
    ksm
    memory-model
    mmu_notifier
+   multigen_lru
    numa
    overcommit-accounting
    page_migration
diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
new file mode 100644
index 000000000000..42a277b4e74b
--- /dev/null
+++ b/Documentation/vm/multigen_lru.rst
@@ -0,0 +1,152 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigenerational LRU
+=====================
+
+Design overview
+===============
+The design objectives are:
+
+* Good representation of access recency
+* Try to profit from spatial locality
+* Fast paths to make obvious choices
+* Simple self-correcting heuristics
+
+The representation of access recency is at the core of all LRU
+implementations. In the multigenerational LRU, each generation
+represents a group of pages with similar access recency (a timestamp).
+Generations establish a common frame of reference and therefore help
+make better choices, e.g., between different memcgs on a computer or
+different computers in a data center (for job scheduling).
+
+Exploiting spatial locality improves the efficiency when gathering the
+accessed bit. An rmap walk targets a single page and doesn't try to
+profit from discovering a young PTE. A page table walk can sweep all
+the young PTEs in an address space, but its search space can be too
+large to make a profit. The key is to optimize both methods and use
+them in combination.
+
+Fast paths reduce code complexity and runtime overhead. Unmapped pages
+don't require TLB flushes; clean pages don't require writeback. These
+facts are only helpful when other conditions, e.g., access recency,
+are similar. With generations as a common frame of reference,
+additional factors stand out. But obvious choices might not be good
+choices; thus self-correction is required.
+
+The benefits of simple self-correcting heuristics are self-evident.
+Again, with generations as a common frame of reference, this becomes
+attainable. Specifically, pages in the same generation are categorized
+based on additional factors, and a feedback loop statistically
+compares the refault percentages across those categories and infers
+which of them are better choices.
+
+The protection of hot pages and the selection of cold pages are based
+on page access channels and patterns. There are two access channels:
+
+* Accesses through page tables
+* Accesses through file descriptors
+
+The protection of the former channel is by design stronger because:
+
+1. The uncertainty in determining the access patterns of the former
+   channel is higher due to the approximation of the accessed bit.
+2. The cost of evicting the former channel is higher due to the TLB
+   flushes required and the likelihood of encountering the dirty bit.
+3. The penalty of underprotecting the former channel is higher because
+   applications usually don't prepare themselves for major page faults
+   like they do for blocked I/O. E.g., GUI applications commonly use
+   dedicated I/O threads to avoid blocking the rendering threads.
+
+There are also two access patterns:
+
+* Accesses exhibiting temporal locality
+* Accesses not exhibiting temporal locality
+
+For the reasons listed above, the former channel is assumed to follow
+the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is
+present, and the latter channel is assumed to follow the latter
+pattern unless outlying refaults have been observed.
+
+Workflow overview
+=================
+Evictable pages are divided into multiple generations for each
+``lruvec``. The youngest generation number is stored in
+``lrugen->max_seq`` for both anon and file types as they are aged on
+an equal footing. The oldest generation numbers are stored in
+``lrugen->min_seq[]`` separately for anon and file types as clean
+file pages can be evicted regardless of swap constraints. These three
+variables are monotonically increasing.
+
+Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)``
+bits in order to fit into the gen counter in ``folio->flags``. Each
+truncated generation number is an index to ``lrugen->lists[]``. The
+sliding window technique is used to track at least ``MIN_NR_GENS`` and
+at most ``MAX_NR_GENS`` generations. The gen counter stores
+``(seq%MAX_NR_GENS)+1`` while a page is on one of ``lrugen->lists[]``;
+otherwise it stores zero.
+
+Each generation is divided into multiple tiers. Tiers represent
+different ranges of numbers of accesses through file descriptors.
+A page accessed ``N`` times through file descriptors is in tier
+``order_base_2(N)``. In contrast to moving across generations which
+requires the LRU lock, moving across tiers only requires operations on
+``folio->flags`` and therefore has a negligible cost. A feedback loop
+modeled after the PID controller monitors refaults over all the tiers
+from anon and file types and decides which tiers from which types to
+evict or promote.
+
+There are two conceptually independent processes (as in the
+manufacturing process): the aging and the eviction. They form a
+closed-loop system, i.e., the page reclaim.
+
+Aging
+-----
+The aging produces young generations. Given an ``lruvec``, it
+increments ``max_seq`` when ``max_seq-min_seq+1`` approaches
+``MIN_NR_GENS``. The aging promotes hot pages to the youngest
+generation when it finds them accessed through page tables; the
+demotion of cold pages happens consequently when it increments
+``max_seq``. The aging uses page table walks and rmap walks to find
+young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list``
+and calls ``walk_page_range()`` with each ``mm_struct`` on this list
+to scan PTEs. On finding a young PTE, it clears the accessed bit and
+updates the gen counter of the page mapped by this PTE to
+``(max_seq%MAX_NR_GENS)+1``. After each iteration of this list, it
+increments ``max_seq``. For the latter, when the eviction walks the
+rmap and finds a young PTE, the aging scans the adjacent PTEs and
+follows the same steps.
+
+Eviction
+--------
+The eviction consumes old generations. Given an ``lruvec``, it
+increments ``min_seq`` when ``lrugen->lists[]`` indexed by
+``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to
+evict from, it first compares ``min_seq[]`` to select the older type.
+If they are equal, it selects the type whose first tier has a lower
+refault percentage. The first tier contains single-use unmapped clean
+pages, which are the best bet. The eviction sorts a page according to
+the gen counter if the aging has found this page accessed through page
+tables and updated the gen counter. It also promotes a page to the
+next generation, i.e., ``min_seq+1`` rather than ``max_seq``, if this
+page was accessed multiple times through file descriptors and the
+feedback loop has detected outlying refaults from the tier this page
+is in, using the first tier as a baseline.
+
+Summary
+-------
+The multigenerational LRU can be disassembled into the following
+components:
+
+* Generations
+* Page table walks
+* Rmap walks
+* Bloom filters
+* PID controller
+
+Between the aging and the eviction (processes), the latter drives the
+former by the sliding window over generations. Within the aging, rmap
+walks drive page table walks by inserting hot dense page tables into the
+Bloom filters. Within the eviction, the PID controller uses refaults
+as the feedback to turn on or off the eviction of certain types and
+tiers.
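
The arithmetic in the workflow overview above can be checked with a small
Python sketch. It assumes MAX_NR_GENS = 4 purely for illustration, and it
reduces the eviction's type selection to the two rules stated in the Eviction
section; the function names are hypothetical, and the kernel code may weigh
further inputs that this sketch ignores.

```python
MAX_NR_GENS = 4  # assumed value, for illustration only


def order_base_2(n):
    """Smallest k such that 2**k >= n (0 for n <= 1)."""
    return (n - 1).bit_length() if n > 1 else 0


# Width of the gen counter in folio->flags, per the doc above.
GEN_BITS = order_base_2(MAX_NR_GENS + 1)


def gen_counter(seq):
    """Value stored while a page is on one of lrugen->lists[];
    zero is reserved for pages that are on none of them."""
    return seq % MAX_NR_GENS + 1


def tier(nr_fd_accesses):
    """Tier of a page accessed N times through file descriptors."""
    return order_base_2(nr_fd_accesses)


def nr_gens(max_seq, min_seq):
    """Number of generations in the sliding window."""
    return max_seq - min_seq + 1


def select_evict_type(min_seq, first_tier_refault_pct):
    """Simplified type selection: the type with the older (smaller)
    min_seq wins; on a tie, the type whose first tier has the lower
    refault percentage wins. Both arguments are dicts keyed by
    "anon" and "file"."""
    if min_seq["anon"] != min_seq["file"]:
        return min(min_seq, key=min_seq.get)
    return min(first_tier_refault_pct, key=first_tier_refault_pct.get)
```

With this assumed MAX_NR_GENS, the gen counter takes values 0 through 4 and
therefore fits in three bits.
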