Message ID | 20220104202227.2903605-9-yuzhao@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | Multigenerational LRU Framework |
Hi, On Tue, Jan 04, 2022 at 01:22:27PM -0700, Yu Zhao wrote: > Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch. > > Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention. > Compared with the size-based approach, e.g., [1], this time-based > approach has the following advantages: > 1) It's easier to configure because it's agnostic to applications and > memory sizes. > 2) It's more reliable because it's directly wired to the OOM killer. > > Add /sys/kernel/debug/lru_gen for working set estimation and proactive > reclaim. Compared with the page table-based approach and the PFN-based > approach, e.g., mm/damon/[vp]addr.c, this lruvec-based approach has > the following advantages: > 1) It offers better choices because it's aware of memcgs, NUMA nodes, > shared mappings and unmapped page cache. > 2) It's more scalable because it's O(nr_hot_evictable_pages), whereas > the PFN-based approach is O(nr_total_pages). > > Add /sys/kernel/debug/lru_gen_full for debugging. > > [1] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/ > > Signed-off-by: Yu Zhao <yuzhao@google.com> > Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> > --- > Documentation/vm/index.rst | 1 + > Documentation/vm/multigen_lru.rst | 62 +++++ The description of user visible interfaces should go to Documentation/admin-guide/mm Documentation/vm/multigen_lru.rst should have contained design description and the implementation details and it would be great to actually have such document. > include/linux/nodemask.h | 1 + > mm/vmscan.c | 415 ++++++++++++++++++++++++++++++ > 4 files changed, 479 insertions(+) > create mode 100644 Documentation/vm/multigen_lru.rst > > diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst > index 6f5ffef4b716..f25e755b4ff4 100644 > --- a/Documentation/vm/index.rst > +++ b/Documentation/vm/index.rst > @@ -38,3 +38,4 @@ algorithms. If you are looking for advice on simply allocating memory, see the > unevictable-lru > z3fold > zsmalloc > + multigen_lru > diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst > new file mode 100644 > index 000000000000..6f9e0181348b > --- /dev/null > +++ b/Documentation/vm/multigen_lru.rst > @@ -0,0 +1,62 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > +===================== > +Multigenerational LRU > +===================== > + > +Quick start > +=========== > +Runtime configurations > +---------------------- > +:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the > + feature wasn't enabled by default. Required for what? This sentence seem to lack context. Maybe add an overview what is Multigenerational LRU so that users will have an idea what these knobs control. > + > +Recipes > +======= Some more context here will be also helpful. > +Personal computers > +------------------ > +:Thrashing prevention: Write ``N`` to > + ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of > + ``N`` milliseconds from getting evicted. The OOM killer is invoked if > + this working set can't be kept in memory. Based on the average human > + detectable lag (~100ms), ``N=1000`` usually eliminates intolerable > + lags due to thrashing. Larger values like ``N=3000`` make lags less > + noticeable at the cost of more OOM kills. > + > +Data centers > +------------ > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following > + format: > + :: > + > + memcg memcg_id memcg_path > + node node_id > + min_gen birth_time anon_size file_size > + ... 
> + max_gen birth_time anon_size file_size > + > + ``min_gen`` is the oldest generation number and ``max_gen`` is the > + youngest generation number. ``birth_time`` is in milliseconds. > + ``anon_size`` and ``file_size`` are in pages. And what does oldest and youngest generations mean from the user perspective? > + > + This file also accepts commands in the following subsections. > + Multiple command lines are supported, so does concatenation with > + delimiters ``,`` and ``;``. > + > + ``/sys/kernel/debug/lru_gen_full`` contains additional stats for > + debugging. > + > +:Working set estimation: Write ``+ memcg_id node_id max_gen > + [can_swap [full_scan]]`` to ``/sys/kernel/debug/lru_gen`` to trigger > + the aging. It scans PTEs for accessed pages and promotes them to the > + youngest generation ``max_gen``. Then it creates a new generation > + ``max_gen+1``. Set ``can_swap`` to 1 to scan for accessed anon pages > + when swap is off. Set ``full_scan`` to 0 to reduce the overhead as > + well as the coverage when scanning PTEs. > + > +:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness > + [nr_to_reclaim]]`` to ``/sys/kernel/debug/lru_gen`` to trigger the > + eviction. It evicts generations less than or equal to ``min_gen``. > + ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and > + ``max_gen-1`` aren't fully aged and therefore can't be evicted. Use > + ``nr_to_reclaim`` to limit the number of pages to evict. ...
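To make the quoted debugfs commands concrete, here is a minimal userspace sketch; the memcg ID, node ID and generation numbers below are placeholders that would normally be read back from /sys/kernel/debug/lru_gen first, not values taken from the patch.

/*
 * Minimal sketch of driving the debugfs interface described above.
 * The memcg ID (3), node ID (0) and generation numbers (26/24) are
 * placeholders; real values would be parsed from the file first.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int lru_gen_cmd(const char *cmd)
{
        int fd = open("/sys/kernel/debug/lru_gen", O_WRONLY);
        ssize_t ret;

        if (fd < 0)
                return -1;
        ret = write(fd, cmd, strlen(cmd));
        close(fd);
        return ret < 0 ? -1 : 0;
}

int main(void)
{
        /* '+ memcg_id node_id max_gen [can_swap [full_scan]]': trigger aging */
        if (lru_gen_cmd("+ 3 0 26 1 1"))
                perror("aging");

        /* '- memcg_id node_id min_gen [swappiness [nr_to_reclaim]]': trigger eviction */
        if (lru_gen_cmd("- 3 0 24 100 1000"))
                perror("eviction");

        return 0;
}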
On Mon, Jan 10, 2022 at 12:27:19PM +0200, Mike Rapoport wrote: > Hi, > > On Tue, Jan 04, 2022 at 01:22:27PM -0700, Yu Zhao wrote: > > Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch. > > > > Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention. > > Compared with the size-based approach, e.g., [1], this time-based > > approach has the following advantages: > > 1) It's easier to configure because it's agnostic to applications and > > memory sizes. > > 2) It's more reliable because it's directly wired to the OOM killer. > > > > Add /sys/kernel/debug/lru_gen for working set estimation and proactive > > reclaim. Compared with the page table-based approach and the PFN-based > > approach, e.g., mm/damon/[vp]addr.c, this lruvec-based approach has > > the following advantages: > > 1) It offers better choices because it's aware of memcgs, NUMA nodes, > > shared mappings and unmapped page cache. > > 2) It's more scalable because it's O(nr_hot_evictable_pages), whereas > > the PFN-based approach is O(nr_total_pages). > > > > Add /sys/kernel/debug/lru_gen_full for debugging. > > > > [1] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/ > > > > Signed-off-by: Yu Zhao <yuzhao@google.com> > > Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> > > --- > > Documentation/vm/index.rst | 1 + > > Documentation/vm/multigen_lru.rst | 62 +++++ > > The description of user visible interfaces should go to > Documentation/admin-guide/mm > > Documentation/vm/multigen_lru.rst should have contained design description > and the implementation details and it would be great to actually have such > document. Will do, thanks. > > include/linux/nodemask.h | 1 + > > mm/vmscan.c | 415 ++++++++++++++++++++++++++++++ > > 4 files changed, 479 insertions(+) > > create mode 100644 Documentation/vm/multigen_lru.rst > > > > diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst > > index 6f5ffef4b716..f25e755b4ff4 100644 > > --- a/Documentation/vm/index.rst > > +++ b/Documentation/vm/index.rst > > @@ -38,3 +38,4 @@ algorithms. If you are looking for advice on simply allocating memory, see the > > unevictable-lru > > z3fold > > zsmalloc > > + multigen_lru > > diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst > > new file mode 100644 > > index 000000000000..6f9e0181348b > > --- /dev/null > > +++ b/Documentation/vm/multigen_lru.rst > > @@ -0,0 +1,62 @@ > > +.. SPDX-License-Identifier: GPL-2.0 > > + > > +===================== > > +Multigenerational LRU > > +===================== > > + > > +Quick start > > +=========== > > +Runtime configurations > > +---------------------- > > +:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the > > + feature wasn't enabled by default. > > Required for what? This sentence seem to lack context. Maybe add an > overview what is Multigenerational LRU so that users will have an idea what > these knobs control. Apparently I left an important part of this quick start in the next patch, where Kconfig options are added. I'm wonder whether I should squash the next patch into this one. I always separate Kconfig changes and leave them in the last patch because it gives me peace of mind knowing it'll never give any auto bisectors a hard time. But I saw people not following this practice, and I'm also tempted to do so. Can anybody remind me whether it's considered a bad practice to have code changes and Kconfig changes in the same patch? > > + > > +Recipes > > +======= > > Some more context here will be also helpful. 
Will do. > > +Personal computers > > +------------------ > > +:Thrashing prevention: Write ``N`` to > > + ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of > > + ``N`` milliseconds from getting evicted. The OOM killer is invoked if > > + this working set can't be kept in memory. Based on the average human > > + detectable lag (~100ms), ``N=1000`` usually eliminates intolerable > > + lags due to thrashing. Larger values like ``N=3000`` make lags less > > + noticeable at the cost of more OOM kills. > > + > > +Data centers > > +------------ > > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following > > + format: > > + :: > > + > > + memcg memcg_id memcg_path > > + node node_id > > + min_gen birth_time anon_size file_size > > + ... > > + max_gen birth_time anon_size file_size > > + > > + ``min_gen`` is the oldest generation number and ``max_gen`` is the > > + youngest generation number. ``birth_time`` is in milliseconds. > > + ``anon_size`` and ``file_size`` are in pages. > > And what does oldest and youngest generations mean from the user > perspective? Good question. Will add more details in the next spin.
On Wed 12-01-22 01:35:52, Yu Zhao wrote:
[...]
> But I saw people not following this practice, and I'm also tempted to
> do so. Can anybody remind me whether it's considered a bad practice to
> have code changes and Kconfig changes in the same patch?

If you want to have the patch series bisectable then it is preferable to
add kconfig options early so that the code is enabled in the respective
steps. Sometimes that can be impractical though (e.g. when the feature is
incomplete at that stage).
On Wed, Jan 12, 2022 at 01:35:52AM -0700, Yu Zhao wrote: > On Mon, Jan 10, 2022 at 12:27:19PM +0200, Mike Rapoport wrote: > > Hi, > > > > On Tue, Jan 04, 2022 at 01:22:27PM -0700, Yu Zhao wrote: > > > Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch. > > > > > > Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention. > > > Compared with the size-based approach, e.g., [1], this time-based > > > approach has the following advantages: > > > 1) It's easier to configure because it's agnostic to applications and > > > memory sizes. > > > 2) It's more reliable because it's directly wired to the OOM killer. > > > > > > Add /sys/kernel/debug/lru_gen for working set estimation and proactive > > > reclaim. Compared with the page table-based approach and the PFN-based > > > approach, e.g., mm/damon/[vp]addr.c, this lruvec-based approach has > > > the following advantages: > > > 1) It offers better choices because it's aware of memcgs, NUMA nodes, > > > shared mappings and unmapped page cache. > > > 2) It's more scalable because it's O(nr_hot_evictable_pages), whereas > > > the PFN-based approach is O(nr_total_pages). > > > > > > Add /sys/kernel/debug/lru_gen_full for debugging. > > > > > > [1] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/ > > > > > > Signed-off-by: Yu Zhao <yuzhao@google.com> > > > Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> > > > --- > > > Documentation/vm/index.rst | 1 + > > > Documentation/vm/multigen_lru.rst | 62 +++++ > > > > The description of user visible interfaces should go to > > Documentation/admin-guide/mm > > > > Documentation/vm/multigen_lru.rst should have contained design description > > and the implementation details and it would be great to actually have such > > document. > > Will do, thanks. > > > > include/linux/nodemask.h | 1 + > > > mm/vmscan.c | 415 ++++++++++++++++++++++++++++++ > > > 4 files changed, 479 insertions(+) > > > create mode 100644 Documentation/vm/multigen_lru.rst > > > > > > diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst > > > index 6f5ffef4b716..f25e755b4ff4 100644 > > > --- a/Documentation/vm/index.rst > > > +++ b/Documentation/vm/index.rst > > > @@ -38,3 +38,4 @@ algorithms. If you are looking for advice on simply allocating memory, see the > > > unevictable-lru > > > z3fold > > > zsmalloc > > > + multigen_lru > > > diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst > > > new file mode 100644 > > > index 000000000000..6f9e0181348b > > > --- /dev/null > > > +++ b/Documentation/vm/multigen_lru.rst > > > @@ -0,0 +1,62 @@ > > > +.. SPDX-License-Identifier: GPL-2.0 > > > + > > > +===================== > > > +Multigenerational LRU > > > +===================== > > > + > > > +Quick start > > > +=========== > > > +Runtime configurations > > > +---------------------- > > > +:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the > > > + feature wasn't enabled by default. > > > > Required for what? This sentence seem to lack context. Maybe add an > > overview what is Multigenerational LRU so that users will have an idea what > > these knobs control. > > Apparently I left an important part of this quick start in the next > patch, where Kconfig options are added. I'm wonder whether I should > squash the next patch into this one. I think documentation deserves a separate patch.
On Wed, Jan 12, 2022 at 05:45:40PM +0200, Mike Rapoport wrote: > On Wed, Jan 12, 2022 at 01:35:52AM -0700, Yu Zhao wrote: > > On Mon, Jan 10, 2022 at 12:27:19PM +0200, Mike Rapoport wrote: > > > Hi, > > > > > > On Tue, Jan 04, 2022 at 01:22:27PM -0700, Yu Zhao wrote: > > > > Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch. > > > > > > > > Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention. > > > > Compared with the size-based approach, e.g., [1], this time-based > > > > approach has the following advantages: > > > > 1) It's easier to configure because it's agnostic to applications and > > > > memory sizes. > > > > 2) It's more reliable because it's directly wired to the OOM killer. > > > > > > > > Add /sys/kernel/debug/lru_gen for working set estimation and proactive > > > > reclaim. Compared with the page table-based approach and the PFN-based > > > > approach, e.g., mm/damon/[vp]addr.c, this lruvec-based approach has > > > > the following advantages: > > > > 1) It offers better choices because it's aware of memcgs, NUMA nodes, > > > > shared mappings and unmapped page cache. > > > > 2) It's more scalable because it's O(nr_hot_evictable_pages), whereas > > > > the PFN-based approach is O(nr_total_pages). > > > > > > > > Add /sys/kernel/debug/lru_gen_full for debugging. > > > > > > > > [1] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/ > > > > > > > > Signed-off-by: Yu Zhao <yuzhao@google.com> > > > > Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> > > > > --- > > > > Documentation/vm/index.rst | 1 + > > > > Documentation/vm/multigen_lru.rst | 62 +++++ > > > > > > The description of user visible interfaces should go to > > > Documentation/admin-guide/mm > > > > > > Documentation/vm/multigen_lru.rst should have contained design description > > > and the implementation details and it would be great to actually have such > > > document. > > > > Will do, thanks. > > > > > > include/linux/nodemask.h | 1 + > > > > mm/vmscan.c | 415 ++++++++++++++++++++++++++++++ > > > > 4 files changed, 479 insertions(+) > > > > create mode 100644 Documentation/vm/multigen_lru.rst > > > > > > > > diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst > > > > index 6f5ffef4b716..f25e755b4ff4 100644 > > > > --- a/Documentation/vm/index.rst > > > > +++ b/Documentation/vm/index.rst > > > > @@ -38,3 +38,4 @@ algorithms. If you are looking for advice on simply allocating memory, see the > > > > unevictable-lru > > > > z3fold > > > > zsmalloc > > > > + multigen_lru > > > > diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst > > > > new file mode 100644 > > > > index 000000000000..6f9e0181348b > > > > --- /dev/null > > > > +++ b/Documentation/vm/multigen_lru.rst > > > > @@ -0,0 +1,62 @@ > > > > +.. SPDX-License-Identifier: GPL-2.0 > > > > + > > > > +===================== > > > > +Multigenerational LRU > > > > +===================== > > > > + > > > > +Quick start > > > > +=========== > > > > +Runtime configurations > > > > +---------------------- > > > > +:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the > > > > + feature wasn't enabled by default. > > > > > > Required for what? This sentence seem to lack context. Maybe add an > > > overview what is Multigenerational LRU so that users will have an idea what > > > these knobs control. > > > > Apparently I left an important part of this quick start in the next > > patch, where Kconfig options are added. 
> > I'm wonder whether I should squash the next patch into this one.
>
> I think documentation deserves a separate patch.

Will do.
Yu Zhao <yuzhao@google.com> writes:
> Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch.
Got the below lockdep warning while using the above kill/enable switch
[ 84.252952] ======================================================
[ 84.253012] WARNING: possible circular locking dependency detected
[ 84.253074] 5.16.0-rc8-16204-g1cdcf1120b31 #511 Not tainted
[ 84.253135] ------------------------------------------------------
[ 84.253194] bash/2862 is trying to acquire lock:
[ 84.253243] c0000000021ff740 (cgroup_mutex){+.+.}-{3:3}, at: store_enable+0x80/0x1510
[ 84.253340]
but task is already holding lock:
[ 84.253410] c000000002221348 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x30/0x50
[ 84.253503]
which lock already depends on the new lock.
[ 84.253608]
the existing dependency chain (in reverse order) is:
[ 84.253693]
-> #2 (mem_hotplug_lock){++++}-{0:0}:
[ 84.253768] lock_acquire+0x134/0x4a0
[ 84.253821] percpu_down_write+0x80/0x1c0
[ 84.253872] try_online_node+0x40/0x90
[ 84.253924] cpu_up+0x7c/0x160
[ 84.253976] bringup_nonboot_cpus+0xc4/0x120
[ 84.254027] smp_init+0x48/0xd4
[ 84.254079] kernel_init_freeable+0x274/0x45c
[ 84.254134] kernel_init+0x44/0x194
[ 84.254188] ret_from_kernel_thread+0x5c/0x64
[ 84.254241]
-> #1 (cpu_hotplug_lock){++++}-{0:0}:
[ 84.254321] lock_acquire+0x134/0x4a0
[ 84.254373] cpus_read_lock+0x6c/0x180
[ 84.254426] static_key_disable+0x24/0x50
[ 84.254477] rebind_subsystems+0x3b0/0x5a0
[ 84.254528] cgroup_setup_root+0x24c/0x530
[ 84.254581] cgroup1_get_tree+0x7d8/0xb80
[ 84.254638] vfs_get_tree+0x48/0x150
[ 84.254695] path_mount+0x8b8/0xd20
[ 84.254752] do_mount+0xb8/0xe0
[ 84.254808] sys_mount+0x250/0x390
[ 84.254863] system_call_exception+0x15c/0x2b0
[ 84.254932] system_call_common+0xec/0x250
[ 84.254989]
-> #0 (cgroup_mutex){+.+.}-{3:3}:
[ 84.255072] check_prev_add+0x180/0x1050
[ 84.255129] __lock_acquire+0x17b8/0x25c0
[ 84.255186] lock_acquire+0x134/0x4a0
[ 84.255243] __mutex_lock+0xdc/0xa90
[ 84.255300] store_enable+0x80/0x1510
[ 84.255356] kobj_attr_store+0x2c/0x50
[ 84.255413] sysfs_kf_write+0x6c/0xb0
[ 84.255471] kernfs_fop_write_iter+0x1bc/0x2b0
[ 84.255539] new_sync_write+0x130/0x1d0
[ 84.255594] vfs_write+0x2cc/0x4c0
[ 84.255645] ksys_write+0x84/0x140
[ 84.255699] system_call_exception+0x15c/0x2b0
[ 84.255771] system_call_common+0xec/0x250
[ 84.255829]
other info that might help us debug this:
[ 84.255933] Chain exists of:
cgroup_mutex --> cpu_hotplug_lock --> mem_hotplug_lock
[ 84.256070] Possible unsafe locking scenario:
[ 84.256149]        CPU0                    CPU1
[ 84.256201]        ----                    ----
[ 84.256255]   lock(mem_hotplug_lock);
[ 84.256311]                                lock(cpu_hotplug_lock);
[ 84.256380]                                lock(mem_hotplug_lock);
[ 84.256448]   lock(cgroup_mutex);
[ 84.256491]
*** DEADLOCK ***
[ 84.256571] 5 locks held by bash/2862:
[ 84.256626] #0: c00000002043d460 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x84/0x140
[ 84.256728] #1: c00000004bafc888 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x178/0x2b0
[ 84.256830] #2: c000000020b993b8 (kn->active#207){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x184/0x2b0
[ 84.256942] #3: c0000000020e5cd0 (cpu_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x20/0x50
[ 84.257045] #4: c000000002221348 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x30/0x50
[ 84.257152]
stack backtrace:
[ 84.257220] CPU: 107 PID: 2862 Comm: bash Not tainted 5.16.0-rc8-16204-g1cdcf1120b31 #511
[ 84.257309] Call Trace:
[ 84.257346] [c000000040d5b4a0] [c000000000a89f94] dump_stack_lvl+0x98/0xe0 (unreliable)
[ 84.257438] [c000000040d5b4e0] [c000000000267244] print_circular_bug.isra.0+0x3b4/0x3e0
[ 84.257528] [c000000040d5b580] [c0000000002673e0] check_noncircular+0x170/0x1a0
[ 84.257605] [c000000040d5b650] [c000000000268be0] check_prev_add+0x180/0x1050
[ 84.257683] [c000000040d5b710] [c00000000026ca48] __lock_acquire+0x17b8/0x25c0
[ 84.257760] [c000000040d5b840] [c00000000026e4c4] lock_acquire+0x134/0x4a0
[ 84.257837] [c000000040d5b940] [c00000000148a53c] __mutex_lock+0xdc/0xa90
[ 84.257914] [c000000040d5ba60] [c0000000004d5080] store_enable+0x80/0x1510
[ 84.257989] [c000000040d5bbc0] [c000000000a9286c] kobj_attr_store+0x2c/0x50
[ 84.258066] [c000000040d5bbe0] [c000000000752c4c] sysfs_kf_write+0x6c/0xb0
[ 84.258143] [c000000040d5bc20] [c000000000750fcc] kernfs_fop_write_iter+0x1bc/0x2b0
[ 84.258219] [c000000040d5bc70] [c000000000615df0] new_sync_write+0x130/0x1d0
[ 84.258295] [c000000040d5bd10] [c00000000061997c] vfs_write+0x2cc/0x4c0
[ 84.258373] [c000000040d5bd60] [c000000000619d54] ksys_write+0x84/0x140
[ 84.258450] [c000000040d5bdb0] [c00000000002c91c] system_call_exception+0x15c/0x2b0
[ 84.258528] [c000000040d5be10] [c00000000000c64c] system_call_common+0xec/0x250
[ 84.258604] --- interrupt: c00 at 0x79c551e76554
[ 84.258658] NIP: 000079c551e76554 LR: 000079c551de2674 CTR: 0000000000000000
[ 84.258732] REGS: c000000040d5be80 TRAP: 0c00 Not tainted (5.16.0-rc8-16204-g1cdcf1120b31)
[ 84.258817] MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 28422428 XER: 00000000
[ 84.258931] IRQMASK: 0
GPR00: 0000000000000004 00007fffc8e9a320 000079c551f77100 0000000000000001
GPR04: 0000017190973cc0 0000000000000002 0000000000000010 0000017190973cc0
GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 000079c5520ab1d0 0000017190943590 000001718749b738
GPR16: 00000171873b0ae0 0000000000000000 0000000020000000 0000017190973a60
GPR20: 0000000000000000 0000000000000001 0000017187443ca0 00007fffc8e9a514
GPR24: 00007fffc8e9a510 000001718749b0d0 000079c551f719d8 000079c551f72308
GPR28: 0000000000000002 000079c551f717e8 0000017190973cc0 0000000000000002
[ 84.259600] NIP [000079c551e76554] 0x79c551e76554
[ 84.259651] LR [000079c551de2674] 0x79c551de2674
[ 84.259701] --- interrupt: c00
On Thu, Jan 13, 2022 at 04:01:31PM +0530, Aneesh Kumar K.V wrote: > Yu Zhao <yuzhao@google.com> writes: > > > Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch. > > > Got the below lockdep warning while using the above kill/enable switch > > > [ 84.252952] ====================================================== > [ 84.253012] WARNING: possible circular locking dependency detected > [ 84.253074] 5.16.0-rc8-16204-g1cdcf1120b31 #511 Not tainted > [ 84.253135] ------------------------------------------------------ > [ 84.253194] bash/2862 is trying to acquire lock: > [ 84.253243] c0000000021ff740 (cgroup_mutex){+.+.}-{3:3}, at: store_enable+0x80/0x1510 > [ 84.253340] > but task is already holding lock: > [ 84.253410] c000000002221348 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x30/0x50 > [ 84.253503] > which lock already depends on the new lock. > > [ 84.255933] Chain exists of: > cgroup_mutex --> cpu_hotplug_lock --> mem_hotplug_lock Thanks. Will reverse the order between mem_hotplug_lock and cgroup_mutex in the next spin.
Yu Zhao <yuzhao@google.com> writes: > On Thu, Jan 13, 2022 at 04:01:31PM +0530, Aneesh Kumar K.V wrote: >> Yu Zhao <yuzhao@google.com> writes: >> >> > Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch. >> >> >> Got the below lockdep warning while using the above kill/enable switch >> >> >> [ 84.252952] ====================================================== >> [ 84.253012] WARNING: possible circular locking dependency detected >> [ 84.253074] 5.16.0-rc8-16204-g1cdcf1120b31 #511 Not tainted >> [ 84.253135] ------------------------------------------------------ >> [ 84.253194] bash/2862 is trying to acquire lock: >> [ 84.253243] c0000000021ff740 (cgroup_mutex){+.+.}-{3:3}, at: store_enable+0x80/0x1510 >> [ 84.253340] >> but task is already holding lock: >> [ 84.253410] c000000002221348 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x30/0x50 >> [ 84.253503] >> which lock already depends on the new lock. >> >> [ 84.255933] Chain exists of: >> cgroup_mutex --> cpu_hotplug_lock --> mem_hotplug_lock > > Thanks. Will reverse the order between mem_hotplug_lock and > cgroup_mutex in the next spin. It also needs the unlocked variant of static_key_enable/disable. [ 71.204397][ T2819] bash/2819 is trying to acquire lock: [ 71.204446][ T2819] c0000000020e5cd0 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_disable+0x24/0x50 [ 71.204542][ T2819] [ 71.204542][ T2819] but task is already holding lock: [ 71.204613][ T2819] c0000000020e5cd0 (cpu_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x20/0x50 [ 71.204710][ T2819] [ 71.204710][ T2819] other info that might help us debug this: [ 71.204787][ T2819] Possible unsafe locking scenario: [ 71.204787][ T2819] [ 71.204860][ T2819] CPU0 [ 71.204901][ T2819] ---- [ 71.204941][ T2819] lock(cpu_hotplug_lock); [ 71.204998][ T2819] lock(cpu_hotplug_lock); [ 71.205053][ T2819] [ 71.205053][ T2819] *** DEADLOCK *** -aneesh
On Fri, Jan 14, 2022 at 10:50:05AM +0530, Aneesh Kumar K.V wrote: > Yu Zhao <yuzhao@google.com> writes: > > On Thu, Jan 13, 2022 at 04:01:31PM +0530, Aneesh Kumar K.V wrote: > >> Yu Zhao <yuzhao@google.com> writes: > >> > >> > Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch. > >> > >> Got the below lockdep warning while using the above kill/enable switch > >> > >> > >> [ 84.252952] ====================================================== > >> [ 84.253012] WARNING: possible circular locking dependency detected > >> [ 84.253074] 5.16.0-rc8-16204-g1cdcf1120b31 #511 Not tainted > >> [ 84.253135] ------------------------------------------------------ > >> [ 84.253194] bash/2862 is trying to acquire lock: > >> [ 84.253243] c0000000021ff740 (cgroup_mutex){+.+.}-{3:3}, at: store_enable+0x80/0x1510 > >> [ 84.253340] > >> but task is already holding lock: > >> [ 84.253410] c000000002221348 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x30/0x50 > >> [ 84.253503] > >> which lock already depends on the new lock. > >> > >> [ 84.255933] Chain exists of: > >> cgroup_mutex --> cpu_hotplug_lock --> mem_hotplug_lock > > > > Thanks. Will reverse the order between mem_hotplug_lock and > > cgroup_mutex in the next spin. > > It also needs the unlocked variant of static_key_enable/disable. Right. This is what I have at the moment. Tested with QEMU memory hotplug. Can you please give it try too? Thanks. cgroup_lock() cpus_read_lock() get_online_mems() if (enable) static_branch_enable_cpuslocked() else static_branch_disable_cpuslocked() put_online_mems() cpus_read_unlock() cgroup_unlock()
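For reference, a minimal sketch (not the actual patch) of lru_gen_change_state() with that locking order applied, using the _cpuslocked static-branch helpers since cpu_hotplug_lock is already held; 'lru_gen_key' is a placeholder name, not the symbol used in the series.

#include <linux/cgroup.h>
#include <linux/cpu.h>
#include <linux/jump_label.h>
#include <linux/memory_hotplug.h>

static DEFINE_STATIC_KEY_TRUE(lru_gen_key);    /* placeholder key name */

static void lru_gen_change_state(bool enable)
{
        cgroup_lock();          /* cgroup_mutex first ... */
        cpus_read_lock();       /* ... then cpu_hotplug_lock ... */
        get_online_mems();      /* ... then mem_hotplug_lock */

        if (enable)
                static_branch_enable_cpuslocked(&lru_gen_key);
        else
                static_branch_disable_cpuslocked(&lru_gen_key);

        /* ... walk memcgs/lruvecs and convert their lists here ... */

        put_online_mems();
        cpus_read_unlock();
        cgroup_unlock();
}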
diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index 6f5ffef4b716..f25e755b4ff4 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -38,3 +38,4 @@ algorithms. If you are looking for advice on simply allocating memory, see the unevictable-lru z3fold zsmalloc + multigen_lru diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst new file mode 100644 index 000000000000..6f9e0181348b --- /dev/null +++ b/Documentation/vm/multigen_lru.rst @@ -0,0 +1,62 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================== +Multigenerational LRU +===================== + +Quick start +=========== +Runtime configurations +---------------------- +:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the + feature wasn't enabled by default. + +Recipes +======= +Personal computers +------------------ +:Thrashing prevention: Write ``N`` to + ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of + ``N`` milliseconds from getting evicted. The OOM killer is invoked if + this working set can't be kept in memory. Based on the average human + detectable lag (~100ms), ``N=1000`` usually eliminates intolerable + lags due to thrashing. Larger values like ``N=3000`` make lags less + noticeable at the cost of more OOM kills. + +Data centers +------------ +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following + format: + :: + + memcg memcg_id memcg_path + node node_id + min_gen birth_time anon_size file_size + ... + max_gen birth_time anon_size file_size + + ``min_gen`` is the oldest generation number and ``max_gen`` is the + youngest generation number. ``birth_time`` is in milliseconds. + ``anon_size`` and ``file_size`` are in pages. + + This file also accepts commands in the following subsections. + Multiple command lines are supported, so does concatenation with + delimiters ``,`` and ``;``. + + ``/sys/kernel/debug/lru_gen_full`` contains additional stats for + debugging. + +:Working set estimation: Write ``+ memcg_id node_id max_gen + [can_swap [full_scan]]`` to ``/sys/kernel/debug/lru_gen`` to trigger + the aging. It scans PTEs for accessed pages and promotes them to the + youngest generation ``max_gen``. Then it creates a new generation + ``max_gen+1``. Set ``can_swap`` to 1 to scan for accessed anon pages + when swap is off. Set ``full_scan`` to 0 to reduce the overhead as + well as the coverage when scanning PTEs. + +:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness + [nr_to_reclaim]]`` to ``/sys/kernel/debug/lru_gen`` to trigger the + eviction. It evicts generations less than or equal to ``min_gen``. + ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and + ``max_gen-1`` aren't fully aged and therefore can't be evicted. Use + ``nr_to_reclaim`` to limit the number of pages to evict. 
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h index 567c3ddba2c4..90840c459abc 100644 --- a/include/linux/nodemask.h +++ b/include/linux/nodemask.h @@ -486,6 +486,7 @@ static inline int num_node_state(enum node_states state) #define first_online_node 0 #define first_memory_node 0 #define next_online_node(nid) (MAX_NUMNODES) +#define next_memory_node(nid) (MAX_NUMNODES) #define nr_node_ids 1U #define nr_online_nodes 1U diff --git a/mm/vmscan.c b/mm/vmscan.c index b232f711dbdb..20f45ff849fc 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -53,6 +53,8 @@ #include <linux/memory.h> #include <linux/pagewalk.h> #include <linux/shmem_fs.h> +#include <linux/ctype.h> +#include <linux/debugfs.h> #include <asm/tlbflush.h> #include <asm/div64.h> @@ -5021,6 +5023,413 @@ static void lru_gen_change_state(bool enable) mem_hotplug_done(); } +/****************************************************************************** + * sysfs interface + ******************************************************************************/ + +static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, char *buf) +{ + return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl))); +} + +static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t len) +{ + unsigned int msecs; + + if (kstrtouint(buf, 10, &msecs)) + return -EINVAL; + + WRITE_ONCE(lru_gen_min_ttl, msecs_to_jiffies(msecs)); + + return len; +} + +static struct kobj_attribute lru_gen_min_ttl_attr = __ATTR( + min_ttl_ms, 0644, show_min_ttl, store_min_ttl +); + +static ssize_t show_enable(struct kobject *kobj, struct kobj_attribute *attr, char *buf) +{ + return snprintf(buf, PAGE_SIZE, "%d\n", lru_gen_enabled()); +} + +static ssize_t store_enable(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t len) +{ + bool enable; + + if (kstrtobool(buf, &enable)) + return -EINVAL; + + lru_gen_change_state(enable); + + return len; +} + +static struct kobj_attribute lru_gen_enabled_attr = __ATTR( + enabled, 0644, show_enable, store_enable +); + +static struct attribute *lru_gen_attrs[] = { + &lru_gen_min_ttl_attr.attr, + &lru_gen_enabled_attr.attr, + NULL +}; + +static struct attribute_group lru_gen_attr_group = { + .name = "lru_gen", + .attrs = lru_gen_attrs, +}; + +/****************************************************************************** + * debugfs interface + ******************************************************************************/ + +static void *lru_gen_seq_start(struct seq_file *m, loff_t *pos) +{ + struct mem_cgroup *memcg; + loff_t nr_to_skip = *pos; + + m->private = kvmalloc(PATH_MAX, GFP_KERNEL); + if (!m->private) + return ERR_PTR(-ENOMEM); + + memcg = mem_cgroup_iter(NULL, NULL, NULL); + do { + int nid; + + for_each_node_state(nid, N_MEMORY) { + if (!nr_to_skip--) + return get_lruvec(memcg, nid); + } + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); + + return NULL; +} + +static void lru_gen_seq_stop(struct seq_file *m, void *v) +{ + if (!IS_ERR_OR_NULL(v)) + mem_cgroup_iter_break(NULL, lruvec_memcg(v)); + + kvfree(m->private); + m->private = NULL; +} + +static void *lru_gen_seq_next(struct seq_file *m, void *v, loff_t *pos) +{ + int nid = lruvec_pgdat(v)->node_id; + struct mem_cgroup *memcg = lruvec_memcg(v); + + ++*pos; + + nid = next_memory_node(nid); + if (nid == MAX_NUMNODES) { + memcg = mem_cgroup_iter(NULL, memcg, NULL); + if (!memcg) + return NULL; + + nid = first_memory_node; + } + + return get_lruvec(memcg, nid); +} + 
+static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec, + unsigned long max_seq, unsigned long *min_seq, + unsigned long seq) +{ + int i; + int type, tier; + int hist = lru_hist_from_seq(seq); + struct lru_gen_struct *lrugen = &lruvec->lrugen; + + for (tier = 0; tier < MAX_NR_TIERS; tier++) { + seq_printf(m, " %10d", tier); + for (type = 0; type < ANON_AND_FILE; type++) { + unsigned long n[3] = {}; + + if (seq == max_seq) { + n[0] = READ_ONCE(lrugen->avg_refaulted[type][tier]); + n[1] = READ_ONCE(lrugen->avg_total[type][tier]); + + seq_printf(m, " %10luR %10luT %10lu ", n[0], n[1], n[2]); + } else if (seq == min_seq[type] || NR_HIST_GENS > 1) { + n[0] = atomic_long_read(&lrugen->refaulted[hist][type][tier]); + n[1] = atomic_long_read(&lrugen->evicted[hist][type][tier]); + if (tier) + n[2] = READ_ONCE(lrugen->promoted[hist][type][tier - 1]); + + seq_printf(m, " %10lur %10lue %10lup", n[0], n[1], n[2]); + } else + seq_puts(m, " 0 0 0 "); + } + seq_putc(m, '\n'); + } + + seq_puts(m, " "); + for (i = 0; i < NR_MM_STATS; i++) { + if (seq == max_seq && NR_HIST_GENS == 1) + seq_printf(m, " %10lu%c", READ_ONCE(lruvec->mm_state.stats[hist][i]), + toupper(MM_STAT_CODES[i])); + else if (seq != max_seq && NR_HIST_GENS > 1) + seq_printf(m, " %10lu%c", READ_ONCE(lruvec->mm_state.stats[hist][i]), + MM_STAT_CODES[i]); + else + seq_puts(m, " 0 "); + } + seq_putc(m, '\n'); +} + +static int lru_gen_seq_show(struct seq_file *m, void *v) +{ + unsigned long seq; + bool full = !debugfs_real_fops(m->file)->write; + struct lruvec *lruvec = v; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + int nid = lruvec_pgdat(lruvec)->node_id; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + DEFINE_MAX_SEQ(lruvec); + DEFINE_MIN_SEQ(lruvec); + + if (nid == first_memory_node) { + const char *path = memcg ? m->private : ""; + +#ifdef CONFIG_MEMCG + if (memcg) + cgroup_path(memcg->css.cgroup, m->private, PATH_MAX); +#endif + seq_printf(m, "memcg %5hu %s\n", mem_cgroup_id(memcg), path); + } + + seq_printf(m, " node %5d\n", nid); + + if (!full) + seq = min_seq[0]; + else if (max_seq >= MAX_NR_GENS) + seq = max_seq - MAX_NR_GENS + 1; + else + seq = 0; + + for (; seq <= max_seq; seq++) { + int gen, type, zone; + unsigned int msecs; + + gen = lru_gen_from_seq(seq); + msecs = jiffies_to_msecs(jiffies - READ_ONCE(lrugen->timestamps[gen])); + + seq_printf(m, " %10lu %10u", seq, msecs); + + for (type = 0; type < ANON_AND_FILE; type++) { + long size = 0; + + if (seq < min_seq[type]) { + seq_puts(m, " -0 "); + continue; + } + + for (zone = 0; zone < MAX_NR_ZONES; zone++) + size += READ_ONCE(lrugen->nr_pages[gen][type][zone]); + + seq_printf(m, " %10lu ", max(size, 0L)); + } + + seq_putc(m, '\n'); + + if (full) + lru_gen_seq_show_full(m, lruvec, max_seq, min_seq, seq); + } + + return 0; +} + +static const struct seq_operations lru_gen_seq_ops = { + .start = lru_gen_seq_start, + .stop = lru_gen_seq_stop, + .next = lru_gen_seq_next, + .show = lru_gen_seq_show, +}; + +static int run_aging(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc, + bool can_swap, bool full_scan) +{ + DEFINE_MAX_SEQ(lruvec); + + if (seq == max_seq) + try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, full_scan); + + return seq > max_seq ? 
-EINVAL : 0; +} + +static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc, + int swappiness, unsigned long nr_to_reclaim) +{ + struct blk_plug plug; + int err = -EINTR; + DEFINE_MAX_SEQ(lruvec); + + if (max_seq < seq + MIN_NR_GENS) + return -EINVAL; + + sc->nr_reclaimed = 0; + + blk_start_plug(&plug); + + while (!signal_pending(current)) { + DEFINE_MIN_SEQ(lruvec); + + if (seq < min_seq[!swappiness] || sc->nr_reclaimed >= nr_to_reclaim || + !evict_folios(lruvec, sc, swappiness, NULL)) { + err = 0; + break; + } + + cond_resched(); + } + + blk_finish_plug(&plug); + + return err; +} + +static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq, + struct scan_control *sc, int swappiness, unsigned long opt) +{ + struct lruvec *lruvec; + int err = -EINVAL; + struct mem_cgroup *memcg = NULL; + + if (!mem_cgroup_disabled()) { + rcu_read_lock(); + memcg = mem_cgroup_from_id(memcg_id); +#ifdef CONFIG_MEMCG + if (memcg && !css_tryget(&memcg->css)) + memcg = NULL; +#endif + rcu_read_unlock(); + + if (!memcg) + goto done; + } + if (memcg_id != mem_cgroup_id(memcg)) + goto done; + + if (nid < 0 || nid >= MAX_NUMNODES || !node_state(nid, N_MEMORY)) + goto done; + + lruvec = get_lruvec(memcg, nid); + + if (swappiness < 0) + swappiness = get_swappiness(memcg); + else if (swappiness > 200) + goto done; + + switch (cmd) { + case '+': + err = run_aging(lruvec, seq, sc, swappiness, opt); + break; + case '-': + err = run_eviction(lruvec, seq, sc, swappiness, opt); + break; + } +done: + mem_cgroup_put(memcg); + + return err; +} + +static ssize_t lru_gen_seq_write(struct file *file, const char __user *src, + size_t len, loff_t *pos) +{ + void *buf; + char *cur, *next; + unsigned int flags; + int err = 0; + struct scan_control sc = { + .may_writepage = 1, + .may_unmap = 1, + .may_swap = 1, + .reclaim_idx = MAX_NR_ZONES - 1, + .gfp_mask = GFP_KERNEL, + }; + + buf = kvmalloc(len + 1, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + if (copy_from_user(buf, src, len)) { + kvfree(buf); + return -EFAULT; + } + + next = buf; + next[len] = '\0'; + + sc.reclaim_state.mm_walk = alloc_mm_walk(); + if (!sc.reclaim_state.mm_walk) { + kvfree(buf); + return -ENOMEM; + } + + flags = memalloc_noreclaim_save(); + set_task_reclaim_state(current, &sc.reclaim_state); + + while ((cur = strsep(&next, ",;\n"))) { + int n; + int end; + char cmd; + unsigned int memcg_id; + unsigned int nid; + unsigned long seq; + unsigned int swappiness = -1; + unsigned long opt = -1; + + cur = skip_spaces(cur); + if (!*cur) + continue; + + n = sscanf(cur, "%c %u %u %lu %n %u %n %lu %n", &cmd, &memcg_id, &nid, + &seq, &end, &swappiness, &end, &opt, &end); + if (n < 4 || cur[end]) { + err = -EINVAL; + break; + } + + err = run_cmd(cmd, memcg_id, nid, seq, &sc, swappiness, opt); + if (err) + break; + } + + set_task_reclaim_state(current, NULL); + memalloc_noreclaim_restore(flags); + + free_mm_walk(sc.reclaim_state.mm_walk); + kvfree(buf); + + return err ? 
: len; +} + +static int lru_gen_seq_open(struct inode *inode, struct file *file) +{ + return seq_open(file, &lru_gen_seq_ops); +} + +static const struct file_operations lru_gen_rw_fops = { + .open = lru_gen_seq_open, + .read = seq_read, + .write = lru_gen_seq_write, + .llseek = seq_lseek, + .release = seq_release, +}; + +static const struct file_operations lru_gen_ro_fops = { + .open = lru_gen_seq_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, +}; + /****************************************************************************** * initialization ******************************************************************************/ @@ -5087,6 +5496,12 @@ static int __init init_lru_gen(void) BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS); BUILD_BUG_ON(sizeof(MM_STAT_CODES) != NR_MM_STATS + 1); + if (sysfs_create_group(mm_kobj, &lru_gen_attr_group)) + pr_err("lru_gen: failed to create sysfs group\n"); + + debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops); + debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops); + return 0; }; late_initcall(init_lru_gen);