diff mbox series

zswap: improve memory.zswap.writeback inheritance

Message ID 20240926225531.700742-1-intelfx@intelfx.name (mailing list archive)
State New
Headers show
Series zswap: improve memory.zswap.writeback inheritance | expand

Commit Message

Ivan Shapovalov Sept. 26, 2024, 10:55 p.m. UTC
Improve the inheritance behavior of the `memory.zswap.writeback` cgroup
attribute introduced during the 6.11 cycle. Specifically, in 6.11 we
walk the parent cgroups until we find a _disabled_ writeback, which does
not allow the user to selectively enable zswap writeback while having it
disabled by default.

Instead, introduce a third value for the `memory.zswap.writeback` cgroup
attribute meaning "follow the parent", and use it as the default value
for all cgroups. Additionally, introduce a `zswap.writeback_disable`
module parameter which is used as the "parent" of the root cgroup,
thus making it the global default.

Reads from `memory.zswap.writeback` cgroup attribute will be coerced to
0 or 1 to maintain backwards compatilibity. However, it is possible to
write -1 to that attribute to make the cgroup follow the parent again.

Fixes: e39925734909 ("mm/memcontrol: respect zswap.writeback setting from parent cg too")
Fixes: 501a06fe8e4c ("zswap: memcontrol: implement zswap writeback disabling")
Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name>
---
 Documentation/admin-guide/cgroup-v2.rst | 17 +++++++++++------
 Documentation/admin-guide/mm/zswap.rst  |  9 ++++++++-
 include/linux/memcontrol.h              |  3 ++-
 include/linux/zswap.h                   |  6 ++++++
 mm/memcontrol.c                         | 24 +++++++++++++++++-------
 mm/zswap.c                              |  9 +++++++++
 6 files changed, 53 insertions(+), 15 deletions(-)

Comments

Nhat Pham Sept. 26, 2024, 11:12 p.m. UTC | #1
On Thu, Sep 26, 2024 at 3:55 PM Ivan Shapovalov <intelfx@intelfx.name> wrote:
>
> Improve the inheritance behavior of the `memory.zswap.writeback` cgroup
> attribute introduced during the 6.11 cycle. Specifically, in 6.11 we
> walk the parent cgroups until we find a _disabled_ writeback, which does
> not allow the user to selectively enable zswap writeback while having it
> disabled by default.

Is there an actual need for this? This is a theoretical use case I
thought of (and raised), but I don't think anybody actually wants
this...?

Besides, most people who want this can just:

1. Enable zswap writeback on root cgroup (and all non-leaf cgroups).

2. Disable zswap writeback on leaf cgroups on creation by default.

3. Selectively enable zswap writeback for the leaf cgroups.

All of this is quite doable in userspace. It's not even _that_ racy -
just do this before adding tasks etc. to the cgroup?
Ivan Shapovalov Sept. 26, 2024, 11:40 p.m. UTC | #2
On 2024-09-26 at 16:12 -0700, Nhat Pham wrote:
> On Thu, Sep 26, 2024 at 3:55 PM Ivan Shapovalov <intelfx@intelfx.name> wrote:
> > 
> > Improve the inheritance behavior of the `memory.zswap.writeback` cgroup
> > attribute introduced during the 6.11 cycle. Specifically, in 6.11 we
> > walk the parent cgroups until we find a _disabled_ writeback, which does
> > not allow the user to selectively enable zswap writeback while having it
> > disabled by default.
> 
> Is there an actual need for this? This is a theoretical use case I
> thought of (and raised), but I don't think anybody actually wants
> this...?

This is of course anecdata, but yes, it does solve a real use-case that
I'm having right now, as well as a bunch of my colleagues who recently
complained to me (in private) about pretty much the same use-case.

The use-case is following: it turns out that it could be beneficial for
desktop systems to run with a pretty high swappiness and zswap
writeback globally disabled, to nudge the system to compress cold pages
but not actually write them back to the disk (which would happen pretty
aggressively if it was not disabled) to reduce I/O and latencies.
However, under this setup it is sometimes needed to re-enable zswap
writeback for specific memory-heavy applications that allocate a lot of
cold pages, to "allow" the kernel to actually swap those programs out.

> 
> Besides, most people who want this can just:
> 
> 1. Enable zswap writeback on root cgroup (and all non-leaf cgroups).
> 
> 2. Disable zswap writeback on leaf cgroups on creation by default.
> 
> 3. Selectively enable zswap writeback for the leaf cgroups.
> 
> All of this is quite doable in userspace. It's not even _that_ racy -
> just do this before adding tasks etc. to the cgroup?

Well, yes, it is technically doable from userspace, just like it was
technically doable prior to commit e39925734909 to have userspace
explicitly control the entire hierarchy of writeback settings.
However it _is_ pretty painful, and the flow you described would
essentially negate any benefits of that patch (it would require
userspace to, once again, manage the entire hierarchy explicitly
without any help from the kernel).

IOWs, per the above commit I concluded that reducing pain levels for
userspace implementations was an acceptable design goal :-)

Thanks,
Nhat Pham Sept. 27, 2024, 1:52 a.m. UTC | #3
On Thu, Sep 26, 2024 at 4:40 PM Ivan Shapovalov <intelfx@intelfx.name> wrote:
>
> On 2024-09-26 at 16:12 -0700, Nhat Pham wrote:
> > On Thu, Sep 26, 2024 at 3:55 PM Ivan Shapovalov <intelfx@intelfx.name> wrote:
> > >
> > > Improve the inheritance behavior of the `memory.zswap.writeback` cgroup
> > > attribute introduced during the 6.11 cycle. Specifically, in 6.11 we
> > > walk the parent cgroups until we find a _disabled_ writeback, which does
> > > not allow the user to selectively enable zswap writeback while having it
> > > disabled by default.
> >
> > Is there an actual need for this? This is a theoretical use case I
> > thought of (and raised), but I don't think anybody actually wants
> > this...?
>
> This is of course anecdata, but yes, it does solve a real use-case that
> I'm having right now, as well as a bunch of my colleagues who recently
> complained to me (in private) about pretty much the same use-case.
>
> The use-case is following: it turns out that it could be beneficial for
> desktop systems to run with a pretty high swappiness and zswap
> writeback globally disabled, to nudge the system to compress cold pages
> but not actually write them back to the disk (which would happen pretty
> aggressively if it was not disabled) to reduce I/O and latencies.
> However, under this setup it is sometimes needed to re-enable zswap
> writeback for specific memory-heavy applications that allocate a lot of
> cold pages, to "allow" the kernel to actually swap those programs out.

Out of pure curiosity (and to make sure I fully grasp the problem at
hand), is this the capacity-based zswap writeback (i.e the one
triggered by limits), or the memory pressure based dynamic shrinker?

If you disable the latter and only rely on the former, it should not
"write pages aggressively". Limits are rarely reached (IIUC, zswap.max
are frequently used as binary knobs, and global limits are hard to
reach), so usually pages that are going to disk swap are just pages
zswap reject (i.e mostly just pages that fail to compress). This
should be a very small category. You will still see "swap" usage due
to a quirk in zswap architecture (which I'm working to fix), but there
should rarely be any IOs. So the setup itself is not super necessary.

If it's the latter then yeah I can kinda see the need for the setup.

>
> >
> > Besides, most people who want this can just:
> >
> > 1. Enable zswap writeback on root cgroup (and all non-leaf cgroups).
> >
> > 2. Disable zswap writeback on leaf cgroups on creation by default.
> >
> > 3. Selectively enable zswap writeback for the leaf cgroups.
> >
> > All of this is quite doable in userspace. It's not even _that_ racy -
> > just do this before adding tasks etc. to the cgroup?
>
> Well, yes, it is technically doable from userspace, just like it was
> technically doable prior to commit e39925734909 to have userspace
> explicitly control the entire hierarchy of writeback settings.
> However it _is_ pretty painful, and the flow you described would
> essentially negate any benefits of that patch (it would require
> userspace to, once again, manage the entire hierarchy explicitly
> without any help from the kernel).

I think it's a tad different. In the case of the commit e39925734909,
the hierarchical behavior of zswap.writeback knob is quite
semantically confusing, almost counter-intuitive (and does not conform
to the convention of other cgroup knobs, which use the most
restrictive limit - check out zswap.max limit checking for example).
That commit arguably fixes it for the "common" case (i.e you want the
hierarchical enforcement to hold for the most part). That's why there
were even conversations about cc-ing the stable mailing list for
backporting it to older kernels.

This is more of a "new use case" patch. It complicates the API, for
something readily doable in userspace - the kernel does not do
anything that userspace cannot achieve. So it should undergo stricter
scrutiny. :)

Anyway, Yosry, Johannes, how do you feel about this?
Nhat Pham Sept. 27, 2024, 2:28 a.m. UTC | #4
On Thu, Sep 26, 2024 at 3:55 PM Ivan Shapovalov <intelfx@intelfx.name> wrote:
>
> Improve the inheritance behavior of the `memory.zswap.writeback` cgroup
> attribute introduced during the 6.11 cycle. Specifically, in 6.11 we
> walk the parent cgroups until we find a _disabled_ writeback, which does
> not allow the user to selectively enable zswap writeback while having it
> disabled by default.
>
> Instead, introduce a third value for the `memory.zswap.writeback` cgroup
> attribute meaning "follow the parent", and use it as the default value
> for all cgroups. Additionally, introduce a `zswap.writeback_disable`
> module parameter which is used as the "parent" of the root cgroup,
> thus making it the global default.
>
> Reads from `memory.zswap.writeback` cgroup attribute will be coerced to
> 0 or 1 to maintain backwards compatilibity. However, it is possible to
> write -1 to that attribute to make the cgroup follow the parent again.
>
> Fixes: e39925734909 ("mm/memcontrol: respect zswap.writeback setting from parent cg too")
> Fixes: 501a06fe8e4c ("zswap: memcontrol: implement zswap writeback disabling")
> Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 17 +++++++++++------
>  Documentation/admin-guide/mm/zswap.rst  |  9 ++++++++-
>  include/linux/memcontrol.h              |  3 ++-
>  include/linux/zswap.h                   |  6 ++++++
>  mm/memcontrol.c                         | 24 +++++++++++++++++-------
>  mm/zswap.c                              |  9 +++++++++
>  6 files changed, 53 insertions(+), 15 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 95c18bc17083..eea580490679 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1717,10 +1717,12 @@ The following nested keys are defined.
>         entries fault back in or are written out to disk.
>
>    memory.zswap.writeback
> -       A read-write single value file. The default value is "1".
> -       Note that this setting is hierarchical, i.e. the writeback would be
> -       implicitly disabled for child cgroups if the upper hierarchy
> -       does so.
> +       A read-write single value file. The default is to follow the parent
> +       cgroup configuration, and the root cgroup follows the global
> +       ``zswap.writeback_enabled`` module parameter (which is 1 by default).
> +       Thus, this setting is hierarchical, i.e. the writeback setting for
> +       a child cgroup can be implicitly controlled at runtime by changing
> +       any parent value or the global module parameter.
>
>         When this is set to 0, all swapping attempts to swapping devices
>         are disabled. This included both zswap writebacks, and swapping due
> @@ -1729,8 +1731,11 @@ The following nested keys are defined.
>         reclaim inefficiency after disabling writeback (because the same
>         pages might be rejected again and again).
>
> -       Note that this is subtly different from setting memory.swap.max to
> -       0, as it still allows for pages to be written to the zswap pool.
> +       Note that this is different from setting memory.swap.max to 0,
> +       as it still allows for pages to be written to the zswap pool.
> +
> +       This can also be set to -1, which would make the cgroup (and its
> +       future children) follow the parent/global value again.
>

API-design-wise, this seems a bit confusing... Using the value -1 to
indicate the cgroup should follow ancestor is not quite semantically
meaningful. What does "follow the parent" here mean? Reading your
code, it seems to be "find the first non negative from this memcg to
root and just use it", but it's neither intuitive, nor deducable from
the documentation.

There are also ill-defined/poorly-defined behavior. What if we set -1
to all cgroups? Should writeback be enabled or disabled?

The implementation also contradicts the "writeback setting for a child
cgroup can be implicitly controlled at runtime by chaging any parent
value or the global module parameter" part. What happens if we have
the following sequence of zswap.writeback values, from root to current
memcg:

root.memcg.zswap.writeback = 0, 1, -1, -1, -1 = cur_memcg.zswap.writeback

Should we disable or enable cur_memcg's zswap.writeback? It's disabled
at root, but the first non-negative ancestral value is 1, so this
cgroup will follow it (even though that ancestor itself has zswap
writeback disabled!!!)

I think we should separate the values itself from the decision to
check ancestors. The former should be indicated per-memcg, and the
latter decision should only come into play when the memcg itself does
prevent zswap.writeback. It might be less confusing to make the latter
decision globally - perhaps through a mount option, analogous to
memory_recursiveprot?
Michal Koutný Sept. 27, 2024, 3:01 p.m. UTC | #5
Hello.

On Thu, Sep 26, 2024 at 07:28:08PM GMT, Nhat Pham <nphamcs@gmail.com> wrote:
> API-design-wise, this seems a bit confusing... Using the value -1 to
> indicate the cgroup should follow ancestor is not quite semantically
> meaningful.

What about assigning this semantic to an empty string ("")?
That would be the default behavior and also the value shown when reading
the file (to distinguish this for explicitly configured values).

(The weirdness of 0, 1, -1, -1, -1  would remain. Maybe switching this
via the mount option could satisfy any user. Admittedly, I tend to
confuse this knob with swap.max.)

Michal
Nhat Pham Sept. 27, 2024, 4:40 p.m. UTC | #6
On Fri, Sep 27, 2024 at 8:01 AM Michal Koutný <mkoutny@suse.com> wrote:
>
>
> What about assigning this semantic to an empty string ("")?
> That would be the default behavior and also the value shown when reading
> the file (to distinguish this for explicitly configured values).

Yeah that's better than -1, I agree. Still a bit confusing, but at
least the semantic is "we are not making a choice at the memcg".

>
> (The weirdness of 0, 1, -1, -1, -1  would remain. Maybe switching this
> via the mount option could satisfy any user. Admittedly, I tend to
> confuse this knob with swap.max.)

Yeah a mount option, or in general some sort of global knob (with
proper documentation) would be preferable.

And yeah, I hear you with the swap.max confusion. That's why I tried
to make it explicit in the documentation, because the difference is
subtle and can trip up users. Hopefully, when zswap and swap are
decoupled, users can conceptualize them as two separate tiers, and the
confusion will lessen...
diff mbox series

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 95c18bc17083..eea580490679 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1717,10 +1717,12 @@  The following nested keys are defined.
 	entries fault back in or are written out to disk.
 
   memory.zswap.writeback
-	A read-write single value file. The default value is "1".
-	Note that this setting is hierarchical, i.e. the writeback would be
-	implicitly disabled for child cgroups if the upper hierarchy
-	does so.
+	A read-write single value file. The default is to follow the parent
+	cgroup configuration, and the root cgroup follows the global
+	``zswap.writeback_enabled`` module parameter (which is 1 by default).
+	Thus, this setting is hierarchical, i.e. the writeback setting for
+	a child cgroup can be implicitly controlled at runtime by changing
+	any parent value or the global module parameter.
 
 	When this is set to 0, all swapping attempts to swapping devices
 	are disabled. This included both zswap writebacks, and swapping due
@@ -1729,8 +1731,11 @@  The following nested keys are defined.
 	reclaim inefficiency after disabling writeback (because the same
 	pages might be rejected again and again).
 
-	Note that this is subtly different from setting memory.swap.max to
-	0, as it still allows for pages to be written to the zswap pool.
+	Note that this is different from setting memory.swap.max to 0,
+	as it still allows for pages to be written to the zswap pool.
+
+	This can also be set to -1, which would make the cgroup (and its
+	future children) follow the parent/global value again.
 
   memory.pressure
 	A read-only nested-keyed file.
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 3598dcd7dbe7..20267b8893db 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -126,10 +126,17 @@  Setting this parameter to 100 will disable the hysteresis.
 
 Some users cannot tolerate the swapping that comes with zswap store failures
 and zswap writebacks. Swapping can be disabled entirely (without disabling
-zswap itself) on a cgroup-basis as follows::
+zswap itself) either globally, using the ``writeback_enabled`` sysfs attribute,
+or on a per-cgroup basis, e.g.::
+
+	echo 0 > /sys/module/zswap/parameters/writeback_enabled
 
 	echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback
 
+All cgroups follow (i.e. dynamically inherit) the parent configuration
+by default, and the root cgroup follows the module parameter (which can
+thus be considered the global default).
+
 Note that if the store failures are recurring (for e.g if the pages are
 incompressible), users can observe reclaim inefficiency after disabling
 writeback (because the same pages might be rejected again and again).
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0e5bf25d324f..ca0510057040 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -202,8 +202,9 @@  struct mem_cgroup {
 	/*
 	 * Prevent pages from this memcg from being written back from zswap to
 	 * swap, and from being swapped out on zswap store failures.
+	 * (< 0: follow the parent/global default)
 	 */
-	bool zswap_writeback;
+	int zswap_writeback;
 #endif
 
 	/* vmpressure notifications */
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 6cecb4a4f68b..7d121fdb3521 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -37,6 +37,7 @@  void zswap_lruvec_state_init(struct lruvec *lruvec);
 void zswap_folio_swapin(struct folio *folio);
 bool zswap_is_enabled(void);
 bool zswap_never_enabled(void);
+bool zswap_writeback_is_enabled(void);
 #else
 
 struct zswap_lruvec_state {};
@@ -71,6 +72,11 @@  static inline bool zswap_never_enabled(void)
 	return true;
 }
 
+static inline bool zswap_writeback_is_enabled(void)
+{
+	return true;
+}
+
 #endif
 
 #endif /* _LINUX_ZSWAP_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d563fb515766..1e0aca42e5a7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3613,7 +3613,7 @@  mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	memcg1_soft_limit_reset(memcg);
 #ifdef CONFIG_ZSWAP
 	memcg->zswap_max = PAGE_COUNTER_MAX;
-	WRITE_ONCE(memcg->zswap_writeback, true);
+	WRITE_ONCE(memcg->zswap_writeback, -1);
 #endif
 	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
 	if (parent) {
@@ -5318,15 +5318,25 @@  void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
 
 bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
 {
+	int memcg_zswap_writeback;
+
 	/* if zswap is disabled, do not block pages going to the swapping device */
 	if (!zswap_is_enabled())
 		return true;
 
-	for (; memcg; memcg = parent_mem_cgroup(memcg))
-		if (!READ_ONCE(memcg->zswap_writeback))
-			return false;
+	/*
+	 * -1 means "follow the parent" (root cgroup follows the global default).
+	 * Walk cgroups until we find something, otherwise return the global default.
+	 */
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		memcg_zswap_writeback = READ_ONCE(memcg->zswap_writeback);
+		if (memcg_zswap_writeback >= 0)
+			goto found;
+	}
+	return zswap_writeback_is_enabled();
 
-	return true;
+found:
+	return !!memcg_zswap_writeback;
 }
 
 static u64 zswap_current_read(struct cgroup_subsys_state *css,
@@ -5365,7 +5375,7 @@  static int zswap_writeback_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	seq_printf(m, "%d\n", READ_ONCE(memcg->zswap_writeback));
+	seq_printf(m, "%d\n", mem_cgroup_zswap_writeback_enabled(memcg));
 	return 0;
 }
 
@@ -5379,7 +5389,7 @@  static ssize_t zswap_writeback_write(struct kernfs_open_file *of,
 	if (parse_ret)
 		return parse_ret;
 
-	if (zswap_writeback != 0 && zswap_writeback != 1)
+	if (zswap_writeback < -1 || zswap_writeback > 1)
 		return -EINVAL;
 
 	WRITE_ONCE(memcg->zswap_writeback, zswap_writeback);
diff --git a/mm/zswap.c b/mm/zswap.c
index adeaf9c97fde..724d8a02d61c 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -129,6 +129,10 @@  static bool zswap_shrinker_enabled = IS_ENABLED(
 		CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
 module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
 
+/* Enable/disable zswap writeback globally */
+static bool zswap_writeback_enabled = true;
+module_param_named(writeback_enabled, zswap_writeback_enabled, bool, 0644);
+
 bool zswap_is_enabled(void)
 {
 	return zswap_enabled;
@@ -139,6 +143,11 @@  bool zswap_never_enabled(void)
 	return !static_branch_maybe(CONFIG_ZSWAP_DEFAULT_ON, &zswap_ever_enabled);
 }
 
+bool zswap_writeback_is_enabled(void)
+{
+	return zswap_writeback_enabled;
+}
+
 /*********************************
 * data structures
 **********************************/