Message ID | Y5xASNe1x8cusiTx@dhcp22.suse.cz |
---|---|
State | New |
Series | Revert "mm: add nodes= arg to memory.reclaim" |
On Fri, Dec 16, 2022 at 1:54 AM Michal Hocko <mhocko@suse.com> wrote:
>
> Andrew,
> I have noticed that the patch made it into Linus tree already. Can we
> please revert it because the semantic is not really clear and we should
> really not create yet another user API maintenance problem. I am
> proposing to revert the nodemask extension for now before we grow any
> upstream users. Deeper in the email thread are some proposals how to
> move forward with that.

There are proposals, many of which have been rejected for not addressing
the motivating use cases, others that have been rejected by fellow
maintainers, and some that are still awaiting feedback. No, there is no
other clear-cut way forward for this use case right now. I have found the
merged approach by far the most agreeable so far.

> ---
> From 7c5285f1725d5abfcae5548ab0d73be9ceded2a1 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 16 Dec 2022 10:46:33 +0100
> Subject: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
>
> This reverts commit 12a5d3955227b0d7e04fb793ccceeb2a1dd275c5.
>
> Although it is recognized that a finer grained pro-active reclaim is
> something we need and want the semantic of this implementation is really
> ambiguous.
>
> From a follow up discussion it became clear that there are two essential
> usecases here. One is to use memory.reclaim to pro-actively reclaim
> memory and expectation is that the requested and reported amount of memory is
> uncharged from the memcg. Another usecase focuses on pro-active demotion
> when the memory is merely shuffled around to demotion targets while the
> overall charged memory stays unchanged.
>
> The current implementation considers demoted pages as reclaimed and that
> break both usecases.

I think you're making it sound like this specific patch broke both use
cases, and IMO that is not accurate. Commit 3f1509c57b1b ("Revert
"mm/vmscan: never demote for memcg reclaim"") has been in the tree for
around 7 months now, and that is the commit that enabled demotion in
memcg reclaim and implicitly counted demoted pages as reclaimed in memcg
reclaim, which is the source of the ambiguity -- not the patch that you
are reverting here.

The irony I find with this revert is that this patch actually removes
the ambiguity rather than exacerbating it. Currently, using
memory.reclaim _without_ the nodes= arg is ambiguous because demoted
pages count as reclaimed. On the other hand, using memory.reclaim
_with_ the nodes= arg is completely unambiguous: the kernel will
demote-only from top tier nodes and reclaim-only from bottom tier nodes.

> [1] has tried to address the reporting part but
> there are more issues with that summarized in [2] and follow up emails.
>

I am the one that put effort into resolving the ambiguity introduced
by commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
reclaim"") and proposed [1]. Reverting this patch does nothing to
resolve ambiguity that it did not introduce.

> Let's revert the nodemask based extension of the memcg pro-active
> reclaim for now until we settle with a more robust semantic.
>

I do not think we should revert this. It enables a couple of important
use cases for Google:

1. Enables us to specifically trigger proactive reclaim in a memcg on
a memory tiered system by specifying only the lower tiered nodes using
the nodes= arg.
2. Enables us to specifically trigger proactive demotion in a memcg on
a memory tiered system by specifying only the top tier nodes using the
nodes= arg.
Both use cases are broken with this revert, and no progress to resolve the ambiguity is made with this revert. I agree with Michal that there is ambiguity that has existed in the kernel for about 7 months now and is introduced by commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""), and I'm trying to fix this ambiguity in [1]. I think we should move forward in fixing the ambiguity through the review of the patch in [1] and not revert patches that enable useful use-cases and did not introduce the ambiguity. > [1] http://lkml.kernel.org/r/http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina@google.com Broken link. Actual link to my patch to fix the ambiguity: [1] https://lore.kernel.org/linux-mm/20221206023406.3182800-1-almasrymina@google.com/ > [2] http://lkml.kernel.org/r/Y5bsmpCyeryu3Zz1@dhcp22.suse.cz > Signed-off-by: Michal Hocko <mhocko@suse.com> > --- > Documentation/admin-guide/cgroup-v2.rst | 15 +++--- > include/linux/swap.h | 3 +- > mm/memcontrol.c | 67 +++++-------------------- > mm/vmscan.c | 4 +- > 4 files changed, 21 insertions(+), 68 deletions(-) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index c8ae7c897f14..74cec76be9f2 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1245,13 +1245,17 @@ PAGE_SIZE multiple when read back. > This is a simple interface to trigger memory reclaim in the > target cgroup. > > - This file accepts a string which contains the number of bytes to > - reclaim. > + This file accepts a single key, the number of bytes to reclaim. > + No nested keys are currently supported. > > Example:: > > echo "1G" > memory.reclaim > > + The interface can be later extended with nested keys to > + configure the reclaim behavior. For example, specify the > + type of memory to reclaim from (anon, file, ..). > + > Please note that the kernel can over or under reclaim from > the target cgroup. If less bytes are reclaimed than the > specified amount, -EAGAIN is returned. > @@ -1263,13 +1267,6 @@ PAGE_SIZE multiple when read back. > This means that the networking layer will not adapt based on > reclaim induced by memory.reclaim. > > - This file also allows the user to specify the nodes to reclaim from, > - via the 'nodes=' key, for example:: > - > - echo "1G nodes=0,1" > memory.reclaim > - > - The above instructs the kernel to reclaim memory from nodes 0,1. > - > memory.peak > A read-only single value file which exists on non-root > cgroups. 
> diff --git a/include/linux/swap.h b/include/linux/swap.h > index 2787b84eaf12..0ceed49516ad 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -418,8 +418,7 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > unsigned long nr_pages, > gfp_t gfp_mask, > - unsigned int reclaim_options, > - nodemask_t *nodemask); > + unsigned int reclaim_options); > extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem, > gfp_t gfp_mask, bool noswap, > pg_data_t *pgdat, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index ab457f0394ab..73afff8062f9 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -63,7 +63,6 @@ > #include <linux/resume_user_mode.h> > #include <linux/psi.h> > #include <linux/seq_buf.h> > -#include <linux/parser.h> > #include "internal.h" > #include <net/sock.h> > #include <net/ip.h> > @@ -2393,8 +2392,7 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg, > psi_memstall_enter(&pflags); > nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages, > gfp_mask, > - MEMCG_RECLAIM_MAY_SWAP, > - NULL); > + MEMCG_RECLAIM_MAY_SWAP); > psi_memstall_leave(&pflags); > } while ((memcg = parent_mem_cgroup(memcg)) && > !mem_cgroup_is_root(memcg)); > @@ -2685,8 +2683,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, > > psi_memstall_enter(&pflags); > nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages, > - gfp_mask, reclaim_options, > - NULL); > + gfp_mask, reclaim_options); > psi_memstall_leave(&pflags); > > if (mem_cgroup_margin(mem_over_limit) >= nr_pages) > @@ -3506,8 +3503,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg, > } > > if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, > - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, > - NULL)) { > + memsw ? 
0 : MEMCG_RECLAIM_MAY_SWAP)) { > ret = -EBUSY; > break; > } > @@ -3618,8 +3614,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg) > return -EINTR; > > if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, > - MEMCG_RECLAIM_MAY_SWAP, > - NULL)) > + MEMCG_RECLAIM_MAY_SWAP)) > nr_retries--; > } > > @@ -6429,8 +6424,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > } > > reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, > - NULL); > + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP); > > if (!reclaimed && !nr_retries--) > break; > @@ -6479,8 +6473,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, > > if (nr_reclaims) { > if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max, > - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, > - NULL)) > + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP)) > nr_reclaims--; > continue; > } > @@ -6603,54 +6596,21 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > return nbytes; > } > > -enum { > - MEMORY_RECLAIM_NODES = 0, > - MEMORY_RECLAIM_NULL, > -}; > - > -static const match_table_t if_tokens = { > - { MEMORY_RECLAIM_NODES, "nodes=%s" }, > - { MEMORY_RECLAIM_NULL, NULL }, > -}; > - > static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > size_t nbytes, loff_t off) > { > struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > unsigned int nr_retries = MAX_RECLAIM_RETRIES; > unsigned long nr_to_reclaim, nr_reclaimed = 0; > - unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP | > - MEMCG_RECLAIM_PROACTIVE; > - char *old_buf, *start; > - substring_t args[MAX_OPT_ARGS]; > - int token; > - char value[256]; > - nodemask_t nodemask = NODE_MASK_ALL; > - > - buf = strstrip(buf); > - > - old_buf = buf; > - nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE; > - if (buf == old_buf) > - return -EINVAL; > + unsigned int reclaim_options; > + int err; > > buf = strstrip(buf); > + err = page_counter_memparse(buf, "", &nr_to_reclaim); > + if (err) > + return err; > > - while ((start = strsep(&buf, " ")) != NULL) { > - if (!strlen(start)) > - continue; > - token = match_token(start, if_tokens, args); > - match_strlcpy(value, args, sizeof(value)); > - switch (token) { > - case MEMORY_RECLAIM_NODES: > - if (nodelist_parse(value, nodemask) < 0) > - return -EINVAL; > - break; > - default: > - return -EINVAL; > - } > - } > - > + reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE; > while (nr_reclaimed < nr_to_reclaim) { > unsigned long reclaimed; > > @@ -6667,8 +6627,7 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > > reclaimed = try_to_free_mem_cgroup_pages(memcg, > nr_to_reclaim - nr_reclaimed, > - GFP_KERNEL, reclaim_options, > - &nodemask); > + GFP_KERNEL, reclaim_options); > > if (!reclaimed && !nr_retries--) > return -EAGAIN; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index aba991c505f1..546540bc770a 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -6757,8 +6757,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, > unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > unsigned long nr_pages, > gfp_t gfp_mask, > - unsigned int reclaim_options, > - nodemask_t *nodemask) > + unsigned int reclaim_options) > { > unsigned long nr_reclaimed; > unsigned int noreclaim_flag; > @@ -6773,7 +6772,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > .may_unmap = 1, > .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP), > .proactive = !!(reclaim_options & 
MEMCG_RECLAIM_PROACTIVE), > - .nodemask = nodemask, > }; > /* > * Traverse the ZONELIST_FALLBACK zonelist of the current node to put > -- > 2.30.2 > > -- > Michal Hocko > SUSE Labs
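For reference, here is a minimal sketch of how the two use cases described
above map onto the nodes= extension that this patch reverts. The cgroup
path and the node numbering are made up for illustration; it assumes nodes
0-1 are top-tier DRAM and nodes 2-3 are slower demotion targets:

    # Hypothetical layout: nodes 0-1 = top-tier DRAM, nodes 2-3 = slow tier.
    CG=/sys/fs/cgroup/workload

    # Use case 1: proactive reclaim. Only lower-tier nodes are eligible,
    # so pages are actually freed and uncharged rather than demoted.
    echo "1G nodes=2,3" > "$CG/memory.reclaim"

    # Use case 2: proactive demotion. Only top-tier nodes are eligible,
    # so pages are moved to the demotion targets and the charge stays.
    echo "1G nodes=0,1" > "$CG/memory.reclaim"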
On Fri 16-12-22 04:02:12, Mina Almasry wrote:
> On Fri, Dec 16, 2022 at 1:54 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > Andrew,
> > I have noticed that the patch made it into Linus tree already. Can we
> > please revert it because the semantic is not really clear and we should
> > really not create yet another user API maintenance problem. I am
> > proposing to revert the nodemask extension for now before we grow any
> > upstream users. Deeper in the email thread are some proposals how to
> > move forward with that.
>
> There are proposals, many which have been rejected due to not
> addressing the motivating use cases and others that have been rejected
> by fellow maintainers, and some that are awaiting feedback. No, there
> is no other clear-cut way forward for this use case right now. I have
> found the merged approach by far the most agreeable so far.

There is a clear need for further discussion, and until then we do not
want to expose an interface and create dependencies that will inevitably
make it hard to change the semantic later.

> > From 7c5285f1725d5abfcae5548ab0d73be9ceded2a1 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Fri, 16 Dec 2022 10:46:33 +0100
> > Subject: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
> >
> > This reverts commit 12a5d3955227b0d7e04fb793ccceeb2a1dd275c5.
> >
> > Although it is recognized that a finer grained pro-active reclaim is
> > something we need and want the semantic of this implementation is really
> > ambiguous.
> >
> > From a follow up discussion it became clear that there are two essential
> > usecases here. One is to use memory.reclaim to pro-actively reclaim
> > memory and expectation is that the requested and reported amount of memory is
> > uncharged from the memcg. Another usecase focuses on pro-active demotion
> > when the memory is merely shuffled around to demotion targets while the
> > overall charged memory stays unchanged.
> >
> > The current implementation considers demoted pages as reclaimed and that
> > break both usecases.
>
> I think you're making it sound like this specific patch broke both use
> cases, and IMO that is not accurate. commit 3f1509c57b1b ("Revert
> "mm/vmscan: never demote for memcg reclaim"") has been in the tree for
> around 7 months now and that is the commit that enabled demotion in
> memcg reclaim, and implicitly counted demoted pages as reclaimed in
> memcg reclaim, which is the source of the ambiguity. Not the patch
> that you are reverting here.
>
> The irony I find with this revert is that this patch actually removes
> the ambiguity and does not exacerbate it. Currently using
> memory.reclaim _without_ the nodes= arg is ambiguous because demoted
> pages count as reclaimed. On the other hand using memory.reclaim
> _with_ the nodes= arg is completely unambiguous: the kernel will
> demote-only from top tier nodes and reclaim-only from bottom tier
> nodes.

Yes, demoted pages are indeed counted as reclaimed, but that is not a
major issue because, from the external point of view, charges are
getting reclaimed. It is the nodes specification which makes the latent
problem much more obvious.

>
> > [1] has tried to address the reporting part but
> > there are more issues with that summarized in [2] and follow up emails.
> >
>
> I am the one that put effort into resolving the ambiguity introduced
> by commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim"") and proposed [1]. Reverting this patch does nothing to
> resolve ambiguity that it did not introduce.
>
> > Let's revert the nodemask based extension of the memcg pro-active
> > reclaim for now until we settle with a more robust semantic.
> >
>
> I do not think we should revert this. It enables a couple of important
> use cases for Google:
>
> 1. Enables us to specifically trigger proactive reclaim in a memcg on
> a memory tiered system by specifying only the lower tiered nodes using
> the nodes= arg.
> 2. Enabled us to specifically trigger proactive demotion in a memcg on
> a memory tiered system by specifying only the top tier nodes using the
> nodes= arg.

That is clear, and the aim of the revert is not to disallow those
usecases. We just need a clear and futureproof interface for that.
Changing the semantic after the fact is a no-go, hence the revert.

>
> Both use cases are broken with this revert, and no progress to resolve
> the ambiguity is made with this revert.

There cannot be any regression with the revert now because the code
hasn't been upstream. So let's remove the interface until we can agree
on the exact semantic and build the interface from there.

On 12/16/22 16:54, Michal Hocko wrote: > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index c8ae7c897f14..74cec76be9f2 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1245,13 +1245,17 @@ PAGE_SIZE multiple when read back. > This is a simple interface to trigger memory reclaim in the > target cgroup. > > - This file accepts a string which contains the number of bytes to > - reclaim. > + This file accepts a single key, the number of bytes to reclaim. > + No nested keys are currently supported. > > Example:: > > echo "1G" > memory.reclaim > > + The interface can be later extended with nested keys to > + configure the reclaim behavior. For example, specify the > + type of memory to reclaim from (anon, file, ..). > + > Please note that the kernel can over or under reclaim from > the target cgroup. If less bytes are reclaimed than the > specified amount, -EAGAIN is returned. > @@ -1263,13 +1267,6 @@ PAGE_SIZE multiple when read back. > This means that the networking layer will not adapt based on > reclaim induced by memory.reclaim. > > - This file also allows the user to specify the nodes to reclaim from, > - via the 'nodes=' key, for example:: > - > - echo "1G nodes=0,1" > memory.reclaim > - > - The above instructs the kernel to reclaim memory from nodes 0,1. > - > memory.peak > A read-only single value file which exists on non-root > cgroups. Ah! I forgot to add my Reviewed-by: tag when the original patch was submitted. However, I was Cc'ed the revert presumably due to Cc: tag in the original. In any case, for the documentation part: Acked-by: Bagas Sanjaya <bagasdotme@gmail.com>
On Fri, 16 Dec 2022 10:54:16 +0100 Michal Hocko <mhocko@suse.com> wrote:

> I have noticed that the patch made it into Linus tree already. Can we
> please revert it because the semantic is not really clear and we should
> really not create yet another user API maintenance problem.

Well dang. I was waiting for the discussion to converge, blissfully
unaware that the thing was sitting in mm-stable :( I guess the

Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>

fooled me.

I think it's a bit premature to revert at this stage. Possibly we can
get to the desired end state by modifying the existing code. Possibly
we can get to the desired end state by reverting this and by adding
something new.

If we can't get to the desired end state at all then yes, I'll send
Linus a revert of this patch later in this -rc cycle.

On Fri 16-12-22 10:18:20, Andrew Morton wrote:
> On Fri, 16 Dec 2022 10:54:16 +0100 Michal Hocko <mhocko@suse.com> wrote:
>
> > I have noticed that the patch made it into Linus tree already. Can we
> > please revert it because the semantic is not really clear and we should
> > really not create yet another user API maintenance problem.
>
> Well dang. I was waiting for the discussion to converge, blissfully
> unaware that the thing was sitting in mm-stable :( I guess the
>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Acked-by: Shakeel Butt <shakeelb@google.com>
> Acked-by: Muchun Song <songmuchun@bytedance.com>
>
> fooled me.

Hmm, as pointed out in
http://lkml.kernel.org/r/Y5bsmpCyeryu3Zz1@dhcp22.suse.cz I've failed to
see through all the consequences of the implementation. So my bad here
for adding my ack before fully understanding all the implications.

> I think it's a bit premature to revert at this stage. Possibly we can
> get to the desired end state by modifying the existing code. Possibly
> we can get to the desired end state by reverting this and by adding
> something new.

Sure, if we can converge to a proper implementation during the rc phase
then it would be OK. I cannot speak for others, but at least for me the
upcoming 2 weeks will be mostly offline so I cannot really contribute
much. A revert would be much easier from the coordination POV IMHO.

Also I do not think there is any strong reason to rush this in. I do not
really see any major problems to have this extension in 6.2.

On Sat, 17 Dec 2022 10:57:06 +0100 Michal Hocko <mhocko@suse.com> wrote:

> > I think it's a bit premature to revert at this stage. Possibly we can
> > get to the desired end state by modifying the existing code. Possibly
> > we can get to the desired end state by reverting this and by adding
> > something new.
>
> Sure if we can converge to a proper implementation during the rc phase
> then it would be ok. I cannot speak for others but at least for me
> upcoming 2 weeks would be mostly offline so I cannot really contribute
> much. A revert would be much more easier from the coordination POV IMHO.
>
> Also I do not think there is any strong reason to rush this in. I do not
> really see any major problems to have this extension in 6.2

I'll queue the revert in mm-unstable with a plan to merge it upstream
around the -rc5 timeframe if there hasn't been resolution.

Please check Mina's issues with this revert's current changelog and
perhaps send along a revised one.

[Sorry I was offline]

On Mon 19-12-22 14:42:52, Andrew Morton wrote:
> On Sat, 17 Dec 2022 10:57:06 +0100 Michal Hocko <mhocko@suse.com> wrote:
>
> > > I think it's a bit premature to revert at this stage. Possibly we can
> > > get to the desired end state by modifying the existing code. Possibly
> > > we can get to the desired end state by reverting this and by adding
> > > something new.
> >
> > Sure if we can converge to a proper implementation during the rc phase
> > then it would be ok. I cannot speak for others but at least for me
> > upcoming 2 weeks would be mostly offline so I cannot really contribute
> > much. A revert would be much more easier from the coordination POV IMHO.
> >
> > Also I do not think there is any strong reason to rush this in. I do not
> > really see any major problems to have this extension in 6.2
>
> I'll queue the revert in mm-unstable with a plan to merge it upstream
> around the -rc5 timeframe if there hasn't been resolution.

Thanks! I do not really think we need to rush node based reclaim; it is
better to have a reasonable and more future-proof interface.

> Please check Mina's issues with this revert's current changelog and
> perhaps send along a revised one.

Yes, I believe I have addressed all the feedback, but I am of course open
to altering the wording. The biggest concern from Mina IIRC was that the
nr_reclaimed reporting is a pre-existing problem, and I agree with that.
The thing is that this doesn't matter without a node specification,
because the memory gets reclaimed even if the reported value is
over-accounted. With a nodemask specification the value becomes bogus if
no demotion target nodes are specified, because potentially no memory
gets reclaimed at all while success is still reported. Mina has tried to
address that but I am not convinced the fix is actually future proof.
This really requires more discussion.
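To make the reporting concern above concrete, a small sketch under assumed
conditions (the cgroup path and node roles are made up; node 0 is a
top-tier node whose demotion target lies outside the given nodemask):

    CG=/sys/fs/cgroup/workload

    cat "$CG/memory.current"                  # e.g. prints 2147483648
    echo "1G nodes=0" > "$CG/memory.reclaim"  # the write succeeds ...
    cat "$CG/memory.current"                  # ... yet may print nearly the
                                              # same value: the 1G was demoted,
                                              # not uncharged, but it was still
                                              # reported as reclaimed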
Michal Hocko <mhocko@suse.com> writes:
[snip]
> This really requires more discussion.
Let's start the discussion with some summary.
Requirements:
- Proactive reclaim. The accounting in the current per-memcg proactive
reclaim interface (memory.reclaim) isn't correct: demoted, but not
reclaimed, pages are counted as reclaimed. So "echo XXM >
memory.reclaim" may exit prematurely, before the specified amount of
memory has actually been reclaimed.
- Proactive demotion. We need an interface to do per-memcg proactive
demotion. We may reuse memory.reclaim by extending the concept of
reclaiming to include demoting, or we can add a new interface for
that (for example, memory.demote). In addition to demoting from a fast
tier to a slow tier, in theory we may need to demote from one set of
nodes to another set of nodes for something like general node
balancing.
- Proactive promotion. In theory this is possible, but there are no
real-life requirements yet. It should use a separate interface anyway,
so I don't think we need to discuss it here.
Open questions:
- Use memory.reclaim or memory.demote for proactive demotion? In the
current memcg context, reclaiming and demoting are quite different,
because reclaiming will uncharge while demoting will not. But if we
eventually add per-memory-tier charging, the difference disappears. So
the question becomes whether we will add per-memory-tier charging.
- Should we demote from faster tier nodes to lower tier nodes during
proactive reclaiming? Choice A is to keep as much fast memory as
possible. That is, reclaim from the lowest tier nodes first, then the
second-lowest tier nodes, and so on. Choice B is to demote at the same
time as reclaiming. In this way, if we proactively reclaim XX MB of
memory, we may free XX MB of memory on the fastest memory nodes.
- When we proactively demote some memory from a fast memory tier, should
we trigger memory competition in the slower memory tiers? That is,
should we wake up kswapd on the slower tier nodes? If we want per-memcg
proactive demotion to be strictly per-memcg, we should avoid triggering
global behavior such as memory competition in the slower memory tiers.
Instead, we could add a global proactive demotion interface for that
(such as per-memory-tier or per-node).
Best Regards,
Huang, Ying
On Wed 04-01-23 16:41:50, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
>
> [snip]
>
> > This really requires more discussion.
>
> Let's start the discussion with some summary.
>
> Requirements:
>
> - Proactive reclaim. The counting of current per-memcg proactive
> reclaim (memory.reclaim) isn't correct. The demoted, but not
> reclaimed pages will be counted as reclaimed. So "echo XXM >
> memory.reclaim" may exit prematurely before the specified number of
> memory is reclaimed.

This is reportedly a problem because the memory.reclaim interface cannot
be used for proper memcg sizing IIRC.

> - Proactive demote. We need an interface to do per-memcg proactive
> demote.

For further discussion it would be useful to reference the usecase that
requires this functionality. I believe this has been mentioned somewhere,
but having it in this thread would help.

> We may reuse memory.reclaim via extending the concept of
> reclaiming to include demoting. Or, we can add a new interface for
> that (for example, memory.demote). In addition to demote from fast
> tier to slow tier, in theory, we may need to demote from a set of
> nodes to another set of nodes for something like general node
> balancing.
>
> - Proactive promote. In theory, this is possible, but there's no real
> life requirements yet. And it should use a separate interface, so I
> don't think we need to discuss that here.

Yes, proactive promotion is not backed by any real usecase at the
moment. We do not really have to focus on it but we should be aware of
the possibility and allow future extensions towards that functionality.

There is one requirement missing here:

- Per NUMA node control - this is what makes the distinction between
demotion and charge reclaim really semantically challenging - e.g.
should demotions be constrained by the provided nodemask or should they
be implicit?

> Open questions:
>
> - Use memory.reclaim or memory.demote for proactive demote. In current
> memcg context, reclaiming and demoting is quite different, because
> reclaiming will uncharge, while demoting will not. But if we will add
> per-memory-tier charging finally, the difference disappears. So the
> question becomes whether will we add per-memory-tier charging.

The question is not whether but when IMHO. We've had a similar situation
with the swap accounting. Originally we considered swap a shared
resource, but cgroupv2 goes with per-cgroup swap limits because
contention for the swap space is really something people do care about.

> - Whether should we demote from faster tier nodes to lower tier nodes
> during the proactive reclaiming.

I thought we were aligned on that. Demotion is a part of aging and that
is an integral part of reclaim.

> Choice A is to keep as much fast
> memory as possible. That is, reclaim from the lowest tier nodes
> firstly, then the secondary lowest tier nodes, and so on. Choice B is
> to demote at the same time of reclaiming. In this way, if we
> proactively reclaim XX MB memory, we may free XX MB memory on the
> fastest memory nodes.
>
> - When we proactively demote some memory from a fast memory tier, should
> we trigger memory competition in the slower memory tiers? That is,
> whether to wake up kswapd of the slower memory tiers nodes?

Johannes made some very strong arguments that there is no other choice
than to involve kswapd
(https://lore.kernel.org/all/Y5nEQeXj6HQBEHEY@cmpxchg.org/).

> If we
> want to make per-memcg proactive demoting to be per-memcg strictly, we
> should avoid to trigger the global behavior such as triggering memory
> competition in the slower memory tiers. Instead, we can add a global
> proactive demote interface for that (such as per-memory-tier or
> per-node).

I suspect we are left waiting for a real usecase, and then we follow the
path we took for the swap accounting.

Other open questions I do see are:

- what to do when memory.reclaim is constrained by a nodemask as
mentioned above. Is the whole reclaim process (including aging) bound to
the given nodemask, or does demotion escape from it?

- should the demotion be specific to multi-tier systems, or should the
interface be just NUMA based so that users could use the scheme to
shuffle memory around and allow NUMA balancing from userspace that way?
That would imply that demotion is a dedicated interface of course.

- there are other usecases that would like to trigger aging from
userspace
(http://lkml.kernel.org/r/20221214225123.2770216-1-yuanchu@google.com).
Isn't demotion just a special case of aging in general, or should we
end up with 3 different interfaces?

Michal Hocko <mhocko@suse.com> writes:

> On Wed 04-01-23 16:41:50, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>>
>> [snip]
>>
>> > This really requires more discussion.
>>
>> Let's start the discussion with some summary.
>>
>> Requirements:
>>
>> - Proactive reclaim. The counting of current per-memcg proactive
>> reclaim (memory.reclaim) isn't correct. The demoted, but not
>> reclaimed pages will be counted as reclaimed. So "echo XXM >
>> memory.reclaim" may exit prematurely before the specified number of
>> memory is reclaimed.
>
> This is reportedly a problem because memory.reclaim interface cannot be
> used for proper memcg sizing IIRC.
>
>> - Proactive demote. We need an interface to do per-memcg proactive
>> demote.
>
> For the further discussion it would be useful to reference the usecase
> that is requiring this functionality. I believe this has been mentioned
> somewhere but having it in this thread would help.

Sure. Google people in [1] and [2] requested a per-cgroup interface to
proactively demote but not reclaim:

"
For jobs of some latency tiers, we would like to trigger proactive
demotion (which incurs relatively low latency on the job), but not
trigger proactive reclaim (which incurs a pagefault).
"

Meta people (Johannes) in [3] say they use per-cgroup memory.reclaim to
demote and reclaim proactively.

[1] https://lore.kernel.org/linux-mm/CAHS8izM-XdLgFrQ1k13X-4YrK=JGayRXV_G3c3Qh4NLKP7cH_g@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/CAJD7tkZNW=u1TD-Fd_3RuzRNtaFjxihbGm0836QHkdp0Nn-vyQ@mail.gmail.com/
[3] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/

>> We may reuse memory.reclaim via extending the concept of
>> reclaiming to include demoting. Or, we can add a new interface for
>> that (for example, memory.demote). In addition to demote from fast
>> tier to slow tier, in theory, we may need to demote from a set of
>> nodes to another set of nodes for something like general node
>> balancing.
>>
>> - Proactive promote. In theory, this is possible, but there's no real
>> life requirements yet. And it should use a separate interface, so I
>> don't think we need to discuss that here.
>
> Yes, proactive promotion is not backed by any real usecase at the
> moment. We do not really have to focus on it but we should be aware of
> the posibility and alow future extentions towards that functionality.

OK.

> There is one requirement missing here.
> - Per NUMA node control - this is what makes the distinction between
> demotion and charge reclaim really semantically challenging - e.g.
> should demotions constrained by the provided nodemask or they should
> be implicit?

Yes. We may need to specify the NUMA nodes for the demotion/reclaiming
source, target, or even path. That is, to finely control proactive
demotion/reclaiming.

>> Open questions:
>>
>> - Use memory.reclaim or memory.demote for proactive demote. In current
>> memcg context, reclaiming and demoting is quite different, because
>> reclaiming will uncharge, while demoting will not. But if we will add
>> per-memory-tier charging finally, the difference disappears. So the
>> question becomes whether will we add per-memory-tier charging.
>
> The question is not whether but when IMHO. We've had a similar situation
> with the swap accounting. Originally we have considered swap as a shared
> resource but cgroupv2 goes with per swap limits because contention for
> the swap space is really something people do care about.

So, when we design the user space interface for proactive demotion, we
should keep per-memory-tier charging in mind.

>> - Whether should we demote from faster tier nodes to lower tier nodes
>> during the proactive reclaiming.
>
> I thought we are aligned on that. Demotion is a part of aging and that
> is an integral part of the reclaim.

As in choices A/B in the text below: should we keep more fast memory or
more slow memory? For the original active/inactive LRU lists, we balance
the size of the lists, but we don't have similar machinery for the
memory tiers. What is the preferred balancing policy? Choices A/B below
are two clearly defined extreme policies.

>> Choice A is to keep as much fast
>> memory as possible. That is, reclaim from the lowest tier nodes
>> firstly, then the secondary lowest tier nodes, and so on. Choice B is
>> to demote at the same time of reclaiming. In this way, if we
>> proactively reclaim XX MB memory, we may free XX MB memory on the
>> fastest memory nodes.
>>
>> - When we proactively demote some memory from a fast memory tier, should
>> we trigger memory competition in the slower memory tiers? That is,
>> whether to wake up kswapd of the slower memory tiers nodes?
>
> Johannes made some very strong arguments that there is no other choice
> than involve kswapd (https://lore.kernel.org/all/Y5nEQeXj6HQBEHEY@cmpxchg.org/).

I have no objection to that either. The below is just another choice; if
people don't think it's useful, I will not insist on it.

>> If we
>> want to make per-memcg proactive demoting to be per-memcg strictly, we
>> should avoid to trigger the global behavior such as triggering memory
>> competition in the slower memory tiers. Instead, we can add a global
>> proactive demote interface for that (such as per-memory-tier or
>> per-node).
>
> I suspect we are left with a real usecase and then follow the path we
> took for the swap accounting.

Thanks for adding that.

> Other open questions I do see are
> - what to do when the memory.reclaim is constrained by a nodemask as
> mentioned above. Is the whole reclaim process (including aging) bound to
> the given nodemask or does demotion escape from it.

Per my understanding, we can use multiple node masks if necessary. For
example, for "source=<mask1>", we may demote from <mask1> to other
nodes; for "source=<mask1> destination=<mask2>", we will demote from
<mask1> to <mask2>, but will not demote to other nodes.

> - should the demotion be specific to multi-tier systems or the interface
> should be just NUMA based and users could use the scheme to shuffle
> memory around and allow numa balancing from userspace that way. That
> would imply that demotion is a dedicated interface of course.

It appears that if we can force the demotion target nodes (even in the
same tier), we can implement NUMA balancing from user space?

> - there are other usecases that would like to trigger aging from
> userspace (http://lkml.kernel.org/r/20221214225123.2770216-1-yuanchu@google.com).
> Isn't demotion just a special case of aging in general or should we
> end up with 3 different interfaces?

Thanks for the pointer! If my understanding is correct, this appears to
be a user of the proactive reclaiming/demotion interface? Cc'ed the
patch author for any further requirements for the interface.

Best Regards,
Huang, Ying
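A purely hypothetical sketch of the multiple-nodemask idea discussed
above. Neither the source=/destination= keys nor a memory.demote file
exist in the kernel; they are only proposals from this thread, and the
cgroup path and node numbers are invented:

    # Hypothetical syntax only -- none of these keys are implemented.
    CG=/sys/fs/cgroup/workload

    # Demote up to 1G charged to this memcg from nodes 0-1 to any demotion target.
    echo "1G source=0-1" > "$CG/memory.demote"

    # Demote up to 1G from nodes 0-1 strictly into nodes 2-3, nowhere else.
    echo "1G source=0-1 destination=2-3" > "$CG/memory.demote"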
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index c8ae7c897f14..74cec76be9f2 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1245,13 +1245,17 @@ PAGE_SIZE multiple when read back. This is a simple interface to trigger memory reclaim in the target cgroup. - This file accepts a string which contains the number of bytes to - reclaim. + This file accepts a single key, the number of bytes to reclaim. + No nested keys are currently supported. Example:: echo "1G" > memory.reclaim + The interface can be later extended with nested keys to + configure the reclaim behavior. For example, specify the + type of memory to reclaim from (anon, file, ..). + Please note that the kernel can over or under reclaim from the target cgroup. If less bytes are reclaimed than the specified amount, -EAGAIN is returned. @@ -1263,13 +1267,6 @@ PAGE_SIZE multiple when read back. This means that the networking layer will not adapt based on reclaim induced by memory.reclaim. - This file also allows the user to specify the nodes to reclaim from, - via the 'nodes=' key, for example:: - - echo "1G nodes=0,1" > memory.reclaim - - The above instructs the kernel to reclaim memory from nodes 0,1. - memory.peak A read-only single value file which exists on non-root cgroups. diff --git a/include/linux/swap.h b/include/linux/swap.h index 2787b84eaf12..0ceed49516ad 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -418,8 +418,7 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, - unsigned int reclaim_options, - nodemask_t *nodemask); + unsigned int reclaim_options); extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem, gfp_t gfp_mask, bool noswap, pg_data_t *pgdat, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ab457f0394ab..73afff8062f9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -63,7 +63,6 @@ #include <linux/resume_user_mode.h> #include <linux/psi.h> #include <linux/seq_buf.h> -#include <linux/parser.h> #include "internal.h" #include <net/sock.h> #include <net/ip.h> @@ -2393,8 +2392,7 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg, psi_memstall_enter(&pflags); nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, - MEMCG_RECLAIM_MAY_SWAP, - NULL); + MEMCG_RECLAIM_MAY_SWAP); psi_memstall_leave(&pflags); } while ((memcg = parent_mem_cgroup(memcg)) && !mem_cgroup_is_root(memcg)); @@ -2685,8 +2683,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, psi_memstall_enter(&pflags); nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages, - gfp_mask, reclaim_options, - NULL); + gfp_mask, reclaim_options); psi_memstall_leave(&pflags); if (mem_cgroup_margin(mem_over_limit) >= nr_pages) @@ -3506,8 +3503,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg, } if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, - NULL)) { + memsw ? 
0 : MEMCG_RECLAIM_MAY_SWAP)) { ret = -EBUSY; break; } @@ -3618,8 +3614,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg) return -EINTR; if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, - MEMCG_RECLAIM_MAY_SWAP, - NULL)) + MEMCG_RECLAIM_MAY_SWAP)) nr_retries--; } @@ -6429,8 +6424,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, } reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, - NULL); + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP); if (!reclaimed && !nr_retries--) break; @@ -6479,8 +6473,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, if (nr_reclaims) { if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max, - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, - NULL)) + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP)) nr_reclaims--; continue; } @@ -6603,54 +6596,21 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, return nbytes; } -enum { - MEMORY_RECLAIM_NODES = 0, - MEMORY_RECLAIM_NULL, -}; - -static const match_table_t if_tokens = { - { MEMORY_RECLAIM_NODES, "nodes=%s" }, - { MEMORY_RECLAIM_NULL, NULL }, -}; - static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); unsigned int nr_retries = MAX_RECLAIM_RETRIES; unsigned long nr_to_reclaim, nr_reclaimed = 0; - unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP | - MEMCG_RECLAIM_PROACTIVE; - char *old_buf, *start; - substring_t args[MAX_OPT_ARGS]; - int token; - char value[256]; - nodemask_t nodemask = NODE_MASK_ALL; - - buf = strstrip(buf); - - old_buf = buf; - nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE; - if (buf == old_buf) - return -EINVAL; + unsigned int reclaim_options; + int err; buf = strstrip(buf); + err = page_counter_memparse(buf, "", &nr_to_reclaim); + if (err) + return err; - while ((start = strsep(&buf, " ")) != NULL) { - if (!strlen(start)) - continue; - token = match_token(start, if_tokens, args); - match_strlcpy(value, args, sizeof(value)); - switch (token) { - case MEMORY_RECLAIM_NODES: - if (nodelist_parse(value, nodemask) < 0) - return -EINVAL; - break; - default: - return -EINVAL; - } - } - + reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE; while (nr_reclaimed < nr_to_reclaim) { unsigned long reclaimed; @@ -6667,8 +6627,7 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_to_reclaim - nr_reclaimed, - GFP_KERNEL, reclaim_options, - &nodemask); + GFP_KERNEL, reclaim_options); if (!reclaimed && !nr_retries--) return -EAGAIN; diff --git a/mm/vmscan.c b/mm/vmscan.c index aba991c505f1..546540bc770a 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -6757,8 +6757,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, - unsigned int reclaim_options, - nodemask_t *nodemask) + unsigned int reclaim_options) { unsigned long nr_reclaimed; unsigned int noreclaim_flag; @@ -6773,7 +6772,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, .may_unmap = 1, .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP), .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE), - .nodemask = nodemask, }; /* * Traverse the ZONELIST_FALLBACK zonelist of the current node to put