Message ID | Y5xASNe1x8cusiTx@dhcp22.suse.cz |
---|---|
State | New |
Series | Revert "mm: add nodes= arg to memory.reclaim" |
On Fri, Dec 16, 2022 at 1:54 AM Michal Hocko <mhocko@suse.com> wrote:
>
> Andrew,
> I have noticed that the patch made it into Linus tree already. Can we
> please revert it because the semantic is not really clear and we should
> really not create yet another user API maintenance problem. I am
> proposing to revert the nodemask extension for now before we grow any
> upstream users. Deeper in the email thread are some proposals how to
> move forward with that.

There are proposals, many of which have been rejected for not addressing
the motivating use cases, others that have been rejected by fellow
maintainers, and some that are still awaiting feedback. No, there is no
other clear-cut way forward for this use case right now. I have found the
merged approach by far the most agreeable so far.

> ---
> From 7c5285f1725d5abfcae5548ab0d73be9ceded2a1 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 16 Dec 2022 10:46:33 +0100
> Subject: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
>
> This reverts commit 12a5d3955227b0d7e04fb793ccceeb2a1dd275c5.
>
> Although it is recognized that a finer grained pro-active reclaim is
> something we need and want the semantic of this implementation is really
> ambiguous.
>
> From a follow up discussion it became clear that there are two essential
> usecases here. One is to use memory.reclaim to pro-actively reclaim
> memory and expectation is that the requested and reported amount of memory is
> uncharged from the memcg. Another usecase focuses on pro-active demotion
> when the memory is merely shuffled around to demotion targets while the
> overall charged memory stays unchanged.
>
> The current implementation considers demoted pages as reclaimed and that
> break both usecases.

I think you're making it sound like this specific patch broke both use
cases, and IMO that is not accurate. Commit 3f1509c57b1b ("Revert
"mm/vmscan: never demote for memcg reclaim"") has been in the tree for
around 7 months now, and that is the commit that enabled demotion in
memcg reclaim and implicitly counted demoted pages as reclaimed in memcg
reclaim, which is the source of the ambiguity -- not the patch that you
are reverting here.

The irony I find with this revert is that this patch actually removes
the ambiguity rather than exacerbating it. Currently, using
memory.reclaim _without_ the nodes= arg is ambiguous because demoted
pages count as reclaimed. On the other hand, using memory.reclaim
_with_ the nodes= arg is completely unambiguous: the kernel will
demote-only from top tier nodes and reclaim-only from bottom tier nodes.

> [1] has tried to address the reporting part but
> there are more issues with that summarized in [2] and follow up emails.
>

I am the one that put effort into resolving the ambiguity introduced
by commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
reclaim"") and proposed [1]. Reverting this patch does nothing to
resolve ambiguity that it did not introduce.

> Let's revert the nodemask based extension of the memcg pro-active
> reclaim for now until we settle with a more robust semantic.
>

I do not think we should revert this. It enables a couple of important
use cases for Google:

1. Enables us to specifically trigger proactive reclaim in a memcg on
a memory tiered system by specifying only the lower tiered nodes using
the nodes= arg.
2. Enables us to specifically trigger proactive demotion in a memcg on
a memory tiered system by specifying only the top tier nodes using the
nodes= arg.
Both use cases are broken with this revert, and no progress to resolve the ambiguity is made with this revert. I agree with Michal that there is ambiguity that has existed in the kernel for about 7 months now and is introduced by commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""), and I'm trying to fix this ambiguity in [1]. I think we should move forward in fixing the ambiguity through the review of the patch in [1] and not revert patches that enable useful use-cases and did not introduce the ambiguity. > [1] http://lkml.kernel.org/r/http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina@google.com Broken link. Actual link to my patch to fix the ambiguity: [1] https://lore.kernel.org/linux-mm/20221206023406.3182800-1-almasrymina@google.com/ > [2] http://lkml.kernel.org/r/Y5bsmpCyeryu3Zz1@dhcp22.suse.cz > Signed-off-by: Michal Hocko <mhocko@suse.com> > --- > Documentation/admin-guide/cgroup-v2.rst | 15 +++--- > include/linux/swap.h | 3 +- > mm/memcontrol.c | 67 +++++-------------------- > mm/vmscan.c | 4 +- > 4 files changed, 21 insertions(+), 68 deletions(-) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index c8ae7c897f14..74cec76be9f2 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1245,13 +1245,17 @@ PAGE_SIZE multiple when read back. > This is a simple interface to trigger memory reclaim in the > target cgroup. > > - This file accepts a string which contains the number of bytes to > - reclaim. > + This file accepts a single key, the number of bytes to reclaim. > + No nested keys are currently supported. > > Example:: > > echo "1G" > memory.reclaim > > + The interface can be later extended with nested keys to > + configure the reclaim behavior. For example, specify the > + type of memory to reclaim from (anon, file, ..). > + > Please note that the kernel can over or under reclaim from > the target cgroup. If less bytes are reclaimed than the > specified amount, -EAGAIN is returned. > @@ -1263,13 +1267,6 @@ PAGE_SIZE multiple when read back. > This means that the networking layer will not adapt based on > reclaim induced by memory.reclaim. > > - This file also allows the user to specify the nodes to reclaim from, > - via the 'nodes=' key, for example:: > - > - echo "1G nodes=0,1" > memory.reclaim > - > - The above instructs the kernel to reclaim memory from nodes 0,1. > - > memory.peak > A read-only single value file which exists on non-root > cgroups. 
> diff --git a/include/linux/swap.h b/include/linux/swap.h > index 2787b84eaf12..0ceed49516ad 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -418,8 +418,7 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > unsigned long nr_pages, > gfp_t gfp_mask, > - unsigned int reclaim_options, > - nodemask_t *nodemask); > + unsigned int reclaim_options); > extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem, > gfp_t gfp_mask, bool noswap, > pg_data_t *pgdat, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index ab457f0394ab..73afff8062f9 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -63,7 +63,6 @@ > #include <linux/resume_user_mode.h> > #include <linux/psi.h> > #include <linux/seq_buf.h> > -#include <linux/parser.h> > #include "internal.h" > #include <net/sock.h> > #include <net/ip.h> > @@ -2393,8 +2392,7 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg, > psi_memstall_enter(&pflags); > nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages, > gfp_mask, > - MEMCG_RECLAIM_MAY_SWAP, > - NULL); > + MEMCG_RECLAIM_MAY_SWAP); > psi_memstall_leave(&pflags); > } while ((memcg = parent_mem_cgroup(memcg)) && > !mem_cgroup_is_root(memcg)); > @@ -2685,8 +2683,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, > > psi_memstall_enter(&pflags); > nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages, > - gfp_mask, reclaim_options, > - NULL); > + gfp_mask, reclaim_options); > psi_memstall_leave(&pflags); > > if (mem_cgroup_margin(mem_over_limit) >= nr_pages) > @@ -3506,8 +3503,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg, > } > > if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, > - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, > - NULL)) { > + memsw ? 
0 : MEMCG_RECLAIM_MAY_SWAP)) { > ret = -EBUSY; > break; > } > @@ -3618,8 +3614,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg) > return -EINTR; > > if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, > - MEMCG_RECLAIM_MAY_SWAP, > - NULL)) > + MEMCG_RECLAIM_MAY_SWAP)) > nr_retries--; > } > > @@ -6429,8 +6424,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > } > > reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, > - NULL); > + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP); > > if (!reclaimed && !nr_retries--) > break; > @@ -6479,8 +6473,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, > > if (nr_reclaims) { > if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max, > - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, > - NULL)) > + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP)) > nr_reclaims--; > continue; > } > @@ -6603,54 +6596,21 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > return nbytes; > } > > -enum { > - MEMORY_RECLAIM_NODES = 0, > - MEMORY_RECLAIM_NULL, > -}; > - > -static const match_table_t if_tokens = { > - { MEMORY_RECLAIM_NODES, "nodes=%s" }, > - { MEMORY_RECLAIM_NULL, NULL }, > -}; > - > static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > size_t nbytes, loff_t off) > { > struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > unsigned int nr_retries = MAX_RECLAIM_RETRIES; > unsigned long nr_to_reclaim, nr_reclaimed = 0; > - unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP | > - MEMCG_RECLAIM_PROACTIVE; > - char *old_buf, *start; > - substring_t args[MAX_OPT_ARGS]; > - int token; > - char value[256]; > - nodemask_t nodemask = NODE_MASK_ALL; > - > - buf = strstrip(buf); > - > - old_buf = buf; > - nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE; > - if (buf == old_buf) > - return -EINVAL; > + unsigned int reclaim_options; > + int err; > > buf = strstrip(buf); > + err = page_counter_memparse(buf, "", &nr_to_reclaim); > + if (err) > + return err; > > - while ((start = strsep(&buf, " ")) != NULL) { > - if (!strlen(start)) > - continue; > - token = match_token(start, if_tokens, args); > - match_strlcpy(value, args, sizeof(value)); > - switch (token) { > - case MEMORY_RECLAIM_NODES: > - if (nodelist_parse(value, nodemask) < 0) > - return -EINVAL; > - break; > - default: > - return -EINVAL; > - } > - } > - > + reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE; > while (nr_reclaimed < nr_to_reclaim) { > unsigned long reclaimed; > > @@ -6667,8 +6627,7 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > > reclaimed = try_to_free_mem_cgroup_pages(memcg, > nr_to_reclaim - nr_reclaimed, > - GFP_KERNEL, reclaim_options, > - &nodemask); > + GFP_KERNEL, reclaim_options); > > if (!reclaimed && !nr_retries--) > return -EAGAIN; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index aba991c505f1..546540bc770a 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -6757,8 +6757,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, > unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > unsigned long nr_pages, > gfp_t gfp_mask, > - unsigned int reclaim_options, > - nodemask_t *nodemask) > + unsigned int reclaim_options) > { > unsigned long nr_reclaimed; > unsigned int noreclaim_flag; > @@ -6773,7 +6772,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > .may_unmap = 1, > .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP), > .proactive = !!(reclaim_options & 
MEMCG_RECLAIM_PROACTIVE), > - .nodemask = nodemask, > }; > /* > * Traverse the ZONELIST_FALLBACK zonelist of the current node to put > -- > 2.30.2 > > -- > Michal Hocko > SUSE Labs
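For reference, here is a minimal sketch of how the two use cases described
above map onto the nodes= extension that this patch reverts. The cgroup
path and the node numbering are made up for illustration; it assumes nodes
0-1 are top-tier DRAM and nodes 2-3 are slower demotion targets:

    # Hypothetical layout: nodes 0-1 = top-tier DRAM, nodes 2-3 = slow tier.
    CG=/sys/fs/cgroup/workload

    # Use case 1: proactive reclaim. Only lower-tier nodes are eligible,
    # so pages are actually freed and uncharged rather than demoted.
    echo "1G nodes=2,3" > "$CG/memory.reclaim"

    # Use case 2: proactive demotion. Only top-tier nodes are eligible,
    # so pages are moved to the demotion targets and the charge stays.
    echo "1G nodes=0,1" > "$CG/memory.reclaim"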
On Fri 16-12-22 04:02:12, Mina Almasry wrote:
> On Fri, Dec 16, 2022 at 1:54 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > Andrew,
> > I have noticed that the patch made it into Linus tree already. Can we
> > please revert it because the semantic is not really clear and we should
> > really not create yet another user API maintenance problem. I am
> > proposing to revert the nodemask extension for now before we grow any
> > upstream users. Deeper in the email thread are some proposals how to
> > move forward with that.
>
> There are proposals, many which have been rejected due to not
> addressing the motivating use cases and others that have been rejected
> by fellow maintainers, and some that are awaiting feedback. No, there
> is no other clear-cut way forward for this use case right now. I have
> found the merged approach by far the most agreeable so far.

There is a clear need for further discussion, and until then we do not
want to expose an interface and create dependencies that will inevitably
make it hard to change the semantic later.

> > From 7c5285f1725d5abfcae5548ab0d73be9ceded2a1 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Fri, 16 Dec 2022 10:46:33 +0100
> > Subject: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
> >
> > This reverts commit 12a5d3955227b0d7e04fb793ccceeb2a1dd275c5.
> >
> > Although it is recognized that a finer grained pro-active reclaim is
> > something we need and want the semantic of this implementation is really
> > ambiguous.
> >
> > From a follow up discussion it became clear that there are two essential
> > usecases here. One is to use memory.reclaim to pro-actively reclaim
> > memory and expectation is that the requested and reported amount of memory is
> > uncharged from the memcg. Another usecase focuses on pro-active demotion
> > when the memory is merely shuffled around to demotion targets while the
> > overall charged memory stays unchanged.
> >
> > The current implementation considers demoted pages as reclaimed and that
> > break both usecases.
>
> I think you're making it sound like this specific patch broke both use
> cases, and IMO that is not accurate. commit 3f1509c57b1b ("Revert
> "mm/vmscan: never demote for memcg reclaim"") has been in the tree for
> around 7 months now and that is the commit that enabled demotion in
> memcg reclaim, and implicitly counted demoted pages as reclaimed in
> memcg reclaim, which is the source of the ambiguity. Not the patch
> that you are reverting here.
>
> The irony I find with this revert is that this patch actually removes
> the ambiguity and does not exacerbate it. Currently using
> memory.reclaim _without_ the nodes= arg is ambiguous because demoted
> pages count as reclaimed. On the other hand using memory.reclaim
> _with_ the nodes= arg is completely unambiguous: the kernel will
> demote-only from top tier nodes and reclaim-only from bottom tier
> nodes.

Yes, demoted pages are indeed counted as reclaimed, but that is not a
major issue because, from the external point of view, charges are
getting reclaimed. It is the nodes specification which makes the latent
problem much more obvious.

>
> > [1] has tried to address the reporting part but
> > there are more issues with that summarized in [2] and follow up emails.
> >
>
> I am the one that put effort into resolving the ambiguity introduced
> by commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim"") and proposed [1]. Reverting this patch does nothing to
> resolve ambiguity that it did not introduce.
>
> > Let's revert the nodemask based extension of the memcg pro-active
> > reclaim for now until we settle with a more robust semantic.
> >
>
> I do not think we should revert this. It enables a couple of important
> use cases for Google:
>
> 1. Enables us to specifically trigger proactive reclaim in a memcg on
> a memory tiered system by specifying only the lower tiered nodes using
> the nodes= arg.
> 2. Enabled us to specifically trigger proactive demotion in a memcg on
> a memory tiered system by specifying only the top tier nodes using the
> nodes= arg.

That is clear, and the aim of the revert is not to disallow those
usecases. We just need a clear and futureproof interface for that.
Changing the semantic after the fact is a no-go, hence the revert.

>
> Both use cases are broken with this revert, and no progress to resolve
> the ambiguity is made with this revert.

There cannot be any regression with the revert now because the code
hasn't been upstream. So let's remove the interface until we can agree
on the exact semantic and build the interface from there.

On 12/16/22 16:54, Michal Hocko wrote: > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index c8ae7c897f14..74cec76be9f2 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1245,13 +1245,17 @@ PAGE_SIZE multiple when read back. > This is a simple interface to trigger memory reclaim in the > target cgroup. > > - This file accepts a string which contains the number of bytes to > - reclaim. > + This file accepts a single key, the number of bytes to reclaim. > + No nested keys are currently supported. > > Example:: > > echo "1G" > memory.reclaim > > + The interface can be later extended with nested keys to > + configure the reclaim behavior. For example, specify the > + type of memory to reclaim from (anon, file, ..). > + > Please note that the kernel can over or under reclaim from > the target cgroup. If less bytes are reclaimed than the > specified amount, -EAGAIN is returned. > @@ -1263,13 +1267,6 @@ PAGE_SIZE multiple when read back. > This means that the networking layer will not adapt based on > reclaim induced by memory.reclaim. > > - This file also allows the user to specify the nodes to reclaim from, > - via the 'nodes=' key, for example:: > - > - echo "1G nodes=0,1" > memory.reclaim > - > - The above instructs the kernel to reclaim memory from nodes 0,1. > - > memory.peak > A read-only single value file which exists on non-root > cgroups. Ah! I forgot to add my Reviewed-by: tag when the original patch was submitted. However, I was Cc'ed the revert presumably due to Cc: tag in the original. In any case, for the documentation part: Acked-by: Bagas Sanjaya <bagasdotme@gmail.com>
On Fri, 16 Dec 2022 10:54:16 +0100 Michal Hocko <mhocko@suse.com> wrote:

> I have noticed that the patch made it into Linus tree already. Can we
> please revert it because the semantic is not really clear and we should
> really not create yet another user API maintenance problem.

Well dang. I was waiting for the discussion to converge, blissfully
unaware that the thing was sitting in mm-stable :( I guess the

Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>

fooled me.

I think it's a bit premature to revert at this stage. Possibly we can
get to the desired end state by modifying the existing code. Possibly
we can get to the desired end state by reverting this and by adding
something new.

If we can't get to the desired end state at all then yes, I'll send
Linus a revert of this patch later in this -rc cycle.

On Fri 16-12-22 10:18:20, Andrew Morton wrote:
> On Fri, 16 Dec 2022 10:54:16 +0100 Michal Hocko <mhocko@suse.com> wrote:
>
> > I have noticed that the patch made it into Linus tree already. Can we
> > please revert it because the semantic is not really clear and we should
> > really not create yet another user API maintenance problem.
>
> Well dang. I was waiting for the discussion to converge, blissfully
> unaware that the thing was sitting in mm-stable :( I guess the
>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Acked-by: Shakeel Butt <shakeelb@google.com>
> Acked-by: Muchun Song <songmuchun@bytedance.com>
>
> fooled me.

Hmm, as pointed out in
http://lkml.kernel.org/r/Y5bsmpCyeryu3Zz1@dhcp22.suse.cz I've failed to
see through all the consequences of the implementation. So my bad here
for adding my ack before fully understanding all the implications.

> I think it's a bit premature to revert at this stage. Possibly we can
> get to the desired end state by modifying the existing code. Possibly
> we can get to the desired end state by reverting this and by adding
> something new.

Sure, if we can converge to a proper implementation during the rc phase
then it would be OK. I cannot speak for others, but at least for me the
upcoming 2 weeks will be mostly offline so I cannot really contribute
much. A revert would be much easier from the coordination POV IMHO.

Also I do not think there is any strong reason to rush this in. I do not
really see any major problems to have this extension in 6.2.

On Sat, 17 Dec 2022 10:57:06 +0100 Michal Hocko <mhocko@suse.com> wrote:

> > I think it's a bit premature to revert at this stage. Possibly we can
> > get to the desired end state by modifying the existing code. Possibly
> > we can get to the desired end state by reverting this and by adding
> > something new.
>
> Sure if we can converge to a proper implementation during the rc phase
> then it would be ok. I cannot speak for others but at least for me
> upcoming 2 weeks would be mostly offline so I cannot really contribute
> much. A revert would be much more easier from the coordination POV IMHO.
>
> Also I do not think there is any strong reason to rush this in. I do not
> really see any major problems to have this extension in 6.2

I'll queue the revert in mm-unstable with a plan to merge it upstream
around the -rc5 timeframe if there hasn't been resolution.

Please check Mina's issues with this revert's current changelog and
perhaps send along a revised one.

[Sorry I was offline]

On Mon 19-12-22 14:42:52, Andrew Morton wrote:
> On Sat, 17 Dec 2022 10:57:06 +0100 Michal Hocko <mhocko@suse.com> wrote:
>
> > > I think it's a bit premature to revert at this stage. Possibly we can
> > > get to the desired end state by modifying the existing code. Possibly
> > > we can get to the desired end state by reverting this and by adding
> > > something new.
> >
> > Sure if we can converge to a proper implementation during the rc phase
> > then it would be ok. I cannot speak for others but at least for me
> > upcoming 2 weeks would be mostly offline so I cannot really contribute
> > much. A revert would be much more easier from the coordination POV IMHO.
> >
> > Also I do not think there is any strong reason to rush this in. I do not
> > really see any major problems to have this extension in 6.2
>
> I'll queue the revert in mm-unstable with a plan to merge it upstream
> around the -rc5 timeframe if there hasn't been resolution.

Thanks! I do not really think we need to rush node based reclaim; it is
better to have a reasonable and more future-proof interface.

> Please check Mina's issues with this revert's current changelog and
> perhaps send along a revised one.

Yes, I believe I have addressed all the feedback, but I am of course open
to altering the wording. The biggest concern from Mina IIRC was that the
nr_reclaimed reporting is a pre-existing problem, and I agree with that.
The thing is that this doesn't matter without a node specification,
because the memory gets reclaimed even if the reported value is
over-accounted. With a nodemask specification the value becomes bogus if
no demotion target nodes are specified, because potentially no memory
gets reclaimed at all while success is still reported. Mina has tried to
address that but I am not convinced the fix is actually future proof.
This really requires more discussion.
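To make the reporting concern above concrete, a small sketch under assumed
conditions (the cgroup path and node roles are made up; node 0 is a
top-tier node whose demotion target lies outside the given nodemask):

    CG=/sys/fs/cgroup/workload

    cat "$CG/memory.current"                  # e.g. prints 2147483648
    echo "1G nodes=0" > "$CG/memory.reclaim"  # the write succeeds ...
    cat "$CG/memory.current"                  # ... yet may print nearly the
                                              # same value: the 1G was demoted,
                                              # not uncharged, but it was still
                                              # reported as reclaimed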
Michal Hocko <mhocko@suse.com> writes:
[snip]
> This really requires more discussion.
Let's start the discussion with some summary.
Requirements:
- Proactive reclaim. The accounting in the current per-memcg proactive
reclaim interface (memory.reclaim) isn't correct: demoted, but not
reclaimed, pages are counted as reclaimed. So "echo XXM >
memory.reclaim" may exit prematurely, before the specified amount of
memory has actually been reclaimed.
- Proactive demotion. We need an interface to do per-memcg proactive
demotion. We may reuse memory.reclaim by extending the concept of
reclaiming to include demoting, or we can add a new interface for
that (for example, memory.demote). In addition to demoting from a fast
tier to a slow tier, in theory we may need to demote from one set of
nodes to another set of nodes for something like general node
balancing.
- Proactive promotion. In theory this is possible, but there are no
real-life requirements yet. It should use a separate interface anyway,
so I don't think we need to discuss it here.
Open questions:
- Use memory.reclaim or memory.demote for proactive demotion? In the
current memcg context, reclaiming and demoting are quite different,
because reclaiming will uncharge while demoting will not. But if we
eventually add per-memory-tier charging, the difference disappears. So
the question becomes whether we will add per-memory-tier charging.
- Should we demote from faster tier nodes to lower tier nodes during
proactive reclaiming? Choice A is to keep as much fast memory as
possible. That is, reclaim from the lowest tier nodes first, then the
second-lowest tier nodes, and so on. Choice B is to demote at the same
time as reclaiming. In this way, if we proactively reclaim XX MB of
memory, we may free XX MB of memory on the fastest memory nodes.
- When we proactively demote some memory from a fast memory tier, should
we trigger memory competition in the slower memory tiers? That is,
should we wake up kswapd on the slower tier nodes? If we want per-memcg
proactive demotion to be strictly per-memcg, we should avoid triggering
global behavior such as memory competition in the slower memory tiers.
Instead, we could add a global proactive demotion interface for that
(such as per-memory-tier or per-node).
Best Regards,
Huang, Ying
On Wed 04-01-23 16:41:50, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
>
> [snip]
>
> > This really requires more discussion.
>
> Let's start the discussion with some summary.
>
> Requirements:
>
> - Proactive reclaim. The counting of current per-memcg proactive
> reclaim (memory.reclaim) isn't correct. The demoted, but not
> reclaimed pages will be counted as reclaimed. So "echo XXM >
> memory.reclaim" may exit prematurely before the specified number of
> memory is reclaimed.

This is reportedly a problem because the memory.reclaim interface cannot
be used for proper memcg sizing IIRC.

> - Proactive demote. We need an interface to do per-memcg proactive
> demote.

For further discussion it would be useful to reference the usecase that
requires this functionality. I believe this has been mentioned somewhere,
but having it in this thread would help.

> We may reuse memory.reclaim via extending the concept of
> reclaiming to include demoting. Or, we can add a new interface for
> that (for example, memory.demote). In addition to demote from fast
> tier to slow tier, in theory, we may need to demote from a set of
> nodes to another set of nodes for something like general node
> balancing.
>
> - Proactive promote. In theory, this is possible, but there's no real
> life requirements yet. And it should use a separate interface, so I
> don't think we need to discuss that here.

Yes, proactive promotion is not backed by any real usecase at the
moment. We do not really have to focus on it but we should be aware of
the possibility and allow future extensions towards that functionality.

There is one requirement missing here:

- Per NUMA node control - this is what makes the distinction between
demotion and charge reclaim really semantically challenging - e.g.
should demotions be constrained by the provided nodemask or should they
be implicit?

> Open questions:
>
> - Use memory.reclaim or memory.demote for proactive demote. In current
> memcg context, reclaiming and demoting is quite different, because
> reclaiming will uncharge, while demoting will not. But if we will add
> per-memory-tier charging finally, the difference disappears. So the
> question becomes whether will we add per-memory-tier charging.

The question is not whether but when IMHO. We've had a similar situation
with the swap accounting. Originally we considered swap a shared
resource, but cgroupv2 goes with per-cgroup swap limits because
contention for the swap space is really something people do care about.

> - Whether should we demote from faster tier nodes to lower tier nodes
> during the proactive reclaiming.

I thought we were aligned on that. Demotion is a part of aging and that
is an integral part of reclaim.

> Choice A is to keep as much fast
> memory as possible. That is, reclaim from the lowest tier nodes
> firstly, then the secondary lowest tier nodes, and so on. Choice B is
> to demote at the same time of reclaiming. In this way, if we
> proactively reclaim XX MB memory, we may free XX MB memory on the
> fastest memory nodes.
>
> - When we proactively demote some memory from a fast memory tier, should
> we trigger memory competition in the slower memory tiers? That is,
> whether to wake up kswapd of the slower memory tiers nodes?

Johannes made some very strong arguments that there is no other choice
than to involve kswapd
(https://lore.kernel.org/all/Y5nEQeXj6HQBEHEY@cmpxchg.org/).

> If we
> want to make per-memcg proactive demoting to be per-memcg strictly, we
> should avoid to trigger the global behavior such as triggering memory
> competition in the slower memory tiers. Instead, we can add a global
> proactive demote interface for that (such as per-memory-tier or
> per-node).

I suspect we are left waiting for a real usecase, and then we follow the
path we took for the swap accounting.

Other open questions I do see are:

- what to do when memory.reclaim is constrained by a nodemask as
mentioned above. Is the whole reclaim process (including aging) bound to
the given nodemask, or does demotion escape from it?

- should the demotion be specific to multi-tier systems, or should the
interface be just NUMA based so that users could use the scheme to
shuffle memory around and allow NUMA balancing from userspace that way?
That would imply that demotion is a dedicated interface of course.

- there are other usecases that would like to trigger aging from
userspace
(http://lkml.kernel.org/r/20221214225123.2770216-1-yuanchu@google.com).
Isn't demotion just a special case of aging in general, or should we
end up with 3 different interfaces?

Michal Hocko <mhocko@suse.com> writes:

> On Wed 04-01-23 16:41:50, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>>
>> [snip]
>>
>> > This really requires more discussion.
>>
>> Let's start the discussion with some summary.
>>
>> Requirements:
>>
>> - Proactive reclaim. The counting of current per-memcg proactive
>> reclaim (memory.reclaim) isn't correct. The demoted, but not
>> reclaimed pages will be counted as reclaimed. So "echo XXM >
>> memory.reclaim" may exit prematurely before the specified number of
>> memory is reclaimed.
>
> This is reportedly a problem because memory.reclaim interface cannot be
> used for proper memcg sizing IIRC.
>
>> - Proactive demote. We need an interface to do per-memcg proactive
>> demote.
>
> For the further discussion it would be useful to reference the usecase
> that is requiring this functionality. I believe this has been mentioned
> somewhere but having it in this thread would help.

Sure. Google people in [1] and [2] requested a per-cgroup interface to
proactively demote but not reclaim:

"
For jobs of some latency tiers, we would like to trigger proactive
demotion (which incurs relatively low latency on the job), but not
trigger proactive reclaim (which incurs a pagefault).
"

Meta people (Johannes) in [3] say they use per-cgroup memory.reclaim to
demote and reclaim proactively.

[1] https://lore.kernel.org/linux-mm/CAHS8izM-XdLgFrQ1k13X-4YrK=JGayRXV_G3c3Qh4NLKP7cH_g@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/CAJD7tkZNW=u1TD-Fd_3RuzRNtaFjxihbGm0836QHkdp0Nn-vyQ@mail.gmail.com/
[3] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/

>> We may reuse memory.reclaim via extending the concept of
>> reclaiming to include demoting. Or, we can add a new interface for
>> that (for example, memory.demote). In addition to demote from fast
>> tier to slow tier, in theory, we may need to demote from a set of
>> nodes to another set of nodes for something like general node
>> balancing.
>>
>> - Proactive promote. In theory, this is possible, but there's no real
>> life requirements yet. And it should use a separate interface, so I
>> don't think we need to discuss that here.
>
> Yes, proactive promotion is not backed by any real usecase at the
> moment. We do not really have to focus on it but we should be aware of
> the posibility and alow future extentions towards that functionality.

OK.

> There is one requirement missing here.
> - Per NUMA node control - this is what makes the distinction between
> demotion and charge reclaim really semantically challenging - e.g.
> should demotions constrained by the provided nodemask or they should
> be implicit?

Yes. We may need to specify the NUMA nodes for the demotion/reclaiming
source, target, or even path. That is, to finely control proactive
demotion/reclaiming.

>> Open questions:
>>
>> - Use memory.reclaim or memory.demote for proactive demote. In current
>> memcg context, reclaiming and demoting is quite different, because
>> reclaiming will uncharge, while demoting will not. But if we will add
>> per-memory-tier charging finally, the difference disappears. So the
>> question becomes whether will we add per-memory-tier charging.
>
> The question is not whether but when IMHO. We've had a similar situation
> with the swap accounting. Originally we have considered swap as a shared
> resource but cgroupv2 goes with per swap limits because contention for
> the swap space is really something people do care about.

So, when we design the user space interface for proactive demotion, we
should keep per-memory-tier charging in mind.

>> - Whether should we demote from faster tier nodes to lower tier nodes
>> during the proactive reclaiming.
>
> I thought we are aligned on that. Demotion is a part of aging and that
> is an integral part of the reclaim.

As in choices A/B in the text below: should we keep more fast memory or
more slow memory? For the original active/inactive LRU lists, we balance
the size of the lists, but we don't have similar machinery for the
memory tiers. What is the preferred balancing policy? Choices A/B below
are two clearly defined extreme policies.

>> Choice A is to keep as much fast
>> memory as possible. That is, reclaim from the lowest tier nodes
>> firstly, then the secondary lowest tier nodes, and so on. Choice B is
>> to demote at the same time of reclaiming. In this way, if we
>> proactively reclaim XX MB memory, we may free XX MB memory on the
>> fastest memory nodes.
>>
>> - When we proactively demote some memory from a fast memory tier, should
>> we trigger memory competition in the slower memory tiers? That is,
>> whether to wake up kswapd of the slower memory tiers nodes?
>
> Johannes made some very strong arguments that there is no other choice
> than involve kswapd (https://lore.kernel.org/all/Y5nEQeXj6HQBEHEY@cmpxchg.org/).

I have no objection to that either. The below is just another choice; if
people don't think it's useful, I will not insist on it.

>> If we
>> want to make per-memcg proactive demoting to be per-memcg strictly, we
>> should avoid to trigger the global behavior such as triggering memory
>> competition in the slower memory tiers. Instead, we can add a global
>> proactive demote interface for that (such as per-memory-tier or
>> per-node).
>
> I suspect we are left with a real usecase and then follow the path we
> took for the swap accounting.

Thanks for adding that.

> Other open questions I do see are
> - what to do when the memory.reclaim is constrained by a nodemask as
> mentioned above. Is the whole reclaim process (including aging) bound to
> the given nodemask or does demotion escape from it.

Per my understanding, we can use multiple node masks if necessary. For
example, for "source=<mask1>", we may demote from <mask1> to other
nodes; for "source=<mask1> destination=<mask2>", we will demote from
<mask1> to <mask2>, but will not demote to other nodes.

> - should the demotion be specific to multi-tier systems or the interface
> should be just NUMA based and users could use the scheme to shuffle
> memory around and allow numa balancing from userspace that way. That
> would imply that demotion is a dedicated interface of course.

It appears that if we can force the demotion target nodes (even in the
same tier), we can implement NUMA balancing from user space?

> - there are other usecases that would like to trigger aging from
> userspace (http://lkml.kernel.org/r/20221214225123.2770216-1-yuanchu@google.com).
> Isn't demotion just a special case of aging in general or should we
> end up with 3 different interfaces?

Thanks for the pointer! If my understanding is correct, this appears to
be a user of the proactive reclaiming/demotion interface? Cc'ed the
patch author for any further requirements for the interface.

Best Regards,
Huang, Ying
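A purely hypothetical sketch of the multiple-nodemask idea discussed
above. Neither the source=/destination= keys nor a memory.demote file
exist in the kernel; they are only proposals from this thread, and the
cgroup path and node numbers are invented:

    # Hypothetical syntax only -- none of these keys are implemented.
    CG=/sys/fs/cgroup/workload

    # Demote up to 1G charged to this memcg from nodes 0-1 to any demotion target.
    echo "1G source=0-1" > "$CG/memory.demote"

    # Demote up to 1G from nodes 0-1 strictly into nodes 2-3, nowhere else.
    echo "1G source=0-1 destination=2-3" > "$CG/memory.demote"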
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index c8ae7c897f14..74cec76be9f2 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1245,13 +1245,17 @@ PAGE_SIZE multiple when read back. This is a simple interface to trigger memory reclaim in the target cgroup. - This file accepts a string which contains the number of bytes to - reclaim. + This file accepts a single key, the number of bytes to reclaim. + No nested keys are currently supported. Example:: echo "1G" > memory.reclaim + The interface can be later extended with nested keys to + configure the reclaim behavior. For example, specify the + type of memory to reclaim from (anon, file, ..). + Please note that the kernel can over or under reclaim from the target cgroup. If less bytes are reclaimed than the specified amount, -EAGAIN is returned. @@ -1263,13 +1267,6 @@ PAGE_SIZE multiple when read back. This means that the networking layer will not adapt based on reclaim induced by memory.reclaim. - This file also allows the user to specify the nodes to reclaim from, - via the 'nodes=' key, for example:: - - echo "1G nodes=0,1" > memory.reclaim - - The above instructs the kernel to reclaim memory from nodes 0,1. - memory.peak A read-only single value file which exists on non-root cgroups. diff --git a/include/linux/swap.h b/include/linux/swap.h index 2787b84eaf12..0ceed49516ad 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -418,8 +418,7 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, - unsigned int reclaim_options, - nodemask_t *nodemask); + unsigned int reclaim_options); extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem, gfp_t gfp_mask, bool noswap, pg_data_t *pgdat, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ab457f0394ab..73afff8062f9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -63,7 +63,6 @@ #include <linux/resume_user_mode.h> #include <linux/psi.h> #include <linux/seq_buf.h> -#include <linux/parser.h> #include "internal.h" #include <net/sock.h> #include <net/ip.h> @@ -2393,8 +2392,7 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg, psi_memstall_enter(&pflags); nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, - MEMCG_RECLAIM_MAY_SWAP, - NULL); + MEMCG_RECLAIM_MAY_SWAP); psi_memstall_leave(&pflags); } while ((memcg = parent_mem_cgroup(memcg)) && !mem_cgroup_is_root(memcg)); @@ -2685,8 +2683,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, psi_memstall_enter(&pflags); nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages, - gfp_mask, reclaim_options, - NULL); + gfp_mask, reclaim_options); psi_memstall_leave(&pflags); if (mem_cgroup_margin(mem_over_limit) >= nr_pages) @@ -3506,8 +3503,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg, } if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, - NULL)) { + memsw ? 
0 : MEMCG_RECLAIM_MAY_SWAP)) { ret = -EBUSY; break; } @@ -3618,8 +3614,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg) return -EINTR; if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, - MEMCG_RECLAIM_MAY_SWAP, - NULL)) + MEMCG_RECLAIM_MAY_SWAP)) nr_retries--; } @@ -6429,8 +6424,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, } reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, - NULL); + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP); if (!reclaimed && !nr_retries--) break; @@ -6479,8 +6473,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, if (nr_reclaims) { if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max, - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, - NULL)) + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP)) nr_reclaims--; continue; } @@ -6603,54 +6596,21 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, return nbytes; } -enum { - MEMORY_RECLAIM_NODES = 0, - MEMORY_RECLAIM_NULL, -}; - -static const match_table_t if_tokens = { - { MEMORY_RECLAIM_NODES, "nodes=%s" }, - { MEMORY_RECLAIM_NULL, NULL }, -}; - static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); unsigned int nr_retries = MAX_RECLAIM_RETRIES; unsigned long nr_to_reclaim, nr_reclaimed = 0; - unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP | - MEMCG_RECLAIM_PROACTIVE; - char *old_buf, *start; - substring_t args[MAX_OPT_ARGS]; - int token; - char value[256]; - nodemask_t nodemask = NODE_MASK_ALL; - - buf = strstrip(buf); - - old_buf = buf; - nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE; - if (buf == old_buf) - return -EINVAL; + unsigned int reclaim_options; + int err; buf = strstrip(buf); + err = page_counter_memparse(buf, "", &nr_to_reclaim); + if (err) + return err; - while ((start = strsep(&buf, " ")) != NULL) { - if (!strlen(start)) - continue; - token = match_token(start, if_tokens, args); - match_strlcpy(value, args, sizeof(value)); - switch (token) { - case MEMORY_RECLAIM_NODES: - if (nodelist_parse(value, nodemask) < 0) - return -EINVAL; - break; - default: - return -EINVAL; - } - } - + reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE; while (nr_reclaimed < nr_to_reclaim) { unsigned long reclaimed; @@ -6667,8 +6627,7 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_to_reclaim - nr_reclaimed, - GFP_KERNEL, reclaim_options, - &nodemask); + GFP_KERNEL, reclaim_options); if (!reclaimed && !nr_retries--) return -EAGAIN; diff --git a/mm/vmscan.c b/mm/vmscan.c index aba991c505f1..546540bc770a 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -6757,8 +6757,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, - unsigned int reclaim_options, - nodemask_t *nodemask) + unsigned int reclaim_options) { unsigned long nr_reclaimed; unsigned int noreclaim_flag; @@ -6773,7 +6772,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, .may_unmap = 1, .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP), .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE), - .nodemask = nodemask, }; /* * Traverse the ZONELIST_FALLBACK zonelist of the current node to put