diff mbox series

[RFC] cgroup: introduce proportional protection on memcg

Message ID 1648113743-32622-1-git-send-email-zhaoyang.huang@unisoc.com (mailing list archive)
State New

Commit Message

zhaoyang.huang March 24, 2022, 9:22 a.m. UTC
From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

The current memcg protection via min, low and high requires an estimate of
the protected entity's memory footprint, which can be hard to obtain on some
systems. Furthermore, usage can vary widely between scenarios (imagine keeping
50M protected while usage changes from 100M to 300M), which makes a fixed
protection less meaningful. So introduce proportional protection based on the
memcg's highest-ever usage (watermark) to overcome the above constraints.

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
 include/linux/page_counter.h |  3 +++
 mm/memcontrol.c              | 17 +++++++++++++----
 2 files changed, 16 insertions(+), 4 deletions(-)

Comments

Chris Down March 24, 2022, 2:27 p.m. UTC | #1
I'm confused by the aims of this patch. We already have proportional reclaim 
for memory.min and memory.low, and memory.high is already "proportional" by its 
nature to drive memory back down behind the configured threshold.

Could you please be more clear about what you're trying to achieve and in what 
way the existing proportional reclaim mechanisms are insufficient for you?
Roman Gushchin March 24, 2022, 4:23 p.m. UTC | #2
It seems like what’s being proposed is an ability to express the protection in % of the current usage rather than an absolute number.
It’s an equivalent for something like a memory (reclaim) priority: e.g. a cgroup with 80% protection is _always_ reclaimed less aggressively than one with a 20% protection.

That said, I’m not a fan of this idea.
It might make sense in some reasonable range of usages, but if your workload is simply leaking memory and growing indefinitely, protecting it seems like a bad idea. And the first part can be easily achieved using a userspace tool.

Thanks!

> On Mar 24, 2022, at 7:33 AM, Chris Down <chris@chrisdown.name> wrote:
> 
> I'm confused by the aims of this patch. We already have proportional reclaim for memory.min and memory.low, and memory.high is already "proportional" by its nature to drive memory back down behind the configured threshold.
> 
> Could you please be more clear about what you're trying to achieve and in what way the existing proportional reclaim mechanisms are insufficient for you?
>
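
Roman's remark that the percentage behaviour can be approximated from userspace could look roughly like the sketch below. It assumes a cgroup v2 hierarchy where the group's directory exposes memory.current and memory.low; the helper names pct_of() and set_low_from_usage() are hypothetical, and a real tool would rerun this periodically and cap the result.

```c
#include <stdio.h>

/* Return pct percent of usage (bytes), avoiding overflow of usage * pct. */
static unsigned long long pct_of(unsigned long long usage, unsigned int pct)
{
    return usage / 100 * pct + usage % 100 * pct / 100;
}

/*
 * Hypothetical helper: read <dir>/memory.current and write pct percent
 * of it to <dir>/memory.low. Returns 0 on success, -1 on any I/O error.
 */
static int set_low_from_usage(const char *dir, unsigned int pct)
{
    char path[4096];
    unsigned long long usage;
    FILE *f;

    snprintf(path, sizeof(path), "%s/memory.current", dir);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%llu", &usage) != 1) {
        fclose(f);
        return -1;
    }
    fclose(f);

    snprintf(path, sizeof(path), "%s/memory.low", dir);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%llu\n", pct_of(usage, pct));
    return fclose(f) == 0 ? 0 : -1;
}
```

Calling set_low_from_usage() with a cgroup directory and 80 would request protection for 80% of whatever the group is using at that moment, which is also why a leaking workload would keep raising its own protection unless the tool enforces an upper bound.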
Zhaoyang Huang March 25, 2022, 3:02 a.m. UTC | #3
On Thu, Mar 24, 2022 at 10:27 PM Chris Down <chris@chrisdown.name> wrote:
>
> I'm confused by the aims of this patch. We already have proportional reclaim
> for memory.min and memory.low, and memory.high is already "proportional" by its
> nature to drive memory back down behind the configured threshold.
>
> Could you please be more clear about what you're trying to achieve and in what
> way the existing proportional reclaim mechanisms are insufficient for you?
What I am trying to solve is that, in the current design, the memcg's
protection judgment [1] is based on a set of fixed values, while the actual
scan and reclaim amount [2] is derived proportionally from min/low relative
to the real memory usage, as you mentioned above. A fixed setting has
some constraints:
1. It is an empirical value based on observation, which could be inaccurate.
2. The workload varies between scenarios.
3. The fixed value from [1] can conflict with the dynamic cgroup_size in [2].

shrink_node_memcgs
[1] check if the memcg is protected based on fixed min/low value
     mem_cgroup_calculate_protection(target_memcg, memcg);
     if (mem_cgroup_below_min(memcg))
     ...
     else if (mem_cgroup_below_low(memcg))
     ...

[2] calculate the number of scan size proportionally
     shrink_lruvec
            get_scan_count
                   mem_cgroup_protection
                   scan = lruvec_size - lruvec_size * protection / (cgroup_size + 1);
Zhaoyang Huang March 25, 2022, 3:08 a.m. UTC | #4
On Fri, Mar 25, 2022 at 11:02 AM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>
> On Thu, Mar 24, 2022 at 10:27 PM Chris Down <chris@chrisdown.name> wrote:
> >
> > I'm confused by the aims of this patch. We already have proportional reclaim
> > for memory.min and memory.low, and memory.high is already "proportional" by its
> > nature to drive memory back down behind the configured threshold.
> >
> > Could you please be more clear about what you're trying to achieve and in what
> > way the existing proportional reclaim mechanisms are insufficient for you?

sorry for the bad formatting of previous reply, resend it in new format

 What I am trying to solve is that, in the current design, the memcg's
 protection judgment [1] is based on a set of fixed values, while the actual
 scan and reclaim amount [2] is derived proportionally from min/low relative
 to the real memory usage, as you mentioned above. A fixed setting has
 some constraints:
 1. It is an empirical value based on observation, which could be inaccurate.
 2. The workload varies between scenarios.
 3. The fixed value from [1] can conflict with the dynamic cgroup_size in [2].

 shrink_node_memcgs
[1] check if the memcg is protected based on fixed min/low value
     mem_cgroup_calculate_protection(target_memcg, memcg);
      if (mem_cgroup_below_min(memcg))
      ...
      else if (mem_cgroup_below_low(memcg))
      ...

[2] calculate the number of scan size proportionally
     shrink_lruvec
             get_scan_count
                    mem_cgroup_protection
                    scan = lruvec_size - lruvec_size * protection / (cgroup_size + 1);
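
The formula quoted in [2] can be tried in isolation with a small sketch like the one below. It is a simplified stand-in, not the kernel's get_scan_count(): the function name scan_target() is hypothetical, all values share one unit (pages or MB), and the kernel's additional clamping is left out.

```c
#include <assert.h>

/*
 * Simplified form of the scan-target formula quoted in [2].
 * scan_target() is a hypothetical name; the kernel applies further
 * adjustments that are omitted here.
 */
static unsigned long scan_target(unsigned long lruvec_size,
                                 unsigned long protection,
                                 unsigned long cgroup_size)
{
    /* Assumed precondition: protection does not exceed current usage. */
    assert(protection <= cgroup_size);
    return lruvec_size - lruvec_size * protection / (cgroup_size + 1);
}
```

With lruvec_size equal to cgroup_size, a fixed protection of 50 shields roughly half of the group when usage is 100, but only about a sixth of it when usage grows to 300, which is the drift the commit message describes.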
Zhaoyang Huang March 25, 2022, 3:10 a.m. UTC | #5
On Fri, Mar 25, 2022 at 12:23 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> It seems like what’s being proposed is an ability to express the protection in % of the current usage rather than an absolute number.
> It’s an equivalent for something like a memory (reclaim) priority: e.g. a cgroup with 80% protection is _always_ reclaimed less aggressively than one with a 20% protection.
>
> That said, I’m not a fan of this idea.
> It might make sense in some reasonable range of usages, but if your workload is simply leaking memory and growing indefinitely, protecting it seems like a bad idea. And the first part can be easily achieved using a userspace tool.
>
> Thanks!
>
> > On Mar 24, 2022, at 7:33 AM, Chris Down <chris@chrisdown.name> wrote:
> >
> > I'm confused by the aims of this patch. We already have proportional reclaim for memory.min and memory.low, and memory.high is already "proportional" by its nature to drive memory back down behind the configured threshold.
> >
> > Could you please be more clear about what you're trying to achieve and in what way the existing proportional reclaim mechanisms are insufficient for you?
OK, I think the memory leak issue is fixable. Please refer to my reply
to Chris's comment for more explanation.
Michal Hocko March 25, 2022, 12:49 p.m. UTC | #6
On Fri 25-03-22 11:08:00, Zhaoyang Huang wrote:
> On Fri, Mar 25, 2022 at 11:02 AM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
> >
> > On Thu, Mar 24, 2022 at 10:27 PM Chris Down <chris@chrisdown.name> wrote:
> > >
> > > I'm confused by the aims of this patch. We already have proportional reclaim
> > > for memory.min and memory.low, and memory.high is already "proportional" by its
> > > nature to drive memory back down behind the configured threshold.
> > >
> > > Could you please be more clear about what you're trying to achieve and in what
> > > way the existing proportional reclaim mechanisms are insufficient for you?
> 
> sorry for the bad formatting of previous reply, resend it in new format
> 
>  What I am trying to solve is that, in the current design, the memcg's
>  protection judgment [1] is based on a set of fixed values, while the actual
>  scan and reclaim amount [2] is derived proportionally from min/low relative
>  to the real memory usage, as you mentioned above. A fixed setting has
>  some constraints:
>  1. It is an empirical value based on observation, which could be inaccurate.
>  2. The workload varies between scenarios.
>  3. The fixed value from [1] can conflict with the dynamic cgroup_size in [2].

Could you elaborate some more on those points? I guess providing an
example of how you are using the new interface would be helpful.

Patch

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 6795913..7762629 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -27,6 +27,9 @@  struct page_counter {
 	unsigned long watermark;
 	unsigned long failcnt;
 
+	/* proportional protection */
+	unsigned long min_prop;
+	unsigned long low_prop;
 	/*
 	 * 'parent' is placed here to be far from 'usage' to reduce
 	 * cache false sharing, as 'usage' is written mostly while
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 508bcea..937c6ce 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6616,6 +6616,7 @@  void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 {
 	unsigned long usage, parent_usage;
 	struct mem_cgroup *parent;
+	unsigned long memcg_emin, memcg_elow, parent_emin, parent_elow;
 
 	if (mem_cgroup_disabled())
 		return;
@@ -6650,14 +6651,22 @@  void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 
 	parent_usage = page_counter_read(&parent->memory);
 
+	/* use proportional protect first and take 1024 as 100% */
+	memcg_emin = READ_ONCE(memcg->memory.min_prop) ?
+		READ_ONCE(memcg->memory.min_prop) * READ_ONCE(memcg->memory.watermark) / 1024 : READ_ONCE(memcg->memory.min);
+	memcg_elow = READ_ONCE(memcg->memory.low_prop) ?
+		READ_ONCE(memcg->memory.low_prop) * READ_ONCE(memcg->memory.watermark) / 1024 : READ_ONCE(memcg->memory.low);
+	parent_emin = READ_ONCE(parent->memory.min_prop) ?
+		READ_ONCE(parent->memory.min_prop) * READ_ONCE(parent->memory.watermark) / 1024 : READ_ONCE(parent->memory.emin);
+	parent_elow = READ_ONCE(parent->memory.low_prop) ?
+		READ_ONCE(parent->memory.low_prop) * READ_ONCE(parent->memory.watermark) / 1024 : READ_ONCE(parent->memory.elow);
+
 	WRITE_ONCE(memcg->memory.emin, effective_protection(usage, parent_usage,
-			READ_ONCE(memcg->memory.min),
-			READ_ONCE(parent->memory.emin),
+			memcg_emin, parent_emin,
 			atomic_long_read(&parent->memory.children_min_usage)));
 
 	WRITE_ONCE(memcg->memory.elow, effective_protection(usage, parent_usage,
-			READ_ONCE(memcg->memory.low),
-			READ_ONCE(parent->memory.elow),
+			memcg_elow, parent_elow,
 			atomic_long_read(&parent->memory.children_low_usage)));
 }
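
The selection logic the hunk above adds (use the proportional value when it is set, otherwise fall back to the fixed one) can be exercised outside the kernel with a sketch like this. The names PROP_SCALE and prop_or_fixed() are hypothetical, and the upper bound of 1024 on the proportion is an assumption; the patch as posted does not show where min_prop/low_prop are set or validated.

```c
#include <assert.h>

#define PROP_SCALE 1024UL /* the patch comment takes 1024 as 100% */

/*
 * Hypothetical standalone version of the ternary used in the patch:
 * a non-zero proportion is applied to the watermark (highest usage seen),
 * a zero proportion falls back to the fixed protection value.
 */
static unsigned long prop_or_fixed(unsigned long prop,
                                   unsigned long watermark,
                                   unsigned long fixed)
{
    /* Assumed bound, not enforced anywhere in the posted patch. */
    assert(prop <= PROP_SCALE);
    return prop ? prop * watermark / PROP_SCALE : fixed;
}
```

For example, a proportion of 512 (50%) yields 50 for a watermark of 100 and 150 for a watermark of 300, whereas a fixed setting of 50 stays at 50 in both cases. Because the watermark only ever grows, the derived protection never shrinks when usage later drops.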