
[RFC] cgroup: introduce dynamic protection for memcg

Message ID 1648713656-24254-1-git-send-email-zhaoyang.huang@unisoc.com (mailing list archive)
State New

Commit Message

zhaoyang.huang March 31, 2022, 8 a.m. UTC
From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

For some kinds of memcg, usage varies greatly across scenarios. For example, a
multimedia app could have a usage range from 50MB to 500MB, the expansion
generated by loading a special algorithm into its virtual address space, which
makes it hard to protect the expanded usage without userspace interaction.
Furthermore, a fixed memory.low works somewhat against its role of soft
protection, as it responds to any system memory pressure in the same way.

Taking all of the above into consideration, we introduce a kind of dynamic protection
based on the group's watermark and the system's memory pressure in this patch. Our aims are:
1. dynamic protection with no fixed setting
2. a proper protection value on memory.current
3. time-based decay protection
4. memory-pressure-related protection

The basic concept can be described as below, where we take group->watermark
as a representative of usage:
		group->memory.low = decayed_watermark * decay_factor
		decayed_watermark = group->watermark * func_wm_decay(time)
		decay_factor = psi_system[PSI_MEM][time]

func_wm_decay can be deemed a linear decay function that decays by 1/2 in
68s (36 bits). If we take 2048 as "1", it can be described as:
		decayed_watermark = time >> (group->wm_dec_factor - 10)
		decayed_watermark = new_usage (if new_usage > decayed_watermark)
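
As a concrete check of the default (wm_dec_factor = 36, i.e. a half-life of
2^36 ns ~ 68.7s), plugging the default into the formula above gives:
		delta_time = 2^36 ns (~68.7s)
		decay      = delta_time >> (36 - 10) = 2^10 = 1024  (half of 2048)
		decayed_watermark = watermark * (2048 - 1024) / 2048 = watermark / 2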

decay_factor is a simple table lookup that composes the final value from the
weighted some and full pressures (matching psi_mem_get() below) as:
		some = psi_system.avg[PSI_MEM * 2][time]
		full = psi_system.avg[PSI_MEM * 2 + 1][time]
		decay_factor = some * 75% + full * 25%
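
For illustration, a compact userspace-style C sketch of the combined
computation (a sketch only, not the kernel code: it mirrors the
calc_protected_low() hunk below with 2048 as "1"; psi_some/psi_full stand in
for the LOAD_INT'ed psi_system averages, taken as percentages):

	/*
	 * Sketch only: decayed_wm and the return value are in pages,
	 * delta_ns in nanoseconds, psi_some/psi_full in percent.
	 */
	static unsigned long calc_low_sketch(unsigned long decayed_wm,
					     unsigned long long delta_ns,
					     unsigned int wm_dec_fact,
					     unsigned long psi_some,
					     unsigned long psi_full)
	{
		unsigned long long decay = delta_ns >> (wm_dec_fact - 10);
		unsigned long pressure;

		if (decay > 2048)
			decay = 2048;
		/* linear time decay: the watermark halves after 2^wm_dec_fact ns */
		decayed_wm = decayed_wm * (2048 - decay) / 2048;
		/* pressure back-off: some weighted 75%, full weighted 25% */
		pressure = (psi_some * 768 + psi_full * 256) / 1024;
		return decayed_wm * (100 - pressure) / 100;
	}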

We simply tested the above change on a v5.4 based system with the below
topology and observed the behaviors we expected:
      A
     / \
    B   C
1. With regard to the protection, elow stays in a proper range as a proportion of the watermark.
2. Elapsed time has a positive impact on elow via decayed_watermark.
3. Memory pressure has a negative impact on elow, which keeps more usage
   protected when the system is under less pressure.

PS: It should be configured as a sub-type of memcg and chosen by the user
when creating the group.

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
 include/linux/memcontrol.h   | 50 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/page_counter.h |  4 ++++
 include/linux/psi.h          |  2 ++
 kernel/sched/psi.c           | 18 ++++++++++++++++
 mm/memcontrol.c              |  4 ++++
 mm/page_counter.c            |  4 ++++
 6 files changed, 82 insertions(+)

Comments

Michal Hocko March 31, 2022, 9:01 a.m. UTC | #1
On Thu 31-03-22 16:00:56, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> For some kinds of memcg, usage varies greatly across scenarios. For example, a
> multimedia app could have a usage range from 50MB to 500MB, the expansion
> generated by loading a special algorithm into its virtual address space, which
> makes it hard to protect the expanded usage without userspace interaction.

Do I get it correctly that the concern you have is that you do not know
how much memory your workload will need because that depends on some
parameters?

> Furthermore, a fixed
> memory.low works somewhat against its role of soft protection, as it responds
> to any system memory pressure in the same way.

Could you be more specific about this as well?

> Taking all of the above into consideration, we introduce a kind of dynamic protection
> based on the group's watermark and the system's memory pressure in this patch. Our aims are:
> 1. dynamic protection with no fixed setting
> 2. a proper protection value on memory.current

How do you define the "proper protection"?

> 3. time-based decay protection
> 4. memory-pressure-related protection
> 
> The basic concept can be described as below, where we take group->watermark
> as a representative of usage:
> 		group->memory.low = decayed_watermark * decay_factor
> 		decayed_watermark = group->watermark * func_wm_decay(time)
> 		decay_factor = psi_system[PSI_MEM][time]
> 
> func_wm_decay can be deemed a linear decay function that decays by 1/2 in
> 68s (36 bits). If we take 2048 as "1", it can be described as:
> 		decayed_watermark = time >> (group->wm_dec_factor - 10)
> 		decayed_watermark = new_usage (if new_usage > decayed_watermark)
> 
> decay_factor is a simple table lookup that composes the final value from the
> weighted some and full pressures (matching psi_mem_get() below) as:
> 		some = psi_system.avg[PSI_MEM * 2][time]
> 		full = psi_system.avg[PSI_MEM * 2 + 1][time]
> 		decay_factor = some * 75% + full * 25%

One very important thing that I am missing here is the overall objective of
this tuning. From the above it seems that you want to (ab)use memory->low to
protect some portion of the charged memory and that the protection
shrinks over time depending on the global PSI metric and time.
But why is this a good thing?

Also, you seem to base your back-off on the global PSI numbers. How does
that cope with memcg memory pressure?

> We simply tested the above change on a v5.4 based system with the below
> topology and observed the behaviors we expected:
>       A
>      / \
>     B   C
> 1. With regard to the protection, elow stays in a proper range as a proportion of the watermark.
> 2. Elapsed time has a positive impact on elow via decayed_watermark.
> 3. Memory pressure has a negative impact on elow, which keeps more usage
>    protected when the system is under less pressure.

I am sorry but this doesn't really help much. As pointed out above,
I completely fail to see what the expected behavior is and why it makes
sense.

> PS: It should be configured as a sub-type of memcg and chosen by the user
> when creating the group.

I do not understand, but I suspect that a non-RFC proposal would use a
dedicated user interface. Your current implementation is surely not
acceptable as it changes the semantics of a limit without any way to opt
out of that behavior. Memory low has well-established semantics, so this
cannot really be changed without an explicit opt-in.

> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> ---
>  include/linux/memcontrol.h   | 50 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/page_counter.h |  4 ++++
>  include/linux/psi.h          |  2 ++
>  kernel/sched/psi.c           | 18 ++++++++++++++++
>  mm/memcontrol.c              |  4 ++++
>  mm/page_counter.c            |  4 ++++
>  6 files changed, 82 insertions(+)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0c5c403..a510057 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -21,6 +21,9 @@
>  #include <linux/vmstat.h>
>  #include <linux/writeback.h>
>  #include <linux/page-flags.h>
> +#include <linux/sched/loadavg.h>
> +#include <linux/sched/clock.h>
> +#include <linux/psi.h>
>  
>  struct mem_cgroup;
>  struct obj_cgroup;
> @@ -28,6 +31,8 @@
>  struct mm_struct;
>  struct kmem_cache;
>  
> +#define MEMCG_INTERVAL	(2*HZ+1)	/* 2 sec intervals */
> +
>  /* Cgroup-specific page state, on top of universal node page state */
>  enum memcg_stat_item {
>  	MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
> @@ -340,6 +345,10 @@ struct mem_cgroup {
>  	struct deferred_split deferred_split_queue;
>  #endif
>  
> +	u64 wm_dec_fact;
> +	u64 avg_next_update;
> +	u64 avg_last_update;
> +
>  	struct mem_cgroup_per_node *nodeinfo[];
>  };
>  
> @@ -608,6 +617,47 @@ static inline bool mem_cgroup_disabled(void)
>  	return !cgroup_subsys_enabled(memory_cgrp_subsys);
>  }
>  
> +/*
> + * calculate memory.low based on the historic watermark and memory pressure
> + */
> +static inline void calc_protected_low(struct mem_cgroup *group)
> +{
> +	u64 now, decay_factor;
> +	u64 decayed_watermark;
> +	u64 delta_time;
> +
> +	now = sched_clock();
> +
> +	if (!group->avg_next_update) {
> +		group->avg_next_update = now + jiffies_to_nsecs(5*HZ);
> +		return;
> +	}
> +
> +	if (time_before((unsigned long)now, (unsigned long)group->avg_next_update))
> +		return;
> +
> +	delta_time = group->avg_last_update ? now - group->avg_last_update : 0;
> +	/*
> +	 * we take 2048 as "1" and 68s decay 1/2(36bit) by default
> +	 * decay_factor = 1024 * delta_time / 68s(0x1000000000)
> +	 * 0.5(1024)/68s = decay_factor/delta_time ==> decay_factor = delta_time >> 26
> +	 */
> +	decay_factor = (2048 - min(2048ULL, delta_time >> (group->wm_dec_fact - 10)));
> +	decayed_watermark = group->memory.decayed_watermark * decay_factor / 2048;
> +	/* decay_factor: based on average memory pressure over elapsed time */
> +	decay_factor = psi_mem_get(delta_time);
> +	group->memory.low = decayed_watermark * (100 - decay_factor) / 100;
> +
> +	/*
> +	 * avg_next_update: expected expire time according to current status
> +	 */
> +	group->memory.decayed_watermark = decayed_watermark;
> +	group->avg_last_update = now;
> +	group->avg_next_update = now + jiffies_to_nsecs(2*HZ);
> +
> +	return;
> +}
> +
>  static inline void mem_cgroup_protection(struct mem_cgroup *root,
>  					 struct mem_cgroup *memcg,
>  					 unsigned long *min,
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 6795913..2720eb9f 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -25,8 +25,12 @@ struct page_counter {
>  
>  	/* legacy */
>  	unsigned long watermark;
> +	unsigned long decayed_watermark;
>  	unsigned long failcnt;
>  
> +	/* proportional protection */
> +	unsigned long min_prop;
> +	unsigned long low_prop;
>  	/*
>  	 * 'parent' is placed here to be far from 'usage' to reduce
>  	 * cache false sharing, as 'usage' is written mostly while
> diff --git a/include/linux/psi.h b/include/linux/psi.h
> index 65eb147..6c76993 100644
> --- a/include/linux/psi.h
> +++ b/include/linux/psi.h
> @@ -25,6 +25,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>  
>  int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
>  
> +unsigned long psi_mem_get(unsigned long time);
> +
>  #ifdef CONFIG_CGROUPS
>  int psi_cgroup_alloc(struct cgroup *cgrp);
>  void psi_cgroup_free(struct cgroup *cgrp);
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index dd80bd2..8d315e0 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -291,6 +291,24 @@ static void get_recent_times(struct psi_group *group, int cpu,
>  	}
>  }
>  
> +unsigned long psi_mem_get(unsigned long time_ns)
> +{
> +	unsigned long time_sec = time_ns / (1000 * 1000 * 1000);
> +	unsigned long some, full;
> +	if (time_sec < 10) {
> +		some = LOAD_INT(psi_system.avg[PSI_MEM * 2][0]);
> +		full = LOAD_INT(psi_system.avg[PSI_MEM * 2 + 1][0]);
> +	} else if (time_sec < 60) {
> +		some = LOAD_INT(psi_system.avg[PSI_MEM * 2][1]);
> +		full = LOAD_INT(psi_system.avg[PSI_MEM * 2 + 1][1]);
> +	} else {
> +		some = LOAD_INT(psi_system.avg[PSI_MEM * 2][2]);
> +		full = LOAD_INT(psi_system.avg[PSI_MEM * 2 + 1][2]);
> +	}
> +
> +	return (some * 768 + full * 256) / 1024;
> +}
> +
>  static void calc_avgs(unsigned long avg[3], int missed_periods,
>  		      u64 time, u64 period)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 508bcea..6b579a4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5188,6 +5188,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>  	page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
>  	memcg->soft_limit = PAGE_COUNTER_MAX;
>  	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
> +	memcg->wm_dec_fact = 36;
>  	if (parent) {
>  		memcg->swappiness = mem_cgroup_swappiness(parent);
>  		memcg->oom_kill_disable = parent->oom_kill_disable;
> @@ -6616,6 +6617,8 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
>  {
>  	unsigned long usage, parent_usage;
>  	struct mem_cgroup *parent;
> +	unsigned long memcg_emin, memcg_elow, parent_emin, parent_elow;
> +	unsigned long watermark;
>  
>  	if (mem_cgroup_disabled())
>  		return;
> @@ -6642,6 +6645,7 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
>  	if (!parent)
>  		return;
>  
> +	calc_protected_low(memcg);
>  	if (parent == root) {
>  		memcg->memory.emin = READ_ONCE(memcg->memory.min);
>  		memcg->memory.elow = READ_ONCE(memcg->memory.low);
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index 7d83641..18abfdd 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -83,6 +83,8 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
>  		 */
>  		if (new > READ_ONCE(c->watermark))
>  			WRITE_ONCE(c->watermark, new);
> +		if (new > READ_ONCE(c->decayed_watermark))
> +			WRITE_ONCE(c->decayed_watermark, new);
>  	}
>  }
>  
> @@ -137,6 +139,8 @@ bool page_counter_try_charge(struct page_counter *counter,
>  		 */
>  		if (new > READ_ONCE(c->watermark))
>  			WRITE_ONCE(c->watermark, new);
> +		if (new > READ_ONCE(c->decayed_watermark))
> +			WRITE_ONCE(c->decayed_watermark, new);
>  	}
>  	return true;
>  
> -- 
> 1.9.1
Zhaoyang Huang March 31, 2022, 11:18 a.m. UTC | #2
On Thu, Mar 31, 2022 at 5:01 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 31-03-22 16:00:56, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > For some kinds of memcg, usage varies greatly across scenarios. For example, a
> > multimedia app could have a usage range from 50MB to 500MB, the expansion
> > generated by loading a special algorithm into its virtual address space, which
> > makes it hard to protect the expanded usage without userspace interaction.
>
> Do I get it correctly that the concern you have is that you do not know
> how much memory your workload will need because that depends on some
> parameters?
Right. For example, a camera APP will expand its usage from 50MB to 500MB
because of launching a special function (face beauty etc. needs a special
algorithm).
>
> > Furthermore, a fixed
> > memory.low works somewhat against its role of soft protection, as it responds
> > to any system memory pressure in the same way.
>
> Could you be more specific about this as well?
As in the camera case above, if we set memory.low to 200MB to keep the
APP running smoothly, the system will experience high memory pressure when
another high-load APP is launched simultaneously. I would like to have the
camera reclaimed under this scenario.
>
> > Taking all of the above into consideration, we introduce a kind of dynamic protection
> > based on the group's watermark and the system's memory pressure in this patch. Our aims are:
> > 1. dynamic protection with no fixed setting
> > 2. a proper protection value on memory.current
>
> How do you define the "proper protection"?
I would like to share more test results in the next version.
>
> > 3. time-based decay protection
> > 4. memory-pressure-related protection
> >
> > The basic concept can be described as below, where we take group->watermark
> > as a representative of usage:
> >               group->memory.low = decayed_watermark * decay_factor
> >               decayed_watermark = group->watermark * func_wm_decay(time)
> >               decay_factor = psi_system[PSI_MEM][time]
> >
> > func_wm_decay can be deemed a linear decay function that decays by 1/2 in
> > 68s (36 bits). If we take 2048 as "1", it can be described as:
> >               decayed_watermark = time >> (group->wm_dec_factor - 10)
> >               decayed_watermark = new_usage (if new_usage > decayed_watermark)
> >
> > decay_factor is a simple table lookup that composes the final value from the
> > weighted some and full pressures (matching psi_mem_get() below) as:
> >               some = psi_system.avg[PSI_MEM * 2][time]
> >               full = psi_system.avg[PSI_MEM * 2 + 1][time]
> >               decay_factor = some * 75% + full * 25%
>
> One very important thing that I am missing here is the overall objective of
> this tuning. From the above it seems that you want to (ab)use memory->low to
> protect some portion of the charged memory and that the protection
> shrinks over time depending on the global PSI metric and time.
> But why is this a good thing?
'Good' means it meets my original goal of keeping the usage protected during
a period of time while responding to the system's memory pressure. For an
Android-like system, memory is almost always in a tight state no matter how
much RAM it has. What we need from memcg is more than control and grouping;
we need it to be more responsive to the system's load and able to sacrifice
its usage under certain criteria.
>
> Also, you seem to base your back-off on the global PSI numbers. How does
> that cope with memcg memory pressure?
For simplicity, we only take global PSI into consideration for now.
>
> > We simply tested the above change on a v5.4 based system with the below
> > topology and observed the behaviors we expected:
> >       A
> >      / \
> >     B   C
> > 1. With regard to the protection, elow stays in a proper range as a proportion of the watermark.
> > 2. Elapsed time has a positive impact on elow via decayed_watermark.
> > 3. Memory pressure has a negative impact on elow, which keeps more usage
> >    protected when the system is under less pressure.
>
> I am sorry but this doesn't really help much. As pointed out above,
> I completely fail to see what the expected behavior is and why it makes
> sense.
I will test it under real scenarios and provide more intuitive results.
>
> > PS: It should be configured as a sub-type of memcg and chosen by the user
> > when creating the group.
>
> I do not understand, but I suspect that a non-RFC proposal would use a
> dedicated user interface. Your current implementation is surely not
> acceptable as it changes the semantics of a limit without any way to opt
> out of that behavior. Memory low has well-established semantics, so this
> cannot really be changed without an explicit opt-in.
Opt-in changes will be added in the next version and will NOT affect the
current design.
>
> > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > ---
> >  include/linux/memcontrol.h   | 50 ++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/page_counter.h |  4 ++++
> >  include/linux/psi.h          |  2 ++
> >  kernel/sched/psi.c           | 18 ++++++++++++++++
> >  mm/memcontrol.c              |  4 ++++
> >  mm/page_counter.c            |  4 ++++
> >  6 files changed, 82 insertions(+)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 0c5c403..a510057 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -21,6 +21,9 @@
> >  #include <linux/vmstat.h>
> >  #include <linux/writeback.h>
> >  #include <linux/page-flags.h>
> > +#include <linux/sched/loadavg.h>
> > +#include <linux/sched/clock.h>
> > +#include <linux/psi.h>
> >
> >  struct mem_cgroup;
> >  struct obj_cgroup;
> > @@ -28,6 +31,8 @@
> >  struct mm_struct;
> >  struct kmem_cache;
> >
> > +#define MEMCG_INTERVAL       (2*HZ+1)        /* 2 sec intervals */
> > +
> >  /* Cgroup-specific page state, on top of universal node page state */
> >  enum memcg_stat_item {
> >       MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
> > @@ -340,6 +345,10 @@ struct mem_cgroup {
> >       struct deferred_split deferred_split_queue;
> >  #endif
> >
> > +     u64 wm_dec_fact;
> > +     u64 avg_next_update;
> > +     u64 avg_last_update;
> > +
> >       struct mem_cgroup_per_node *nodeinfo[];
> >  };
> >
> > @@ -608,6 +617,47 @@ static inline bool mem_cgroup_disabled(void)
> >       return !cgroup_subsys_enabled(memory_cgrp_subsys);
> >  }
> >
> > +/*
> > + * calculate memory.low based on the historic watermark and memory pressure
> > + */
> > +static inline void calc_protected_low(struct mem_cgroup *group)
> > +{
> > +     u64 now, decay_factor;
> > +     u64 decayed_watermark;
> > +     u64 delta_time;
> > +
> > +     now = sched_clock();
> > +
> > +     if (!group->avg_next_update) {
> > +             group->avg_next_update = now + jiffies_to_nsecs(5*HZ);
> > +             return;
> > +     }
> > +
> > +     if (time_before((unsigned long)now, (unsigned long)group->avg_next_update))
> > +             return;
> > +
> > +     delta_time = group->avg_last_update ? now - group->avg_last_update : 0;
> > +     /*
> > +      * we take 2048 as "1" and 68s decay 1/2(36bit) by default
> > +      * decay_factor = 1024 * delta_time / 68s(0x1000000000)
> > +      * 0.5(1024)/68s = decay_factor/delta_time ==> decay_factor = delta_time >> 26
> > +      */
> > +     decay_factor = (2048 - min(2048ULL, delta_time >> (group->wm_dec_fact - 10)));
> > +     decayed_watermark = group->memory.decayed_watermark * decay_factor / 2048;
> > +     /* decay_factor: based on average memory pressure over elapsed time */
> > +     decay_factor = psi_mem_get(delta_time);
> > +     group->memory.low = decayed_watermark * (100 - decay_factor) / 100;
> > +
> > +     /*
> > +      * avg_next_update: expected expire time according to current status
> > +      */
> > +     group->memory.decayed_watermark = decayed_watermark;
> > +     group->avg_last_update = now;
> > +     group->avg_next_update = now + jiffies_to_nsecs(2*HZ);
> > +
> > +     return;
> > +}
> > +
> >  static inline void mem_cgroup_protection(struct mem_cgroup *root,
> >                                        struct mem_cgroup *memcg,
> >                                        unsigned long *min,
> > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> > index 6795913..2720eb9f 100644
> > --- a/include/linux/page_counter.h
> > +++ b/include/linux/page_counter.h
> > @@ -25,8 +25,12 @@ struct page_counter {
> >
> >       /* legacy */
> >       unsigned long watermark;
> > +     unsigned long decayed_watermark;
> >       unsigned long failcnt;
> >
> > +     /* proportional protection */
> > +     unsigned long min_prop;
> > +     unsigned long low_prop;
> >       /*
> >        * 'parent' is placed here to be far from 'usage' to reduce
> >        * cache false sharing, as 'usage' is written mostly while
> > diff --git a/include/linux/psi.h b/include/linux/psi.h
> > index 65eb147..6c76993 100644
> > --- a/include/linux/psi.h
> > +++ b/include/linux/psi.h
> > @@ -25,6 +25,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> >
> >  int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
> >
> > +unsigned long psi_mem_get(unsigned long time);
> > +
> >  #ifdef CONFIG_CGROUPS
> >  int psi_cgroup_alloc(struct cgroup *cgrp);
> >  void psi_cgroup_free(struct cgroup *cgrp);
> > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > index dd80bd2..8d315e0 100644
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -291,6 +291,24 @@ static void get_recent_times(struct psi_group *group, int cpu,
> >       }
> >  }
> >
> > +unsigned long psi_mem_get(unsigned long time_ns)
> > +{
> > +     unsigned long time_sec = time_ns / (1000 * 1000 * 1000);
> > +     unsigned long some, full;
> > +     if (time_sec < 10) {
> > +             some = LOAD_INT(psi_system.avg[PSI_MEM * 2][0]);
> > +             full = LOAD_INT(psi_system.avg[PSI_MEM * 2 + 1][0]);
> > +     } else if (time_sec < 60) {
> > +             some = LOAD_INT(psi_system.avg[PSI_MEM * 2][1]);
> > +             full = LOAD_INT(psi_system.avg[PSI_MEM * 2 + 1][1]);
> > +     } else {
> > +             some = LOAD_INT(psi_system.avg[PSI_MEM * 2][2]);
> > +             full = LOAD_INT(psi_system.avg[PSI_MEM * 2 + 1][2]);
> > +     }
> > +
> > +     return (some * 768 + full * 256) / 1024;
> > +}
> > +
> >  static void calc_avgs(unsigned long avg[3], int missed_periods,
> >                     u64 time, u64 period)
> >  {
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 508bcea..6b579a4 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5188,6 +5188,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
> >       page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
> >       memcg->soft_limit = PAGE_COUNTER_MAX;
> >       page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
> > +     memcg->wm_dec_fact = 36;
> >       if (parent) {
> >               memcg->swappiness = mem_cgroup_swappiness(parent);
> >               memcg->oom_kill_disable = parent->oom_kill_disable;
> > @@ -6616,6 +6617,8 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
> >  {
> >       unsigned long usage, parent_usage;
> >       struct mem_cgroup *parent;
> > +     unsigned long memcg_emin, memcg_elow, parent_emin, parent_elow;
> > +     unsigned long watermark;
> >
> >       if (mem_cgroup_disabled())
> >               return;
> > @@ -6642,6 +6645,7 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
> >       if (!parent)
> >               return;
> >
> > +     calc_protected_low(memcg);
> >       if (parent == root) {
> >               memcg->memory.emin = READ_ONCE(memcg->memory.min);
> >               memcg->memory.elow = READ_ONCE(memcg->memory.low);
> > diff --git a/mm/page_counter.c b/mm/page_counter.c
> > index 7d83641..18abfdd 100644
> > --- a/mm/page_counter.c
> > +++ b/mm/page_counter.c
> > @@ -83,6 +83,8 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
> >                */
> >               if (new > READ_ONCE(c->watermark))
> >                       WRITE_ONCE(c->watermark, new);
> > +             if (new > READ_ONCE(c->decayed_watermark))
> > +                     WRITE_ONCE(c->decayed_watermark, new);
> >       }
> >  }
> >
> > @@ -137,6 +139,8 @@ bool page_counter_try_charge(struct page_counter *counter,
> >                */
> >               if (new > READ_ONCE(c->watermark))
> >                       WRITE_ONCE(c->watermark, new);
> > +             if (new > READ_ONCE(c->decayed_watermark))
> > +                     WRITE_ONCE(c->decayed_watermark, new);
> >       }
> >       return true;
> >
> > --
> > 1.9.1
>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko March 31, 2022, 11:35 a.m. UTC | #3
On Thu 31-03-22 19:18:58, Zhaoyang Huang wrote:
> On Thu, Mar 31, 2022 at 5:01 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 31-03-22 16:00:56, zhaoyang.huang wrote:
> > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > >
> > > For some kinds of memcg, usage varies greatly across scenarios. For example, a
> > > multimedia app could have a usage range from 50MB to 500MB, the expansion
> > > generated by loading a special algorithm into its virtual address space, which
> > > makes it hard to protect the expanded usage without userspace interaction.
> >
> > Do I get it correctly that the concern you have is that you do not know
> > how much memory your workload will need because that depends on some
> > parameters?
> Right. For example, a camera APP will expand its usage from 50MB to 500MB
> because of launching a special function (face beauty etc. needs a special
> algorithm).
> >
> > > Furthermore, a fixed
> > > memory.low works somewhat against its role of soft protection, as it responds
> > > to any system memory pressure in the same way.
> >
> > Could you be more specific about this as well?
> As in the camera case above, if we set memory.low to 200MB to keep the
> APP running smoothly, the system will experience high memory pressure when
> another high-load APP is launched simultaneously. I would like to have the
> camera reclaimed under this scenario.

OK, so you effectively want to keep the memory protection when there is
"normal" memory pressure but want to relax the protection in other
high memory utilization situations?

How exactly do you tell the difference between steady memory pressure
(say stream IO on the page cache) and a "high load APP launched" event? Should
you reduce the protection in the stream IO situation as well?

[...]
> > One very important thing that I am missing here is the overall objective of
> > this tuning. From the above it seems that you want to (ab)use memory->low to
> > protect some portion of the charged memory and that the protection
> > shrinks over time depending on the global PSI metric and time.
> > But why is this a good thing?
> 'Good' means it meets my original goal of keeping the usage protected during
> a period of time while responding to the system's memory pressure. For an
> Android-like system, memory is almost always in a tight state no matter how
> much RAM it has. What we need from memcg is more than control and grouping;
> we need it to be more responsive to the system's load and able to sacrifice
> its usage under certain criteria.

Why are existing tools/APIs insufficient for that? You can watch for
both global and memcg memory pressure including PSI metrics and update
limits dynamically. Why is it necessary to put such logic into the
kernel?
Suren Baghdasaryan March 31, 2022, 7:26 p.m. UTC | #4
On Thu, Mar 31, 2022 at 4:35 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 31-03-22 19:18:58, Zhaoyang Huang wrote:
> > On Thu, Mar 31, 2022 at 5:01 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 31-03-22 16:00:56, zhaoyang.huang wrote:
> > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > >
> > > > For some kinds of memcg, usage varies greatly across scenarios. For example, a
> > > > multimedia app could have a usage range from 50MB to 500MB, the expansion
> > > > generated by loading a special algorithm into its virtual address space, which
> > > > makes it hard to protect the expanded usage without userspace interaction.
> > >
> > > Do I get it correctly that the concern you have is that you do not know
> > > how much memory your workload will need because that depends on some
> > > parameters?
> > Right. For example, a camera APP will expand its usage from 50MB to 500MB
> > because of launching a special function (face beauty etc. needs a special
> > algorithm).
> > >
> > > > Furthermore, a fixed
> > > > memory.low works somewhat against its role of soft protection, as it responds
> > > > to any system memory pressure in the same way.
> > >
> > > Could you be more specific about this as well?
> > As in the camera case above, if we set memory.low to 200MB to keep the
> > APP running smoothly, the system will experience high memory pressure when
> > another high-load APP is launched simultaneously. I would like to have the
> > camera reclaimed under this scenario.
>
> OK, so you effectively want to keep the memory protection when there is
> "normal" memory pressure but want to relax the protection in other
> high memory utilization situations?
>
> How exactly do you tell the difference between steady memory pressure
> (say stream IO on the page cache) and a "high load APP launched" event? Should
> you reduce the protection in the stream IO situation as well?

IIUC what you are implementing here is a "memory allowance boost"
feature and it seems you are implementing it entirely inside the
kernel, while only userspace knows when to apply this boost (say at
app launch time). This does not make sense to me.

>
> [...]
> > > One very important thing that I am missing here is the overall objective of
> > > this tuning. From the above it seems that you want to (ab)use memory->low to
> > > protect some portion of the charged memory and that the protection
> > > shrinks over time depending on the global PSI metric and time.
> > > But why is this a good thing?
> > 'Good' means it meets my original goal of keeping the usage protected during
> > a period of time while responding to the system's memory pressure. For an
> > Android-like system, memory is almost always in a tight state no matter how
> > much RAM it has. What we need from memcg is more than control and grouping;
> > we need it to be more responsive to the system's load and able to sacrifice
> > its usage under certain criteria.
>
> Why are existing tools/APIs insufficient for that? You can watch for
> both global and memcg memory pressure including PSI metrics and update
> limits dynamically. Why is it necessary to put such logic into the
> kernel?

I had exactly the same thought while reading through this.
In Android you would probably need to implement a userspace service
which would temporarily relax the memcg limits when required, monitor
PSI levels and adjust the limits accordingly.
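
(For reference, a minimal sketch of such a userspace service using the
documented PSI trigger interface from Documentation/accounting/psi.rst; the
cgroup path, trigger thresholds, and the relaxed memory.low value below are
illustrative assumptions only:)

	#include <fcntl.h>
	#include <poll.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		/* wake up on >=150ms of "some" memory stall within a 1s window */
		const char trig[] = "some 150000 1000000";
		struct pollfd fds;

		fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
		if (fds.fd < 0 || write(fds.fd, trig, strlen(trig) + 1) < 0)
			return 1;
		fds.events = POLLPRI;

		while (poll(&fds, 1, -1) > 0) {
			if (fds.revents & POLLERR)
				break;
			if (fds.revents & POLLPRI) {
				/* pressure event: shrink the group's protection */
				FILE *f = fopen("/sys/fs/cgroup/camera/memory.low", "w");

				if (f) {
					fprintf(f, "100M\n");
					fclose(f);
				}
			}
		}
		return 0;
	}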

>
> --
> Michal Hocko
> SUSE Labs
Zhaoyang Huang April 1, 2022, 1:34 a.m. UTC | #5
On Thu, Mar 31, 2022 at 7:35 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 31-03-22 19:18:58, Zhaoyang Huang wrote:
> > On Thu, Mar 31, 2022 at 5:01 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 31-03-22 16:00:56, zhaoyang.huang wrote:
> > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > >
> > > > For some kinds of memcg, usage varies greatly across scenarios. For example, a
> > > > multimedia app could have a usage range from 50MB to 500MB, the expansion
> > > > generated by loading a special algorithm into its virtual address space, which
> > > > makes it hard to protect the expanded usage without userspace interaction.
> > >
> > > Do I get it correctly that the concern you have is that you do not know
> > > how much memory your workload will need because that depends on some
> > > parameters?
> > Right. For example, a camera APP will expand its usage from 50MB to 500MB
> > because of launching a special function (face beauty etc. needs a special
> > algorithm).
> > >
> > > > Furthermore, a fixed
> > > > memory.low works somewhat against its role of soft protection, as it responds
> > > > to any system memory pressure in the same way.
> > >
> > > Could you be more specific about this as well?
> > As in the camera case above, if we set memory.low to 200MB to keep the
> > APP running smoothly, the system will experience high memory pressure when
> > another high-load APP is launched simultaneously. I would like to have the
> > camera reclaimed under this scenario.
>
> OK, so you effectively want to keep the memory protection when there is
> "normal" memory pressure but want to relax the protection in other
> high memory utilization situations?
>
> How exactly do you tell the difference between steady memory pressure
> (say stream IO on the page cache) and a "high load APP launched" event? Should
> you reduce the protection in the stream IO situation as well?
We can take either the system's io_wait or PSI_IO into consideration for these cases.
>
> [...]
> > > One very important thing that I am missing here is the overall objective of
> > > this tuning. From the above it seems that you want to (ab)use memory->low to
> > > protect some portion of the charged memory and that the protection
> > > shrinks over time depending on the global PSI metric and time.
> > > But why is this a good thing?
> > 'Good' means it meets my original goal of keeping the usage protected during
> > a period of time while responding to the system's memory pressure. For an
> > Android-like system, memory is almost always in a tight state no matter how
> > much RAM it has. What we need from memcg is more than control and grouping;
> > we need it to be more responsive to the system's load and able to sacrifice
> > its usage under certain criteria.
>
> Why are existing tools/APIs insufficient for that? You can watch for
> both global and memcg memory pressure including PSI metrics and update
> limits dynamically. Why is it necessary to put such logic into the
> kernel?
A poll-then-react method in userspace requires a polling interval
and response time. Take PSI as an example: it polls ten times during
POLLING_INTERVAL while reporting just once, which introduces latency
to some extent.
>
> --
> Michal Hocko
> SUSE Labs
Zhaoyang Huang April 1, 2022, 1:51 a.m. UTC | #6
On Fri, Apr 1, 2022 at 3:26 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Mar 31, 2022 at 4:35 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 31-03-22 19:18:58, Zhaoyang Huang wrote:
> > > On Thu, Mar 31, 2022 at 5:01 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 31-03-22 16:00:56, zhaoyang.huang wrote:
> > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > >
> > > > > For some kinds of memcg, usage varies greatly across scenarios. For example, a
> > > > > multimedia app could have a usage range from 50MB to 500MB, the expansion
> > > > > generated by loading a special algorithm into its virtual address space, which
> > > > > makes it hard to protect the expanded usage without userspace interaction.
> > > >
> > > > Do I get it correctly that the concern you have is that you do not know
> > > > how much memory your workload will need because that depends on some
> > > > parameters?
> > > Right. For example, a camera APP will expand its usage from 50MB to 500MB
> > > because of launching a special function (face beauty etc. needs a special
> > > algorithm).
> > > >
> > > > > Furthermore, a fixed
> > > > > memory.low works somewhat against its role of soft protection, as it responds
> > > > > to any system memory pressure in the same way.
> > > >
> > > > Could you be more specific about this as well?
> > > As in the camera case above, if we set memory.low to 200MB to keep the
> > > APP running smoothly, the system will experience high memory pressure when
> > > another high-load APP is launched simultaneously. I would like to have the
> > > camera reclaimed under this scenario.
> >
> > OK, so you effectively want to keep the memory protection when there is
> > "normal" memory pressure but want to relax the protection in other
> > high memory utilization situations?
> >
> > How exactly do you tell the difference between steady memory pressure
> > (say stream IO on the page cache) and a "high load APP launched" event? Should
> > you reduce the protection in the stream IO situation as well?
>
> IIUC what you are implementing here is a "memory allowance boost"
> feature and it seems you are implementing it entirely inside the
> kernel, while only userspace knows when to apply this boost (say at
> app launch time). This does not make sense to me.
I am wondering if it could be more helpful to apply this patch to
background services (system_server etc.) rather than APPs, as those
services are persistent in the system.
>
> >
> > [...]
> > > > One very important thing that I am missing here is the overall objective of
> > > > this tuning. From the above it seems that you want to (ab)use memory->low to
> > > > protect some portion of the charged memory and that the protection
> > > > shrinks over time depending on the global PSI metric and time.
> > > > But why is this a good thing?
> > > 'Good' means it meets my original goal of keeping the usage protected during
> > > a period of time while responding to the system's memory pressure. For an
> > > Android-like system, memory is almost always in a tight state no matter how
> > > much RAM it has. What we need from memcg is more than control and grouping;
> > > we need it to be more responsive to the system's load and able to sacrifice
> > > its usage under certain criteria.
> >
> > Why are existing tools/APIs insufficient for that? You can watch for
> > both global and memcg memory pressure including PSI metrics and update
> > limits dynamically. Why is it necessary to put such logic into the
> > kernel?
>
> I had exactly the same thought while reading through this.
> In Android you would probably need to implement a userspace service
> which would temporarily relax the memcg limits when required, monitor
> PSI levels and adjust the limits accordingly.
As in my response to Michal's comment, userspace monitors introduce
latency. Take LMKD as an example: it is actually driven by the
PSI_POLL_PERIOD_XXX_MS after the first wakeup, which means
PSI_WINDOW_SIZE_MS could be too big to rely on. IMHO, with regard to
response time, LMKD is less efficient than the lmk driver but stronger
on the strategy side. I would like to test this patch under real
Android workloads and give feedback in the next version.
>
> >
> > --
> > Michal Hocko
> > SUSE Labs
Suren Baghdasaryan April 1, 2022, 4:46 a.m. UTC | #7
On Thu, Mar 31, 2022 at 6:51 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>
> On Fri, Apr 1, 2022 at 3:26 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Mar 31, 2022 at 4:35 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 31-03-22 19:18:58, Zhaoyang Huang wrote:
> > > > On Thu, Mar 31, 2022 at 5:01 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Thu 31-03-22 16:00:56, zhaoyang.huang wrote:
> > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > > >
> > > > > > For some kinds of memcg, usage varies greatly across scenarios. For example, a
> > > > > > multimedia app could have a usage range from 50MB to 500MB, the expansion
> > > > > > generated by loading a special algorithm into its virtual address space, which
> > > > > > makes it hard to protect the expanded usage without userspace interaction.
> > > > >
> > > > > Do I get it correctly that the concern you have is that you do not know
> > > > > how much memory your workload will need because that depends on some
> > > > > parameters?
> > > > Right. For example, a camera APP will expand its usage from 50MB to 500MB
> > > > because of launching a special function (face beauty etc. needs a special
> > > > algorithm).
> > > > >
> > > > > > Furthermore, a fixed
> > > > > > memory.low works somewhat against its role of soft protection, as it responds
> > > > > > to any system memory pressure in the same way.
> > > > >
> > > > > Could you be more specific about this as well?
> > > > As in the camera case above, if we set memory.low to 200MB to keep the
> > > > APP running smoothly, the system will experience high memory pressure when
> > > > another high-load APP is launched simultaneously. I would like to have the
> > > > camera reclaimed under this scenario.
> > >
> > > OK, so you effectively want to keep the memory protection when there is
> > > "normal" memory pressure but want to relax the protection in other
> > > high memory utilization situations?
> > >
> > > How exactly do you tell the difference between steady memory pressure
> > > (say stream IO on the page cache) and a "high load APP launched" event? Should
> > > you reduce the protection in the stream IO situation as well?
> >
> > IIUC what you are implementing here is a "memory allowance boost"
> > feature and it seems you are implementing it entirely inside the
> > kernel, while only userspace knows when to apply this boost (say at
> > app launch time). This does not make sense to me.
> I am wondering if it could be more helpful to apply this patch to
> background services (system_server etc.) rather than APPs, as those
> services are persistent in the system.
> >
> > >
> > > [...]
> > > > > One very important thing that I am missing here is the overall objective of
> > > > > this tuning. From the above it seems that you want to (ab)use memory->low to
> > > > > protect some portion of the charged memory and that the protection
> > > > > shrinks over time depending on the global PSI metric and time.
> > > > > But why is this a good thing?
> > > > 'Good' means it meets my original goal of keeping the usage protected during
> > > > a period of time while responding to the system's memory pressure. For an
> > > > Android-like system, memory is almost always in a tight state no matter how
> > > > much RAM it has. What we need from memcg is more than control and grouping;
> > > > we need it to be more responsive to the system's load and able to sacrifice
> > > > its usage under certain criteria.
> > >
> > > Why are existing tools/APIs insufficient for that? You can watch for
> > > both global and memcg memory pressure including PSI metrics and update
> > > limits dynamically. Why is it necessary to put such logic into the
> > > kernel?
> >
> > I had exactly the same thought while reading through this.
> > In Android you would probably need to implement a userspace service
> > which would temporarily relax the memcg limits when required, monitor
> > PSI levels and adjust the limits accordingly.
> As in my response to Michal's comment, userspace monitors introduce
> latency. Take LMKD as an example: it is actually driven by the
> PSI_POLL_PERIOD_XXX_MS after the first wakeup, which means
> PSI_WINDOW_SIZE_MS could be too big to rely on. IMHO, with regard to
> response time, LMKD is less efficient than the lmk driver but stronger
> on the strategy side. I would like to test this patch under real
> Android workloads and give feedback in the next version.

LMKD is a reactive mechanism which does not know when memory pressure
might rise, therefore its response latency matters.
The usecases you mentioned seemed to imply that userspace was aware of
increased memory demands of the process (app startup time, maybe the
moment the app becomes foreground, etc.). Therefore the userspace
could relax memory allowance before that memory is requested. Was my
understanding incorrect?

> >
> > >
> > > --
> > > Michal Hocko
> > > SUSE Labs
Michal Hocko April 1, 2022, 11:34 a.m. UTC | #8
On Fri 01-04-22 09:34:02, Zhaoyang Huang wrote:
> On Thu, Mar 31, 2022 at 7:35 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 31-03-22 19:18:58, Zhaoyang Huang wrote:
> > > On Thu, Mar 31, 2022 at 5:01 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 31-03-22 16:00:56, zhaoyang.huang wrote:
> > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > >
> > > > > For some kinds of memcg, usage varies greatly across scenarios. For example, a
> > > > > multimedia app could have a usage range from 50MB to 500MB, the expansion
> > > > > generated by loading a special algorithm into its virtual address space, which
> > > > > makes it hard to protect the expanded usage without userspace interaction.
> > > >
> > > > Do I get it correctly that the concern you have is that you do not know
> > > > how much memory your workload will need because that depends on some
> > > > parameters?
> > > Right. For example, a camera APP will expand its usage from 50MB to 500MB
> > > because of launching a special function (face beauty etc. needs a special
> > > algorithm).
> > > >
> > > > > Furthermore, a fixed
> > > > > memory.low works somewhat against its role of soft protection, as it responds
> > > > > to any system memory pressure in the same way.
> > > >
> > > > Could you be more specific about this as well?
> > > As in the camera case above, if we set memory.low to 200MB to keep the
> > > APP running smoothly, the system will experience high memory pressure when
> > > another high-load APP is launched simultaneously. I would like to have the
> > > camera reclaimed under this scenario.
> >
> > OK, so you effectively want to keep the memory protection when there is
> > "normal" memory pressure but want to relax the protection in other
> > high memory utilization situations?
> >
> > How exactly do you tell the difference between steady memory pressure
> > (say stream IO on the page cache) and a "high load APP launched" event? Should
> > you reduce the protection in the stream IO situation as well?
> We can take either the system's io_wait or PSI_IO into consideration for these cases.

I do not follow. Let's say you have a stream IO workload which is mostly
RO. Reclaiming those pages effectively means dropping them from the
cache, so there is no IO involved during the reclaim. This will generate
a constant flow of reclaim that shouldn't normally affect other
workloads (as long as kswapd keeps up with the IO pace). How does your
scheme cope with this scenario? My understanding is that it will simply
relax the protection.

> > [...]
> > > > One very important thing that I am missing here is the overall objective of
> > > > this tuning. From the above it seems that you want to (ab)use memory->low to
> > > > protect some portion of the charged memory and that the protection
> > > > shrinks over time depending on the global PSI metric and time.
> > > > But why is this a good thing?
> > > 'Good' means it meets my original goal of keeping the usage protected during
> > > a period of time while responding to the system's memory pressure. For an
> > > Android-like system, memory is almost always in a tight state no matter how
> > > much RAM it has. What we need from memcg is more than control and grouping;
> > > we need it to be more responsive to the system's load and able to sacrifice
> > > its usage under certain criteria.
> >
> > Why are existing tools/APIs insufficient for that? You can watch for
> > both global and memcg memory pressure including PSI metrics and update
> > limits dynamically. Why is it necessary to put such logic into the
> > kernel?
> A poll-then-react method in userspace requires a polling interval
> and response time. Take PSI as an example: it polls ten times during
> POLLING_INTERVAL while reporting just once, which introduces latency
> to some extent.

Do workload transitions happen so often in your situation that the
interval really matters? As Suren already pointed out, starting a new
application is usually an explicit event which can proactively update
limits.
Zhaoyang Huang April 2, 2022, 3:21 a.m. UTC | #9
On Fri, Apr 1, 2022 at 12:46 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Mar 31, 2022 at 6:51 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
> >
> > On Fri, Apr 1, 2022 at 3:26 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Thu, Mar 31, 2022 at 4:35 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 31-03-22 19:18:58, Zhaoyang Huang wrote:
> > > > > On Thu, Mar 31, 2022 at 5:01 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Thu 31-03-22 16:00:56, zhaoyang.huang wrote:
> > > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > > > >
> > > > > > > For some kinds of memcg, usage varies greatly across scenarios. For example, a
> > > > > > > multimedia app could have a usage range from 50MB to 500MB, the expansion
> > > > > > > generated by loading a special algorithm into its virtual address space, which
> > > > > > > makes it hard to protect the expanded usage without userspace interaction.
> > > > > >
> > > > > > Do I get it correctly that the concern you have is that you do not know
> > > > > > how much memory your workload will need because that depends on some
> > > > > > parameters?
> > > > > Right. For example, a camera APP will expand its usage from 50MB to 500MB
> > > > > because of launching a special function (face beauty etc. needs a special
> > > > > algorithm).
> > > > > >
> > > > > > > Furthermore, a fixed
> > > > > > > memory.low works somewhat against its role of soft protection, as it responds
> > > > > > > to any system memory pressure in the same way.
> > > > > >
> > > > > > Could you be more specific about this as well?
> > > > > As in the camera case above, if we set memory.low to 200MB to keep the
> > > > > APP running smoothly, the system will experience high memory pressure when
> > > > > another high-load APP is launched simultaneously. I would like to have the
> > > > > camera reclaimed under this scenario.
> > > >
> > > > OK, so you effectively want to keep the memory protection when there is
> > > > "normal" memory pressure but want to relax the protection in other
> > > > high memory utilization situations?
> > > >
> > > > How exactly do you tell the difference between steady memory pressure
> > > > (say stream IO on the page cache) and a "high load APP launched" event? Should
> > > > you reduce the protection in the stream IO situation as well?
> > >
> > > IIUC what you are implementing here is a "memory allowance boost"
> > > feature and it seems you are implementing it entirely inside the
> > > kernel, while only userspace knows when to apply this boost (say at
> > > app launch time). This does not make sense to me.
> > I am wondering if it could be more helpful to apply this patch to
> > background services (system_server etc.) rather than APPs, as those
> > services are persistent in the system.
> > >
> > > >
> > > > [...]
> > > > > > One very important thing that I am missing here is the overall objective of
> > > > > > this tuning. From the above it seems that you want to (ab)use memory->low to
> > > > > > protect some portion of the charged memory and that the protection
> > > > > > shrinks over time depending on the global PSI metric and time.
> > > > > > But why is this a good thing?
> > > > > 'Good' means it meets my original goal of keeping the usage protected during
> > > > > a period of time while responding to the system's memory pressure. For an
> > > > > Android-like system, memory is almost always in a tight state no matter how
> > > > > much RAM it has. What we need from memcg is more than control and grouping;
> > > > > we need it to be more responsive to the system's load and able to sacrifice
> > > > > its usage under certain criteria.
> > > >
> > > > Why are existing tools/APIs insufficient for that? You can watch for
> > > > both global and memcg memory pressure including PSI metrics and update
> > > > limits dynamically. Why is it necessary to put such logic into the
> > > > kernel?
> > >
> > > I had exactly the same thought while reading through this.
> > > In Android you would probably need to implement a userspace service
> > > which would temporarily relax the memcg limits when required, monitor
> > > PSI levels and adjust the limits accordingly.
> > As in my response to Michal's comment, userspace monitors introduce
> > latency. Take LMKD as an example: it is actually driven by the
> > PSI_POLL_PERIOD_XXX_MS after the first wakeup, which means
> > PSI_WINDOW_SIZE_MS could be too big to rely on. IMHO, with regard to
> > response time, LMKD is less efficient than the lmk driver but stronger
> > on the strategy side. I would like to test this patch under real
> > Android workloads and give feedback in the next version.
>
> LMKD is a reactive mechanism which does not know when memory pressure
> might rise, therefore its response latency matters.
> The usecases you mentioned seemed to imply that userspace was aware of
> increased memory demands of the process (app startup time, maybe the
> moment the app becomes foreground, etc.). Therefore the userspace
> could relax memory allowance before that memory is requested. Was my
> understanding incorrect?
In terms of Android-specific scenarios, I think it is not the best
choice to have AMS relax the memory allowance directly when an activity
arises, as:
1. AMS also has to face latency issues like LMKD does when polling the
system's PSI value.
2. AMS needs to issue syscalls to act: a read of either
memcg.usage/watermark and a write to memcg.min/low.
3. Processes at CACHED adj are not capable of shrinking themselves
except by waiting to be killed by LMKD.

IMO, this patch is like a Pressure-Aware Per-Process Reclaim that
does NOT launch real reclaiming.
>
> > >
> > > >
> > > > --
> > > > Michal Hocko
> > > > SUSE Labs
Zhaoyang Huang April 2, 2022, 5:18 a.m. UTC | #10
On Fri, Apr 1, 2022 at 7:34 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 01-04-22 09:34:02, Zhaoyang Huang wrote:
> > On Thu, Mar 31, 2022 at 7:35 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 31-03-22 19:18:58, Zhaoyang Huang wrote:
> > > > On Thu, Mar 31, 2022 at 5:01 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Thu 31-03-22 16:00:56, zhaoyang.huang wrote:
> > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > > >
> > > > > > For some kinds of memcg, usage varies greatly across scenarios. For example, a
> > > > > > multimedia app could have a usage range from 50MB to 500MB, the expansion
> > > > > > generated by loading a special algorithm into its virtual address space, which
> > > > > > makes it hard to protect the expanded usage without userspace interaction.
> > > > >
> > > > > Do I get it correctly that the concern you have is that you do not know
> > > > > how much memory your workload will need because that depends on some
> > > > > parameters?
> > > > Right. For example, a camera APP will expand its usage from 50MB to 500MB
> > > > because of launching a special function (face beauty etc. needs a special
> > > > algorithm).
> > > > >
> > > > > > Furthermore, a fixed
> > > > > > memory.low works somewhat against its role of soft protection, as it responds
> > > > > > to any system memory pressure in the same way.
> > > > >
> > > > > Could you be more specific about this as well?
> > > > As in the camera case above, if we set memory.low to 200MB to keep the
> > > > APP running smoothly, the system will experience high memory pressure when
> > > > another high-load APP is launched simultaneously. I would like to have the
> > > > camera reclaimed under this scenario.
> > >
> > > OK, so you effectively want to keep the memory protection when there is
> > > "normal" memory pressure but want to relax the protection in other
> > > high memory utilization situations?
> > >
> > > How exactly do you tell the difference between steady memory pressure
> > > (say stream IO on the page cache) and a "high load APP launched" event? Should
> > > you reduce the protection in the stream IO situation as well?
> > We can take either the system's io_wait or PSI_IO into consideration for these cases.
>
> I do not follow. Let's say you have a stream IO workload which is mostly
> RO. Reclaiming those pages means effectivelly to drop them from the
> cache so there is no IO involved during the reclaim. This will generate
> a constant flow of reclaim that shouldn't normally affect other
> workloads (as long as kswapd keeps up with the IO pace). How does your
> scheme cope with this scenario? My understanding is that it will simply
> relax the protection.
You are right. This scheme treats the system's memory pressure
equally, no matter whether it comes from high-order in-kernel page
allocations or from cache dropped by IO-like activity. The
decay_factor is composed of PSI_SOME and PSI_FULL, which indicate
that the system is tight on memory; every entity then has an
obligation to contribute to relieving it.
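As a rough illustration of that composition (a sketch only; the
function name, weights, and fixed-point scale below are placeholders,
not the patch's actual constants):

/*
 * Illustrative sketch: compose a decay factor from the global memory
 * PSI "some" and "full" averages, given as percentages. The weights
 * w_some/w_full are hypothetical tunables.
 */
#define FACTOR_SCALE 1024	/* fixed-point "1.0" */

static unsigned long compose_decay_factor(unsigned long some_pct,
					  unsigned long full_pct,
					  unsigned int w_some,
					  unsigned int w_full)
{
	unsigned long pressure = (some_pct * w_some + full_pct * w_full) / 100;

	if (pressure > 100)
		pressure = 100;
	/* higher pressure -> smaller factor -> weaker protection */
	return FACTOR_SCALE * (100 - pressure) / 100;
}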
>
> > > [...]
> > > > > One very important thing that I am missing here is the overall objective of this
> > > > > tuning. From the above it seems that you want to (ab)use memory->low to
> > > > > protect some portion of the charged memory and that the protection
> > > > > shrinks over time depending on the the global PSI metrict and time.
> > > > > But why this is a good thing?
> > > > 'Good' means it meets my original goal of keeping the usage during a
> > > > period of time and responding to the system's memory pressure. For an
> > > > android like system, memory is almost forever being in a tight status
> > > > no matter how many RAM it has. What we need from memcg is more than
> > > > control and grouping, we need it to be more responsive to the system's
> > > > load and could  sacrifice its usage  under certain criteria.
> > >
> > > Why existing tools/APIs are insufficient for that? You can watch for
> > > both global and memcg memory pressure including PSI metrics and update
> > > limits dynamically. Why is it necessary to put such a logic into the
> > > kernel?
> > Poll and then React method in userspace requires a polling interval
> > and response time. Take PSI as an example, it polls ten times during
> > POLLING_INTERVAL while just report once, which introduce latency in
> > some extend.
>
> Do workload transitions happen so often in your situation that the
> interval really matters? As Suren already pointed out starting a new
> application is usually an explicit event which can pro-activelly update
> limits.
Yes. As in my reply to Suren's comment, even a proactive monitor
service that becomes aware of the activity start (app launching, etc.)
at the very first moment still has to 1. read PSI and
memcg->watermark/usage, 2. make a decision, and 3. write
memcg->memory.low to adjust the memory allowance. Furthermore, a
monitor cannot supervise the app for its whole lifetime, while
reclaim can arise at any time.
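For contrast, a minimal userspace sketch of that three-step
poll-and-react cycle (the cgroup name "app" is an example; a real
agent would use PSI triggers rather than a fixed polling sleep):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char psi[256] = "";
	unsigned long long usage = 0;
	FILE *f;

	for (;;) {
		/* 1. read PSI and the group's current usage */
		f = fopen("/proc/pressure/memory", "r");
		if (f) { fgets(psi, sizeof(psi), f); fclose(f); }
		f = fopen("/sys/fs/cgroup/app/memory.current", "r");
		if (f) { fscanf(f, "%llu", &usage); fclose(f); }

		/* 2. make a decision (policy omitted) */

		/* 3. write the new allowance */
		f = fopen("/sys/fs/cgroup/app/memory.low", "w");
		if (f) { fprintf(f, "%llu\n", usage / 2); fclose(f); }

		sleep(1);	/* the polling interval is pure added latency */
	}
}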

> --
> Michal Hocko
> SUSE Labs
Suren Baghdasaryan April 3, 2022, 3:04 p.m. UTC | #11
On Fri, Apr 1, 2022 at 10:18 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>
> On Fri, Apr 1, 2022 at 7:34 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 01-04-22 09:34:02, Zhaoyang Huang wrote:
> > > On Thu, Mar 31, 2022 at 7:35 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 31-03-22 19:18:58, Zhaoyang Huang wrote:
> > > > > On Thu, Mar 31, 2022 at 5:01 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Thu 31-03-22 16:00:56, zhaoyang.huang wrote:
> > > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > > > >
> > > > > > > For some kind of memcg, the usage is varies greatly from scenarios. Such as
> > > > > > > multimedia app could have the usage range from 50MB to 500MB, which generated
> > > > > > > by loading an special algorithm into its virtual address space and make it hard
> > > > > > > to protect the expanded usage without userspace's interaction.
> > > > > >
> > > > > > Do I get it correctly that the concern you have is that you do not know
> > > > > > how much memory your workload will need because that depends on some
> > > > > > parameters?
> > > > > right. such as a camera APP will expand the usage from 50MB to 500MB
> > > > > because of launching a special function(face beauty etc need special
> > > > > algorithm)
> > > > > >
> > > > > > > Furthermore, fixed
> > > > > > > memory.low is a little bit against its role of soft protection as it will response
> > > > > > > any system's memory pressure in same way.
> > > > > >
> > > > > > Could you be more specific about this as well?
> > > > > As the camera case above, if we set memory.low as 200MB to keep the
> > > > > APP run smoothly, the system will experience high memory pressure when
> > > > > another high load APP launched simultaneously. I would like to have
> > > > > camera be reclaimed under this scenario.
> > > >
> > > > OK, so you effectivelly want to keep the memory protection when there is
> > > > a "normal" memory pressure but want to relax the protection on other
> > > > high memory utilization situations?
> > > >
> > > > How do you exactly tell a difference between a steady memory pressure
> > > > (say stream IO on the page cache) from "high load APP launched"? Should
> > > > you reduce the protection on the stram IO situation as well?
> > > We can take either system's io_wait or PSI_IO into consideration for these.
> >
> > I do not follow. Let's say you have a stream IO workload which is mostly
> > RO. Reclaiming those pages means effectivelly to drop them from the
> > cache so there is no IO involved during the reclaim. This will generate
> > a constant flow of reclaim that shouldn't normally affect other
> > workloads (as long as kswapd keeps up with the IO pace). How does your
> > scheme cope with this scenario? My understanding is that it will simply
> > relax the protection.
> You are right. This scheme treats the system's memory pressure
> equally, no matter if it comes from in-kernel page allocation with
> high order or cache drop by IO like things. The decay_factor composed
> of PSI_SOME and PSI_FULL which represent the system is tight on
> memory, every entity has the obligation to donate to solve this issue.
> >
> > > > [...]
> > > > > > One very important thing that I am missing here is the overall objective of this
> > > > > > tuning. From the above it seems that you want to (ab)use memory->low to
> > > > > > protect some portion of the charged memory and that the protection
> > > > > > shrinks over time depending on the the global PSI metrict and time.
> > > > > > But why this is a good thing?
> > > > > 'Good' means it meets my original goal of keeping the usage during a
> > > > > period of time and responding to the system's memory pressure. For an
> > > > > android like system, memory is almost forever being in a tight status
> > > > > no matter how many RAM it has. What we need from memcg is more than
> > > > > control and grouping, we need it to be more responsive to the system's
> > > > > load and could  sacrifice its usage  under certain criteria.
> > > >
> > > > Why existing tools/APIs are insufficient for that? You can watch for
> > > > both global and memcg memory pressure including PSI metrics and update
> > > > limits dynamically. Why is it necessary to put such a logic into the
> > > > kernel?
> > > Poll and then React method in userspace requires a polling interval
> > > and response time. Take PSI as an example, it polls ten times during
> > > POLLING_INTERVAL while just report once, which introduce latency in
> > > some extend.
> >
> > Do workload transitions happen so often in your situation that the
> > interval really matters? As Suren already pointed out starting a new
> > application is usually an explicit event which can pro-activelly update
> > limits.
> Yes. As my reply to Suren's comment, even a positive monitor service
> which could be aware of the activity starting(APP launching etc) at
> the very first time, has to 1. read PSI and memcg->watermark/usage 2.
> make a decision. 3. write memcg->memory.low to adjust memory
> allowance. Furthermore, monitors could not supervise the APP for whole
> life time, while the reclaiming could arise at any time.

Ok, sounds like you want this dynamic limit to be active all the time,
not only at specific points in the process's life cycle.
One thing that I don't understand in this approach is: why memory.low
should depend on the system's memory pressure. It seems you want to
allow a process to allocate more when memory pressure is high. That is
very counter-intuitive to me. Could you please explain the underlying
logic of why this is the right thing to do, without going into
technical details?

>
> > --
> > Michal Hocko
> > SUSE Labs
Zhaoyang Huang April 4, 2022, 2:33 a.m. UTC | #12
On Sun, Apr 3, 2022 at 11:04 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Apr 1, 2022 at 10:18 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
> >
> > On Fri, Apr 1, 2022 at 7:34 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 01-04-22 09:34:02, Zhaoyang Huang wrote:
> > > > On Thu, Mar 31, 2022 at 7:35 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Thu 31-03-22 19:18:58, Zhaoyang Huang wrote:
> > > > > > On Thu, Mar 31, 2022 at 5:01 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Thu 31-03-22 16:00:56, zhaoyang.huang wrote:
> > > > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > > > > >
> > > > > > > > For some kind of memcg, the usage is varies greatly from scenarios. Such as
> > > > > > > > multimedia app could have the usage range from 50MB to 500MB, which generated
> > > > > > > > by loading an special algorithm into its virtual address space and make it hard
> > > > > > > > to protect the expanded usage without userspace's interaction.
> > > > > > >
> > > > > > > Do I get it correctly that the concern you have is that you do not know
> > > > > > > how much memory your workload will need because that depends on some
> > > > > > > parameters?
> > > > > > right. such as a camera APP will expand the usage from 50MB to 500MB
> > > > > > because of launching a special function(face beauty etc need special
> > > > > > algorithm)
> > > > > > >
> > > > > > > > Furthermore, fixed
> > > > > > > > memory.low is a little bit against its role of soft protection as it will response
> > > > > > > > any system's memory pressure in same way.
> > > > > > >
> > > > > > > Could you be more specific about this as well?
> > > > > > As the camera case above, if we set memory.low as 200MB to keep the
> > > > > > APP run smoothly, the system will experience high memory pressure when
> > > > > > another high load APP launched simultaneously. I would like to have
> > > > > > camera be reclaimed under this scenario.
> > > > >
> > > > > OK, so you effectivelly want to keep the memory protection when there is
> > > > > a "normal" memory pressure but want to relax the protection on other
> > > > > high memory utilization situations?
> > > > >
> > > > > How do you exactly tell a difference between a steady memory pressure
> > > > > (say stream IO on the page cache) from "high load APP launched"? Should
> > > > > you reduce the protection on the stram IO situation as well?
> > > > We can take either system's io_wait or PSI_IO into consideration for these.
> > >
> > > I do not follow. Let's say you have a stream IO workload which is mostly
> > > RO. Reclaiming those pages means effectivelly to drop them from the
> > > cache so there is no IO involved during the reclaim. This will generate
> > > a constant flow of reclaim that shouldn't normally affect other
> > > workloads (as long as kswapd keeps up with the IO pace). How does your
> > > scheme cope with this scenario? My understanding is that it will simply
> > > relax the protection.
> > You are right. This scheme treats the system's memory pressure
> > equally, no matter if it comes from in-kernel page allocation with
> > high order or cache drop by IO like things. The decay_factor composed
> > of PSI_SOME and PSI_FULL which represent the system is tight on
> > memory, every entity has the obligation to donate to solve this issue.
> > >
> > > > > [...]
> > > > > > > One very important thing that I am missing here is the overall objective of this
> > > > > > > tuning. From the above it seems that you want to (ab)use memory->low to
> > > > > > > protect some portion of the charged memory and that the protection
> > > > > > > shrinks over time depending on the the global PSI metrict and time.
> > > > > > > But why this is a good thing?
> > > > > > 'Good' means it meets my original goal of keeping the usage during a
> > > > > > period of time and responding to the system's memory pressure. For an
> > > > > > android like system, memory is almost forever being in a tight status
> > > > > > no matter how many RAM it has. What we need from memcg is more than
> > > > > > control and grouping, we need it to be more responsive to the system's
> > > > > > load and could  sacrifice its usage  under certain criteria.
> > > > >
> > > > > Why existing tools/APIs are insufficient for that? You can watch for
> > > > > both global and memcg memory pressure including PSI metrics and update
> > > > > limits dynamically. Why is it necessary to put such a logic into the
> > > > > kernel?
> > > > Poll and then React method in userspace requires a polling interval
> > > > and response time. Take PSI as an example, it polls ten times during
> > > > POLLING_INTERVAL while just report once, which introduce latency in
> > > > some extend.
> > >
> > > Do workload transitions happen so often in your situation that the
> > > interval really matters? As Suren already pointed out starting a new
> > > application is usually an explicit event which can pro-activelly update
> > > limits.
> > Yes. As my reply to Suren's comment, even a positive monitor service
> > which could be aware of the activity starting(APP launching etc) at
> > the very first time, has to 1. read PSI and memcg->watermark/usage 2.
> > make a decision. 3. write memcg->memory.low to adjust memory
> > allowance. Furthermore, monitors could not supervise the APP for whole
> > life time, while the reclaiming could arise at any time.
>
> Ok, sounds like you want this dynamic limit to be active all the time,
> not only at specific points in the process's life cycle.
Not sure yet. I think it would be better to cooperate with an
AMS-like service that can distinguish scenarios, such as placing
foreground/background processes into different memcgs with different
decay configurations, OR dynamically setting up the parameters of a
specific memcg.
> One thing that I don't understand in this approach is: why memory.low
> should depend on the system's memory pressure. It seems you want to
> allow a process to allocate more when memory pressure is high. That is
> very counter-intuitive to me. Could you please explain the underlying
> logic of why this is the right thing to do, without going into
> technical details?
What I want to achieve is to make memory.low positively correlated
with elapsed time and negatively correlated with memory pressure,
which means the protected memcg should lower its protection (via a
lower memcg.low) to help with the system's memory pressure when it is
high. The concept behind this is that the memcg's cost of faulting
back dropped memory matters less than the system's latency under high
memory pressure. Please refer to the test data in my new version for
more detail.
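A minimal sketch of that relation, with every name hypothetical: the
effective low is a usage-tracking watermark scaled down by a factor
that shrinks as global pressure rises.

/*
 * Sketch only. decayed_wm follows the group's usage history over
 * time; factor is 0..FACTOR_SCALE and decreases with memory PSI.
 */
#define FACTOR_SCALE 1024

static unsigned long effective_low(unsigned long decayed_wm,
				   unsigned long factor)
{
	/* high pressure -> small factor -> less protection */
	return decayed_wm * factor / FACTOR_SCALE;
}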
>
> >
> > > --
> > > Michal Hocko
> > > SUSE Labs
Michal Hocko April 4, 2022, 8:51 a.m. UTC | #13
On Mon 04-04-22 10:33:58, Zhaoyang Huang wrote:
[...]
> > One thing that I don't understand in this approach is: why memory.low
> > should depend on the system's memory pressure. It seems you want to
> > allow a process to allocate more when memory pressure is high. That is
> > very counter-intuitive to me. Could you please explain the underlying
> > logic of why this is the right thing to do, without going into
> > technical details?
> What I want to achieve is make memory.low be positive correlation with
> timing and negative to memory pressure, which means the protected
> memcg should lower its protection(via lower memcg.low) for helping
> system's memory pressure when it's high.

I have to say this is still very confusing to me. The low limit is a
protection against external (e.g. global) memory pressure. Decreasing
the protection based on the external pressure sounds like it goes right
against the purpose of the knob. I can see reasons to update protection
based on refaults or other metrics from the userspace but I still do not
see how this is a good auto-magic tuning done by the kernel.

> The concept behind is memcg's
> fault back of dropped memory is less important than system's latency
> on high memory pressure.

Can you give some specific examples?

> Please refer to my new version's test data
> for more detail.

Please note that sending new RFCs will just make the discussion spread
over several email threads which will get increasingly hard to follow.
So do not post another version until it is really clear what is the
actual semantic you are proposing.
Zhaoyang Huang April 4, 2022, 9:07 a.m. UTC | #14
On Mon, Apr 4, 2022 at 4:51 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 04-04-22 10:33:58, Zhaoyang Huang wrote:
> [...]
> > > One thing that I don't understand in this approach is: why memory.low
> > > should depend on the system's memory pressure. It seems you want to
> > > allow a process to allocate more when memory pressure is high. That is
> > > very counter-intuitive to me. Could you please explain the underlying
> > > logic of why this is the right thing to do, without going into
> > > technical details?
> > What I want to achieve is make memory.low be positive correlation with
> > timing and negative to memory pressure, which means the protected
> > memcg should lower its protection(via lower memcg.low) for helping
> > system's memory pressure when it's high.
>
> I have to say this is still very confusing to me. The low limit is a
> protection against external (e.g. global) memory pressure. Decreasing
> the protection based on the external pressure sounds like it goes right
> against the purpose of the knob. I can see reasons to update protection
> based on refaults or other metrics from the userspace but I still do not
> see how this is a good auto-magic tuning done by the kernel.
>
> > The concept behind is memcg's
> > fault back of dropped memory is less important than system's latency
> > on high memory pressure.
>
> Can you give some specific examples?
For both of the above comments, please refer to the latest test
results in the Patchv2 I have sent. I prefer to describe my change as
a focus transfer under pressure: the protected memcg is the focus
while the system's memory pressure is low, when reclaim comes from
root, which is not against the current design. However, when global
memory pressure is high, the focus has to shift to the whole system,
because it makes no sense to exempt the protected memcg from what
everybody else faces; the memcg cannot do anything useful anyway when
the system is trapped in the kernel doing reclaim work.
>
> > Please refer to my new version's test data
> > for more detail.
>
> Please note that sending new RFCs will just make the discussion spread
> over several email threads which will get increasingly hard to follow.
> So do not post another version until it is really clear what is the
> actual semantic you are proposing.
OK, I will hold off until all questions are settled.
>
> --
> Michal Hocko
> SUSE Labs
Zhaoyang Huang April 4, 2022, 9:23 a.m. UTC | #15
On Mon, Apr 4, 2022 at 5:07 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>
> On Mon, Apr 4, 2022 at 4:51 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 04-04-22 10:33:58, Zhaoyang Huang wrote:
> > [...]
> > > > One thing that I don't understand in this approach is: why memory.low
> > > > should depend on the system's memory pressure. It seems you want to
> > > > allow a process to allocate more when memory pressure is high. That is
> > > > very counter-intuitive to me. Could you please explain the underlying
> > > > logic of why this is the right thing to do, without going into
> > > > technical details?
> > > What I want to achieve is make memory.low be positive correlation with
> > > timing and negative to memory pressure, which means the protected
> > > memcg should lower its protection(via lower memcg.low) for helping
> > > system's memory pressure when it's high.
> >
> > I have to say this is still very confusing to me. The low limit is a
> > protection against external (e.g. global) memory pressure. Decreasing
> > the protection based on the external pressure sounds like it goes right
> > against the purpose of the knob. I can see reasons to update protection
> > based on refaults or other metrics from the userspace but I still do not
> > see how this is a good auto-magic tuning done by the kernel.
> >
> > > The concept behind is memcg's
> > > fault back of dropped memory is less important than system's latency
> > > on high memory pressure.
> >
> > Can you give some specific examples?
> For both of the above two comments, please refer to the latest test
> result in Patchv2 I have sent. I prefer to name my change as focus
> transfer under pressure as protected memcg is the focus when system's
> memory pressure is low which will reclaim from root, this is not
> against current design. However, when global memory pressure is high,
> then the focus has to be changed to the whole system, because it
> doesn't make sense to let the protected memcg out of everybody, it
> can't
> do anything when the system is trapped in the kernel with reclaiming work.
Does it make more sense if I describe the change as: the memcg will
be protected as long as system pressure stays under the threshold
(partially coherent with the current design), and the memcg will be
sacrificed once pressure goes over the threshold (the added change)?
> >
> > > Please refer to my new version's test data
> > > for more detail.
> >
> > Please note that sending new RFCs will just make the discussion spread
> > over several email threads which will get increasingly hard to follow.
> > So do not post another version until it is really clear what is the
> > actual semantic you are proposing.
> ok, I will hold until all question done.
> >
> > --
> > Michal Hocko
> > SUSE Labs
Michal Hocko April 4, 2022, 9:32 a.m. UTC | #16
On Mon 04-04-22 17:23:43, Zhaoyang Huang wrote:
> On Mon, Apr 4, 2022 at 5:07 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
> >
> > On Mon, Apr 4, 2022 at 4:51 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 04-04-22 10:33:58, Zhaoyang Huang wrote:
> > > [...]
> > > > > One thing that I don't understand in this approach is: why memory.low
> > > > > should depend on the system's memory pressure. It seems you want to
> > > > > allow a process to allocate more when memory pressure is high. That is
> > > > > very counter-intuitive to me. Could you please explain the underlying
> > > > > logic of why this is the right thing to do, without going into
> > > > > technical details?
> > > > What I want to achieve is make memory.low be positive correlation with
> > > > timing and negative to memory pressure, which means the protected
> > > > memcg should lower its protection(via lower memcg.low) for helping
> > > > system's memory pressure when it's high.
> > >
> > > I have to say this is still very confusing to me. The low limit is a
> > > protection against external (e.g. global) memory pressure. Decreasing
> > > the protection based on the external pressure sounds like it goes right
> > > against the purpose of the knob. I can see reasons to update protection
> > > based on refaults or other metrics from the userspace but I still do not
> > > see how this is a good auto-magic tuning done by the kernel.
> > >
> > > > The concept behind is memcg's
> > > > fault back of dropped memory is less important than system's latency
> > > > on high memory pressure.
> > >
> > > Can you give some specific examples?
> > For both of the above two comments, please refer to the latest test
> > result in Patchv2 I have sent. I prefer to name my change as focus
> > transfer under pressure as protected memcg is the focus when system's
> > memory pressure is low which will reclaim from root, this is not
> > against current design. However, when global memory pressure is high,
> > then the focus has to be changed to the whole system, because it
> > doesn't make sense to let the protected memcg out of everybody, it
> > can't
> > do anything when the system is trapped in the kernel with reclaiming work.
> Does it make more sense if I describe the change as memcg will be
> protect long as system pressure is under the threshold(partially
> coherent with current design) and will sacrifice the memcg if pressure
> is over the threshold(added change)

No, not really. For one, it is still really unclear why there should
be any difference in semantics between global and external memory
pressure in general. The low limit is always a protection from
external pressure. And what should the actual threshold be? The
amount of reclaim performed, the effectiveness of the reclaim, or
what?
Michal Hocko April 4, 2022, 9:36 a.m. UTC | #17
On Mon 04-04-22 11:32:28, Michal Hocko wrote:
> On Mon 04-04-22 17:23:43, Zhaoyang Huang wrote:
> > On Mon, Apr 4, 2022 at 5:07 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
> > >
> > > On Mon, Apr 4, 2022 at 4:51 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 04-04-22 10:33:58, Zhaoyang Huang wrote:
> > > > [...]
> > > > > > One thing that I don't understand in this approach is: why memory.low
> > > > > > should depend on the system's memory pressure. It seems you want to
> > > > > > allow a process to allocate more when memory pressure is high. That is
> > > > > > very counter-intuitive to me. Could you please explain the underlying
> > > > > > logic of why this is the right thing to do, without going into
> > > > > > technical details?
> > > > > What I want to achieve is make memory.low be positive correlation with
> > > > > timing and negative to memory pressure, which means the protected
> > > > > memcg should lower its protection(via lower memcg.low) for helping
> > > > > system's memory pressure when it's high.
> > > >
> > > > I have to say this is still very confusing to me. The low limit is a
> > > > protection against external (e.g. global) memory pressure. Decreasing
> > > > the protection based on the external pressure sounds like it goes right
> > > > against the purpose of the knob. I can see reasons to update protection
> > > > based on refaults or other metrics from the userspace but I still do not
> > > > see how this is a good auto-magic tuning done by the kernel.
> > > >
> > > > > The concept behind is memcg's
> > > > > fault back of dropped memory is less important than system's latency
> > > > > on high memory pressure.
> > > >
> > > > Can you give some specific examples?
> > > For both of the above two comments, please refer to the latest test
> > > result in Patchv2 I have sent. I prefer to name my change as focus
> > > transfer under pressure as protected memcg is the focus when system's
> > > memory pressure is low which will reclaim from root, this is not
> > > against current design. However, when global memory pressure is high,
> > > then the focus has to be changed to the whole system, because it
> > > doesn't make sense to let the protected memcg out of everybody, it
> > > can't
> > > do anything when the system is trapped in the kernel with reclaiming work.
> > Does it make more sense if I describe the change as memcg will be
> > protect long as system pressure is under the threshold(partially
> > coherent with current design) and will sacrifice the memcg if pressure
> > is over the threshold(added change)
> 
> No, not really. For one it is still really unclear why there should be any
> difference in the semantic between global and external memory pressure
> in general. The low limit is always a protection from the external
> pressure. And what should be the actual threshold? Amount of the reclaim
> performed, effectivness of the reclaim or what?

Btw. you might want to have a look at http://lkml.kernel.org/r/20220331084151.2600229-1-yosryahmed@google.com
where a new interface to allow pro-active memory reclaim is discussed.
I think that this might turn out to be a better fit than an automagic
kernel manipulation with a low limit. It will require a user agent to
drive the reclaim though.
Zhaoyang Huang April 4, 2022, 11:23 a.m. UTC | #18
On Mon, Apr 4, 2022 at 5:32 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 04-04-22 17:23:43, Zhaoyang Huang wrote:
> > On Mon, Apr 4, 2022 at 5:07 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
> > >
> > > On Mon, Apr 4, 2022 at 4:51 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 04-04-22 10:33:58, Zhaoyang Huang wrote:
> > > > [...]
> > > > > > One thing that I don't understand in this approach is: why memory.low
> > > > > > should depend on the system's memory pressure. It seems you want to
> > > > > > allow a process to allocate more when memory pressure is high. That is
> > > > > > very counter-intuitive to me. Could you please explain the underlying
> > > > > > logic of why this is the right thing to do, without going into
> > > > > > technical details?
> > > > > What I want to achieve is make memory.low be positive correlation with
> > > > > timing and negative to memory pressure, which means the protected
> > > > > memcg should lower its protection(via lower memcg.low) for helping
> > > > > system's memory pressure when it's high.
> > > >
> > > > I have to say this is still very confusing to me. The low limit is a
> > > > protection against external (e.g. global) memory pressure. Decreasing
> > > > the protection based on the external pressure sounds like it goes right
> > > > against the purpose of the knob. I can see reasons to update protection
> > > > based on refaults or other metrics from the userspace but I still do not
> > > > see how this is a good auto-magic tuning done by the kernel.
> > > >
> > > > > The concept behind is memcg's
> > > > > fault back of dropped memory is less important than system's latency
> > > > > on high memory pressure.
> > > >
> > > > Can you give some specific examples?
> > > For both of the above two comments, please refer to the latest test
> > > result in Patchv2 I have sent. I prefer to name my change as focus
> > > transfer under pressure as protected memcg is the focus when system's
> > > memory pressure is low which will reclaim from root, this is not
> > > against current design. However, when global memory pressure is high,
> > > then the focus has to be changed to the whole system, because it
> > > doesn't make sense to let the protected memcg out of everybody, it
> > > can't
> > > do anything when the system is trapped in the kernel with reclaiming work.
> > Does it make more sense if I describe the change as memcg will be
> > protect long as system pressure is under the threshold(partially
> > coherent with current design) and will sacrifice the memcg if pressure
> > is over the threshold(added change)
>
> No, not really. For one it is still really unclear why there should be any
> difference in the semantic between global and external memory pressure
> in general. The low limit is always a protection from the external
> pressure. And what should be the actual threshold? Amount of the reclaim
> performed, effectivness of the reclaim or what?
Please see below for the test result, which shows that the current
design keeps more effective protection when system memory pressure is
high. It could be argued that, with this patch, the protected memcg
loses its protection as its usage drops too much; I would say that
this is exactly the goal of the change. Is it reasonable to let the
whole system stay trapped in memory pressure while the memcg holds on
to its memory? With regard to the threshold, it is a dynamically
decayed watermark value which represents both the historic usage (the
watermark) and the present usage (it is updated to the new usage if
the group expands again). Actually, I have updated the code to make
this an opt-in type of memcg. The patch stays coherent with the
original design if the user wants to set a fixed value by default,
and it also provides a new kind of dynamically protected memcg that
needs no external monitor or interaction.
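A sketch of the watermark bookkeeping just described (the struct,
field names, and shift-based decay are assumptions for illustration,
not the patch's code):

/*
 * Illustrative only: the watermark decays as time elapses but is
 * bumped back up whenever the group's usage expands past it again,
 * so it reflects both historic and present usage.
 */
struct dyn_wm {
	unsigned long wm;	/* decayed watermark */
	unsigned int dec_shift;	/* decay rate, hypothetical */
};

static void update_watermark(struct dyn_wm *d, unsigned long usage,
			     unsigned long elapsed)
{
	unsigned long decay = elapsed >> d->dec_shift;

	d->wm = (d->wm > decay) ? d->wm - decay : 0;
	if (usage > d->wm)	/* the "present" part: follow a new expansion */
		d->wm = usage;
}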

We tested the above change against the current design on a v5.4
based system with 3GB RAM using the steps below, from which we can
see that a fixed memory.low makes the system experience high memory
pressure while holding too much memory.

1. Set up the topology separately as in [1].
2. Place a memory-consuming process into B and have it consume 1GB of
memory from userspace.
3. Generate global memory pressure by mlocking 1GB of memory.
4. Watch B's memory.current and PSI_MEM.
5. Repeat steps 3 and 4 twice.

[1]. settings: fixed low=500MB; low=600MB; wm_decay_factor=36 (decays 1/2 in 68s)
      A(low=500MB)
     /
    B(low=500MB)

What we observed:

             Mlock 1GB                Mlock 2GB              stable
             (PSI_MEM, usage)         (PSI_MEM, usage)       (PSI_MEM, usage)
low=600MB    s=23 f=17 u=720/600MB    s=91 f=48 u=202MB      s=68 f=32 u=106MB
low=500MB    s=22 f=13 u=660/530MB    s=88 f=50 u=156MB      s=30 f=20 u=120MB
patch        s=23 f=12 u=692/470MB    s=40 f=23 u=67MB       s=21 f=18 u=45MB

> --
> Michal Hocko
> SUSE Labs
Zhaoyang Huang April 4, 2022, 11:35 a.m. UTC | #19
On Mon, Apr 4, 2022 at 5:36 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 04-04-22 11:32:28, Michal Hocko wrote:
> > On Mon 04-04-22 17:23:43, Zhaoyang Huang wrote:
> > > On Mon, Apr 4, 2022 at 5:07 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
> > > >
> > > > On Mon, Apr 4, 2022 at 4:51 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 04-04-22 10:33:58, Zhaoyang Huang wrote:
> > > > > [...]
> > > > > > > One thing that I don't understand in this approach is: why memory.low
> > > > > > > should depend on the system's memory pressure. It seems you want to
> > > > > > > allow a process to allocate more when memory pressure is high. That is
> > > > > > > very counter-intuitive to me. Could you please explain the underlying
> > > > > > > logic of why this is the right thing to do, without going into
> > > > > > > technical details?
> > > > > > What I want to achieve is make memory.low be positive correlation with
> > > > > > timing and negative to memory pressure, which means the protected
> > > > > > memcg should lower its protection(via lower memcg.low) for helping
> > > > > > system's memory pressure when it's high.
> > > > >
> > > > > I have to say this is still very confusing to me. The low limit is a
> > > > > protection against external (e.g. global) memory pressure. Decreasing
> > > > > the protection based on the external pressure sounds like it goes right
> > > > > against the purpose of the knob. I can see reasons to update protection
> > > > > based on refaults or other metrics from the userspace but I still do not
> > > > > see how this is a good auto-magic tuning done by the kernel.
> > > > >
> > > > > > The concept behind is memcg's
> > > > > > fault back of dropped memory is less important than system's latency
> > > > > > on high memory pressure.
> > > > >
> > > > > Can you give some specific examples?
> > > > For both of the above two comments, please refer to the latest test
> > > > result in Patchv2 I have sent. I prefer to name my change as focus
> > > > transfer under pressure as protected memcg is the focus when system's
> > > > memory pressure is low which will reclaim from root, this is not
> > > > against current design. However, when global memory pressure is high,
> > > > then the focus has to be changed to the whole system, because it
> > > > doesn't make sense to let the protected memcg out of everybody, it
> > > > can't
> > > > do anything when the system is trapped in the kernel with reclaiming work.
> > > Does it make more sense if I describe the change as memcg will be
> > > protect long as system pressure is under the threshold(partially
> > > coherent with current design) and will sacrifice the memcg if pressure
> > > is over the threshold(added change)
> >
> > No, not really. For one it is still really unclear why there should be any
> > difference in the semantic between global and external memory pressure
> > in general. The low limit is always a protection from the external
> > pressure. And what should be the actual threshold? Amount of the reclaim
> > performed, effectivness of the reclaim or what?
>
> Btw. you might want to have a look at http://lkml.kernel.org/r/20220331084151.2600229-1-yosryahmed@google.com
> where a new interface to allow pro-active memory reclaim is discussed.
> I think that this might turn out to be a better fit then an automagic
> kernel manipulation with a low limit. It will require a user agent to
> drive the reclaim though.
OK. But AFAIK, several methods of this kind already run as
out-of-tree code, such as PPR on Android. As I replied to Suren, such
a scheme always has a latency issue, since the agent must poll for
the event, read the current status, and then write to launch the
action. This patch aims to solve part of those issues.
> --
> Michal Hocko
> SUSE Labs
Michal Hocko April 4, 2022, 12:29 p.m. UTC | #20
On Mon 04-04-22 19:23:03, Zhaoyang Huang wrote:
> On Mon, Apr 4, 2022 at 5:32 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 04-04-22 17:23:43, Zhaoyang Huang wrote:
> > > On Mon, Apr 4, 2022 at 5:07 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
> > > >
> > > > On Mon, Apr 4, 2022 at 4:51 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 04-04-22 10:33:58, Zhaoyang Huang wrote:
> > > > > [...]
> > > > > > > One thing that I don't understand in this approach is: why memory.low
> > > > > > > should depend on the system's memory pressure. It seems you want to
> > > > > > > allow a process to allocate more when memory pressure is high. That is
> > > > > > > very counter-intuitive to me. Could you please explain the underlying
> > > > > > > logic of why this is the right thing to do, without going into
> > > > > > > technical details?
> > > > > > What I want to achieve is make memory.low be positive correlation with
> > > > > > timing and negative to memory pressure, which means the protected
> > > > > > memcg should lower its protection(via lower memcg.low) for helping
> > > > > > system's memory pressure when it's high.
> > > > >
> > > > > I have to say this is still very confusing to me. The low limit is a
> > > > > protection against external (e.g. global) memory pressure. Decreasing
> > > > > the protection based on the external pressure sounds like it goes right
> > > > > against the purpose of the knob. I can see reasons to update protection
> > > > > based on refaults or other metrics from the userspace but I still do not
> > > > > see how this is a good auto-magic tuning done by the kernel.
> > > > >
> > > > > > The concept behind is memcg's
> > > > > > fault back of dropped memory is less important than system's latency
> > > > > > on high memory pressure.
> > > > >
> > > > > Can you give some specific examples?
> > > > For both of the above two comments, please refer to the latest test
> > > > result in Patchv2 I have sent. I prefer to name my change as focus
> > > > transfer under pressure as protected memcg is the focus when system's
> > > > memory pressure is low which will reclaim from root, this is not
> > > > against current design. However, when global memory pressure is high,
> > > > then the focus has to be changed to the whole system, because it
> > > > doesn't make sense to let the protected memcg out of everybody, it
> > > > can't
> > > > do anything when the system is trapped in the kernel with reclaiming work.
> > > Does it make more sense if I describe the change as memcg will be
> > > protect long as system pressure is under the threshold(partially
> > > coherent with current design) and will sacrifice the memcg if pressure
> > > is over the threshold(added change)
> >
> > No, not really. For one it is still really unclear why there should be any
> > difference in the semantic between global and external memory pressure
> > in general. The low limit is always a protection from the external
> > pressure. And what should be the actual threshold? Amount of the reclaim
> > performed, effectivness of the reclaim or what?
> Please find bellowing for the test result, which shows current design
> has more effective protection when system memory pressure is high. It
> could be argued that the protected memcg lost the protection as its
> usage dropped too much.

Yes, this is exactly how I do see it. The memory low/min is a
clear decision of the administrator to protect the memory consumption of
the said memcg (or hierarchy) from external memory pressure.

> I would like to say that this is just the goal
> of the change. Is it reasonable to let the whole system be trapped in
> memory pressure while the memcg holds the memory?

I would argue that this is expected and reasonable indeed. You
cannot provide protection without pushing the pressure onto others.
Memory is a finite resource.

> With regard to
> threshold, it is a dynamic decayed watermark value which represents
> the historic(watermark) and present(update to new usage if it expands
> again) usage. Actually, I have update the code by adding opt-in code
> which means this is a opt type of the memcg. This patch is coherent to
> the original design if user want to set the fixed value by default and
> also provide a new way of dynamic protected memcg without external
> monitor and interactivation.

The more I read here the more I am convinced that hooking into
low/min limits is simply wrong. If you want to achieve a more
"clever" way to balance memory reclaim among existing memcgs then
fine, but trying to achieve that by dynamically reinterpreting low
limits is just an abuse of an existing interface IMO. What has led
you to (ab)use the low limit in the first place?

> We simply test above change by comparing it with current design on a v5.4 based
> system in 3GB RAM in bellowing steps, via which we can find that fixed
> memory.low have the system experience high memory pressure with holding too
> much memory.
> 
> 1. setting up the topology seperatly as [1]
> 2. place a memory cost process into B and have it consume 1GB memory
> from userspace.
> 3. generating global memory pressure via mlock 1GB memory.
> 4. watching B's memory.current and PSI_MEM.
> 5. repeat 3,4 twice.

This is effectively an OOM test, isn't it? Memory low protection
will be enforced as long as there is something reclaimable, but your
memory pressure is unreclaimable due to mlock, which holds a stronger
guarantee than the low limit, so the protected memcg is going to be
reclaimed.

Maybe I am just not following but this makes less and less sense as I am
reading through. So either I am missing something really significant or
we are just not on the same page.
Zhaoyang Huang April 4, 2022, 1:14 p.m. UTC | #21
On Mon, Apr 4, 2022 at 8:30 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 04-04-22 19:23:03, Zhaoyang Huang wrote:
> > On Mon, Apr 4, 2022 at 5:32 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 04-04-22 17:23:43, Zhaoyang Huang wrote:
> > > > On Mon, Apr 4, 2022 at 5:07 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
> > > > >
> > > > > On Mon, Apr 4, 2022 at 4:51 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Mon 04-04-22 10:33:58, Zhaoyang Huang wrote:
> > > > > > [...]
> > > > > > > > One thing that I don't understand in this approach is: why memory.low
> > > > > > > > should depend on the system's memory pressure. It seems you want to
> > > > > > > > allow a process to allocate more when memory pressure is high. That is
> > > > > > > > very counter-intuitive to me. Could you please explain the underlying
> > > > > > > > logic of why this is the right thing to do, without going into
> > > > > > > > technical details?
> > > > > > > What I want to achieve is make memory.low be positive correlation with
> > > > > > > timing and negative to memory pressure, which means the protected
> > > > > > > memcg should lower its protection(via lower memcg.low) for helping
> > > > > > > system's memory pressure when it's high.
> > > > > >
> > > > > > I have to say this is still very confusing to me. The low limit is a
> > > > > > protection against external (e.g. global) memory pressure. Decreasing
> > > > > > the protection based on the external pressure sounds like it goes right
> > > > > > against the purpose of the knob. I can see reasons to update protection
> > > > > > based on refaults or other metrics from the userspace but I still do not
> > > > > > see how this is a good auto-magic tuning done by the kernel.
> > > > > >
> > > > > > > The concept behind is memcg's
> > > > > > > fault back of dropped memory is less important than system's latency
> > > > > > > on high memory pressure.
> > > > > >
> > > > > > Can you give some specific examples?
> > > > > For both of the above two comments, please refer to the latest test
> > > > > result in Patchv2 I have sent. I prefer to name my change as focus
> > > > > transfer under pressure as protected memcg is the focus when system's
> > > > > memory pressure is low which will reclaim from root, this is not
> > > > > against current design. However, when global memory pressure is high,
> > > > > then the focus has to be changed to the whole system, because it
> > > > > doesn't make sense to let the protected memcg out of everybody, it
> > > > > can't
> > > > > do anything when the system is trapped in the kernel with reclaiming work.
> > > > Does it make more sense if I describe the change as memcg will be
> > > > protect long as system pressure is under the threshold(partially
> > > > coherent with current design) and will sacrifice the memcg if pressure
> > > > is over the threshold(added change)
> > >
> > > No, not really. For one it is still really unclear why there should be any
> > > difference in the semantic between global and external memory pressure
> > > in general. The low limit is always a protection from the external
> > > pressure. And what should be the actual threshold? Amount of the reclaim
> > > performed, effectivness of the reclaim or what?
> > Please find bellowing for the test result, which shows current design
> > has more effective protection when system memory pressure is high. It
> > could be argued that the protected memcg lost the protection as its
> > usage dropped too much.
>
> Yes, this is exactly how I do see it. The memory low/min is a
> clear decision of the administrator to protect the memory consumption of
> the said memcg (or hierarchy) from external memory pressure.
Please note that this patch DOES protect the memcg when the external
pressure is 1GB, just as a fixed low does. Besides, how does the
admin decide the exact low/min value when the usage expands from a
small amount to several GB in a quickly changing scenario?
>
> > I would like to say that this is just the goal
> > of the change. Is it reasonable to let the whole system be trapped in
> > memory pressure while the memcg holds the memory?
>
> I would argue that this is expected and reasonable indeed. You cannot
> provide a protection withtout pushing the pressure to others. The memory
> is a finite resource.
Imagine you are playing a game on a 3GB mobile phone while a video
call is connected. Either the call's view freezes because the game
holds an improper low value, or the game gets killed by the OOM
killer. IMHO, we could let the user have the choice to configure the
memcg with dynamic protection.
>
> > With regard to
> > threshold, it is a dynamic decayed watermark value which represents
> > the historic(watermark) and present(update to new usage if it expands
> > again) usage. Actually, I have update the code by adding opt-in code
> > which means this is a opt type of the memcg. This patch is coherent to
> > the original design if user want to set the fixed value by default and
> > also provide a new way of dynamic protected memcg without external
> > monitor and interactivation.
>
> The more I read here to more I am convinced that hooking into low/min
> limits is simply wrong. If you want to achieve a more "clever" way to
> balance memory reclaim among existing memcgs then fine but trying to
> achieve that by dynamically interpreting low limits is just an abuse of
> an existing interface IMO. What has led you to (ab)use low limit in the
> first place?
>
> > We simply test above change by comparing it with current design on a v5.4 based
> > system in 3GB RAM in bellowing steps, via which we can find that fixed
> > memory.low have the system experience high memory pressure with holding too
> > much memory.
> >
> > 1. setting up the topology seperatly as [1]
> > 2. place a memory cost process into B and have it consume 1GB memory
> > from userspace.
> > 3. generating global memory pressure via mlock 1GB memory.
> > 4. watching B's memory.current and PSI_MEM.
> > 5. repeat 3,4 twice.
>
> This is effectively an OOM test, isn't it? Memory low protection will
> be enforced as long as there is something reclaimable but your memory
> pressure is unreclaimable due to mlock so it has a stronger guarantee
> than low limit so the protected memcg is going to be reclaimed.
Mlock was only used for ease of testing. High pressure could equally
be introduced by file cache thrashing (the files are mapped, so they
would have a very short refault distance) or by kernel driver
allocations (multimedia or AI scenarios on Android). I will retest
without mlock then.
>
> Maybe I am just not following but this makes less and less sense as I am
> reading through. So either I am missing something really significant or
> we are just not on the same page.
As I said, I would like this patch to provide an opt-in choice for
the user to establish this kind of memcg without bringing any change
to the current design.
> --
> Michal Hocko
> SUSE Labs
Michal Hocko April 5, 2022, 12:08 p.m. UTC | #22
On Mon 04-04-22 21:14:40, Zhaoyang Huang wrote:
[...]
> Please be noticed that this patch DOES protect the memcg when external
> pressure is 1GB as fixed low does.

This is getting more and more confusing (at least to me). Could you
describe the behavior of the reclaim for the following setups/situations?

a) mostly reclaiming a clean page cache - via kswapd
b) same as above but the direct reclaim is necessary but very
   lightweight
c) direct reclaim makes fwd progress but not enough to satisfy the
   allocation request (so the reclaim has to be retried)
d) direct reclaim not making progress and low limit protection is
   ignored.

Say we have several memcgs and only some have low memory protection
configured. What is the user observable state of the protected group and
when and how much the protection can be updated?

I think it would be also helpful to describe the high level semantic of
this feature.

> Besides, how does the admin decide
> the exact number of low/min if it expand from small to even xGB in a
> quick changing scenario?

This is not really related, is it? There are different ways to tune for
the protection.

[...]
Zhaoyang Huang April 6, 2022, 2:11 a.m. UTC | #23
On Tue, Apr 5, 2022 at 8:08 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 04-04-22 21:14:40, Zhaoyang Huang wrote:
> [...]
> > Please be noticed that this patch DOES protect the memcg when external
> > pressure is 1GB as fixed low does.
>
> This is getting more and more confusing (at least to me). Could you
> describe the behavior of the reclaim for the following setups/situations?
>
> a) mostly reclaiming a clean page cache - via kswapd
> b) same as above but the direct reclaim is necessary but very
>    lightweight
> c) direct reclaim makes fwd progress but not enough to satisfy the
>    allocation request (so the reclaim has to be retried)
> d) direct reclaim not making progress and low limit protection is
>    ignored.
>
> Say we have several memcgs and only some have low memory protection
> configured. What is the user observable state of the protected group and
> when and how much the protection can be updated?
I am not sure I understand you correctly. Are you suspicious of the
test result because you think the protected memcg has no chance to
update its protection, or because global reclaim should already have
been satisfied by then (i.e. step d is hard to reach)? Let me try to
answer based on my understanding; please give me feedback if you need
more info. The protection is updated when
mem_cgroup_calculate_protection is called, during either kswapd or
direct reclaim, for each round of priority reclaim, and the memcg's
lruvec is then reached in step d.
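For orientation, roughly where that recalculation sits (a simplified
sketch modeled on shrink_node_memcgs(), not actual kernel code; the
comment marks where this patch's dynamic update would plug in):

/*
 * Simplified sketch of the call site. Every priority round of
 * kswapd or direct reclaim walks the memcg hierarchy, recomputing
 * protection before deciding how hard to scan each group.
 */
static void shrink_node_memcgs_sketch(struct pglist_data *pgdat,
				      struct scan_control *sc)
{
	struct mem_cgroup *target = sc->target_mem_cgroup;
	struct mem_cgroup *memcg = mem_cgroup_iter(target, NULL, NULL);

	do {
		/*
		 * With this patch, elow would be re-derived here from
		 * the decayed watermark and current PSI (assumption).
		 */
		mem_cgroup_calculate_protection(target, memcg);

		if (mem_cgroup_below_min(memcg))
			continue;	/* hard protection: skip */
		if (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim) {
			sc->memcg_low_skipped = 1;
			continue;	/* soft protection until step d */
		}
		/* ... shrink_lruvec() for this memcg ... */
	} while ((memcg = mem_cgroup_iter(target, memcg, NULL)));
}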
>
> I think it would be also helpful to describe the high level semantic of
> this feature.
>
> > Besides, how does the admin decide
> > the exact number of low/min if it expand from small to even xGB in a
> > quick changing scenario?
>
> This is not really related, is it? There are different ways to tune for
> the protection.
I don't think so. IMO, it is hard to protect a memcg whose usage
spans a wide and unpredictable range, especially one that depends on
the scenario. Does the EAS example from the scheduler make more
sense? Compared with legacy CFS, EAS does go against some of the
original design, such as load balancing, by packing small tasks onto
one core.
>
> [...]
> --
> Michal Hocko
> SUSE Labs
Zhaoyang Huang April 6, 2022, 8:21 a.m. UTC | #24
On Tue, Apr 5, 2022 at 8:08 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 04-04-22 21:14:40, Zhaoyang Huang wrote:
> [...]
> > Please be noticed that this patch DOES protect the memcg when external
> > pressure is 1GB as fixed low does.
>
> This is getting more and more confusing (at least to me). Could you
> describe the behavior of the reclaim for the following setups/situations?
>
> a) mostly reclaiming a clean page cache - via kswapd
> b) same as above but the direct reclaim is necessary but very
>    lightweight
> c) direct reclaim makes fwd progress but not enough to satisfy the
>    allocation request (so the reclaim has to be retried)
> d) direct reclaim not making progress and low limit protection is
>    ignored.
>
> Say we have several memcgs and only some have low memory protection
> configured. What is the user observable state of the protected group and
> when and how much the protection can be updated?
OK. I guess you are asking why reclaim from the global LRU or from
other unprotected memcgs does not satisfy the demand, so that the
protected memcg still has to face reclaim? In my experience this is
common with a large amount of malloc from userspace OR high-order
alloc_pages within the kernel. I have retested the previous case with
mlock removed and the trend of the result is the same: the pages on
the global LRU absorb some of the global memory pressure at first,
but the pressure finally reaches the protected memcg.
>
> I think it would be also helpful to describe the high level semantic of
> this feature.
>
> > Besides, how does the admin decide
> > the exact number of low/min if it expand from small to even xGB in a
> > quick changing scenario?
>
> This is not really related, is it? There are different ways to tune for
> the protection.
>
> [...]
> --
> Michal Hocko
> SUSE Labs
Michal Hocko April 7, 2022, 7:40 a.m. UTC | #25
On Wed 06-04-22 10:11:19, Zhaoyang Huang wrote:
> On Tue, Apr 5, 2022 at 8:08 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 04-04-22 21:14:40, Zhaoyang Huang wrote:
> > [...]
> > > Please be noticed that this patch DOES protect the memcg when external
> > > pressure is 1GB as fixed low does.
> >
> > This is getting more and more confusing (at least to me). Could you
> > describe the behavior of the reclaim for the following setups/situations?
> >
> > a) mostly reclaiming a clean page cache - via kswapd
> > b) same as above but the direct reclaim is necessary but very
> >    lightweight
> > c) direct reclaim makes fwd progress but not enough to satisfy the
> >    allocation request (so the reclaim has to be retried)
> > d) direct reclaim not making progress and low limit protection is
> >    ignored.
> >
> > Say we have several memcgs and only some have low memory protection
> > configured. What is the user observable state of the protected group and
> > when and how much the protection can be updated?
> I am not sure I understand you right. Do you doubt the test result
> because you think the protected memcg has no chance to update its
> protection, or because the global reclaim should already have been
> satisfied before that (i.e. step d is hard to reach)? Let me try to
> answer under my understanding; please give me feedback if you need
> more info. The protection is updated when
> mem_cgroup_calculate_protection is called, during either kswapd or
> direct reclaim, for each round of the priority reclaiming, and then
> the memcg's lruvec is reached in step d.

This means that limits are altered even if there is memory to be
reclaimed from other memcgs. Why? How does this line up with the
basic property of the low limit to act as a protection from the reclaim?

> > I think it would be also helpful to describe the high level semantic of
> > this feature.

Please focus on this part. Without a high level semantic explained we
will not move forward.
Zhaoyang Huang April 7, 2022, 8:59 a.m. UTC | #26
On Thu, Apr 7, 2022 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 06-04-22 10:11:19, Zhaoyang Huang wrote:
> > On Tue, Apr 5, 2022 at 8:08 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 04-04-22 21:14:40, Zhaoyang Huang wrote:
> > > [...]
> > > > Please be noticed that this patch DOES protect the memcg when external
> > > > pressure is 1GB as fixed low does.
> > >
> > > This is getting more and more confusing (at least to me). Could you
> > > describe the behavior of the reclaim for the following setups/situations?
> > >
> > > a) mostly reclaiming a clean page cache - via kswapd
> > > b) same as above but the direct reclaim is necessary but very
> > >    lightweight
> > > c) direct reclaim makes fwd progress but not enough to satisfy the
> > >    allocation request (so the reclaim has to be retried)
> > > d) direct reclaim not making progress and low limit protection is
> > >    ignored.
> > >
> > > Say we have several memcgs and only some have low memory protection
> > > configured. What is the user observable state of the protected group and
> > > when and how much the protection can be updated?
> > I am not sure I understand you right. Do you doubt the test result
> > because you think the protected memcg has no chance to update its
> > protection, or because the global reclaim should already have been
> > satisfied before that (i.e. step d is hard to reach)? Let me try to
> > answer under my understanding; please give me feedback if you need
> > more info. The protection is updated when
> > mem_cgroup_calculate_protection is called, during either kswapd or
> > direct reclaim, for each round of the priority reclaiming, and then
> > the memcg's lruvec is reached in step d.
>
> This means that limits are altered even if there is memory to be
> reclaimed from other memcgs. Why? How does this line up with the
> basic property of the low limit to act as a protection from the reclaim?
OK, partially understood. I would say that low's original definition
has changed under this patch: the calculated low only provides
protection while the psi value stays below the setting, and reclaiming
resumes once it exceeds it. This can also be seen from the latest test
result below (same as the previous test but without mlock), which shows
that a memcg with a fixed low pushes the reclaim back to the global LRU
while keeping psi high. Please note that low is updated whenever usage
rises back above it, which means the protection resumes once the memcg
becomes active again (a small userspace model of the decay follows the
table).

                    psi(global=1GB)      max     stable    psi(global=2GB)      max     stable
Low=400MB           some=18 full=11      700MB   600MB     some=20 full=16      400MB   400MB
Low=500MB           some=18 full=13      680MB   540MB     some=27 full=17      500MB   500MB
patch setting1      some=19 full=13      863MB   740MB     some=15 full=10      500MB   500MB
patch setting1      some=14 full=11      640MB   470MB     some=20 full=12      360MB   320MB
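
The decay behind the "patch setting" rows can be modelled in userspace.
The sketch below mirrors the arithmetic of calc_protected_low() from
the patch (see the hunk further down); the 500MB watermark, the
constant 15% PSI, and the single-shot decay over the elapsed time are
assumptions made for illustration, while the real code re-evaluates
roughly every 2 seconds:

/* decay_sketch.c: model of the decayed-low arithmetic, wm_dec_fact=36 */
#include <stdio.h>
#include <stdint.h>

#define SCALE 2048ULL			/* 2048 stands for "1" */
#define WM_DEC_FACT 36			/* 2^36 ns ~= 68.7s to decay 1/2 */

int main(void)
{
	uint64_t watermark = 500ULL << 20;	/* 500MB historic peak */
	uint64_t psi = 15;			/* psi_mem_get() stand-in, in percent */
	uint64_t secs[] = { 0, 17, 34, 68, 137 };
	unsigned int i;

	for (i = 0; i < sizeof(secs) / sizeof(secs[0]); i++) {
		uint64_t dt = secs[i] * 1000000000ULL;	/* elapsed ns */
		uint64_t shifted = dt >> (WM_DEC_FACT - 10);
		uint64_t decay = SCALE - (shifted < SCALE ? shifted : SCALE);
		uint64_t decayed_wm = watermark * decay / SCALE;
		uint64_t low = decayed_wm * (100 - psi) / 100;

		printf("t=%3llus decayed_wm=%3lluMB low=%3lluMB\n",
		       (unsigned long long)secs[i],
		       (unsigned long long)(decayed_wm >> 20),
		       (unsigned long long)(low >> 20));
	}
	return 0;
}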

>
> > > I think it would be also helpful to describe the high level semantic of
> > > this feature.
>
> Please focus on this part. Without a high level semantic explained we
> will not move forward.
> --
> Michal Hocko
> SUSE Labs
Michal Hocko April 7, 2022, 9:44 a.m. UTC | #27
[...]
On Thu 07-04-22 16:59:50, Zhaoyang Huang wrote:
> > This means that limits are altered even if there is memory to be
> > reclaimed from other memcgs. Why? How does this line up with the
> > basic property of the low limit to act as a protection from the reclaim?
> ok, partially understand. I would like to say that low's original
> definition under this patch has changed, says the calculated low just
> provide protection when the psi value is lower than the setting and
> will introduce reclaiming if it exceed.

OK, I guess I finally get to understand what you are trying to say. So
effectively your new semantic defines the low limit as an initial
protection that is very soft and only preserved under a light global
memory pressure[1]. If the reclaim pressure is higher the user provided
protection is decreased. The new semantic is planned to be a global
opt-in.

Correct?

Now, that I (believe) to have a slightly better understanding I have to
say I really dislike the idea.
First of all the new semantic would have to be memcg reclaim aware. That
means that the scaling logic would need to be aware where the memory
pressure comes from.
More importantly I really dislike the idea that the user-provided input
is updated by the kernel without userspace knowledge about that. How is
the userspace supposed to know that the value has been changed? 
I can see how the effective values could be scaled down but this still
sounds dubious as the userspace would have a hard time understanding what
is going on under the cover or even worse tune the value based on the
specific observed behavior for a specific kernel which would make this a
maintenance burden going forward.

All that being said I have a hard time making sense of a protection which
is unpredictably decreasing based on a global metrics without any
userspace view into why and how this is done. So I am afraid I have to
NACK this and I really do recommend you to start a discussion about your
specific usecase and try to find a different solution.

Best regards


[1] this is based on the global PSI metric.
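
The global PSI metric referenced in [1] is the one exposed through
/proc/pressure/memory. Below is a minimal userspace sketch of the
some/full weighting that the patch's psi_mem_get() applies, i.e.
(some * 768 + full * 256) / 1024, or 75% some plus 25% full; reading
only the avg10 window is a simplification here, since the patch selects
the 10s/60s/300s window based on the elapsed time:

/* psi_blend.c: blend the global memory PSI averages like psi_mem_get() */
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/pressure/memory", "r");
	char type[8];
	double avg10, avg60, avg300;
	unsigned long long total;
	unsigned long some = 0, full = 0;

	if (!f)
		return 1;
	while (fscanf(f, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		      type, &avg10, &avg60, &avg300, &total) == 5) {
		if (!strcmp(type, "some"))
			some = (unsigned long)avg10;	/* like LOAD_INT(): integer part */
		else if (!strcmp(type, "full"))
			full = (unsigned long)avg10;
	}
	fclose(f);
	printf("decay_factor ~= %lu%%\n", (some * 768 + full * 256) / 1024);
	return 0;
}
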
Zhaoyang Huang April 7, 2022, 12:36 p.m. UTC | #28
On Thu, Apr 7, 2022 at 5:44 PM Michal Hocko <mhocko@suse.com> wrote:
>
> [...]
> On Thu 07-04-22 16:59:50, Zhaoyang Huang wrote:
> > > This means that limits are altered even if there is memory to be
> > > reclaimed from other memcgs. Why? How does this line up with the
> > > basic property of the low limit to act as a protection from the reclaim?
> > ok, partially understand. I would like to say that low's original
> > definition under this patch has changed, says the calculated low just
> > provide protection when the psi value is lower than the setting and
> > will introduce reclaiming if it exceed.
>
> OK, I guess I finally get to understand what you are trying to say. So
> effectively your new semantic defines the low limit as an initial
> protection that is very soft and only preserved under a light global
> memory pressure[1]. If the reclaim pressure is higher the user provided
> protection is decreased. The new semantic is planned to be a global
> opt-in.
>
> Correct?
Right. But I don't think the original protection is soft; this can be
seen from the test result showing that the memcg is protected within a
certain range of pressure and can also help relieve the system by
breaking the low limit.
>
> Now, that I (believe) to have a slightly better understanding I have to
> say I really dislike the idea.
> First of all the new semantic would have to be memcg reclaim aware. That
> means that the scaling logic would need to be aware where the memory
> pressure comes from.
I don't follow. Does it mean that the protected memcg should
distinguish pressure coming from the global level from pressure coming
from other memcgs? I don't see why.
> More importantly I really dislike the idea that the user-provided input
> is updated by the kernel without userspace knowledge about that. How is
> the userspace supposed to know that the value has been changed?
Actually, the purpose of this patch is to free the userspace at
runtime: it requires a proper setup of the parameters once and then
lets the scheme decide the rest.
> I can see how the effective values could be scaled down but this still
> sounds dubious as the userspace would have a hard time understanding what
> is going on under the cover or even worse tune the value based on the
> specific observed behavior for a specific kernel which would make this a
> maintenance burden going forward.
This kind of memcg is supposed to be used by users who are aware of
the scheme and want it to behave exactly this way.
>
> All that being said I have a hard time making sense of a protection which
> is unpredictably decreasing based on a global metrics without any
> userspace view into why and how this is done. So I am afraid I have to
> NACK this and I really do recommend you to start a discussion about your
> specific usecase and try to find a different solution.
As I have mentioned before, the EAS scheduler is also a self-driven
scheme, based on load tracking and energy calculation. It is likewise
hard for the user to know when a sched entity will be scheduled onto a
big core, and the admin can turn it off if they dislike it. I would
like to push this patch forward and get more feedback from real
scenarios.
>
> Best regards
>
>
> [1] this is based on the global PSI metric.
> --
> Michal Hocko
> SUSE Labs
Michal Hocko April 7, 2022, 2:14 p.m. UTC | #29
On Thu 07-04-22 20:36:51, Zhaoyang Huang wrote:
> On Thu, Apr 7, 2022 at 5:44 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > [...]
> > On Thu 07-04-22 16:59:50, Zhaoyang Huang wrote:
> > > > This means that limits are altered even if there is memory to be
> > > > reclaimed from other memcgs. Why? How does this line up with the
> > > > basic property of the low limit to act as a protection from the reclaim?
> > > ok, partially understand. I would like to say that low's original
> > > definition under this patch has changed, says the calculated low just
> > > provide protection when the psi value is lower than the setting and
> > > will introduce reclaiming if it exceed.
> >
> > OK, I guess I finally get to understand what you are trying to say. So
> > effectively your new semantic defines the low limit as an initial
> > protection that is very soft and only preserved under a light global
> > memory pressure[1]. If the reclaim pressure is higher the user provided
> > protection is decreased. The new semantic is planned to be a global
> > opt-in.
> >
> > Correct?
> Right. But I don't think the original protection is soft; this can be
> seen from the test result showing that the memcg is protected within a
> certain range of pressure and can also help relieve the system by
> breaking the low limit.

Low limit protection is considered soft because it doesn't provide any
guarantee. It will be ignored (and that will be reported to the
userspace via the LOW event) if there is nothing reclaimable in the
scope of the reclaimed hierarchy. The alternative would actually be OOM.
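
The LOW event mentioned above is the "low" counter in the group's
memory.events file, which counts how many times reclaim went ahead even
though usage was under the low boundary. A minimal reader, assuming a
cgroup2 mount at /sys/fs/cgroup and a group named "example":

/* low_events.c: report how often memory.low has been breached */
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/example/memory.events", "r");
	char key[32];
	unsigned long long val;

	if (!f)
		return 1;
	while (fscanf(f, "%31s %llu", key, &val) == 2)
		if (!strcmp(key, "low"))	/* times the protection was ignored */
			printf("low events: %llu\n", val);
	fclose(f);
	return 0;
}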

> > Now, that I (believe) to have a slightly better understanding I have to
> > say I really dislike the idea.
> > First of all the new semantic would have to be memcg reclaim aware. That
> > means that the scaling logic would need to be aware where the memory
> > pressure comes from.
> I don't follow. Does it mean that the protected memcg should
> distinguish pressure coming from the global level from pressure coming
> from other memcgs? I don't see why.

No, it should behave consistently for any external memory pressure.
A reclaimed memcg can apply a different constraint depending on the
root of the reclaim. Your solution considers the root to always be
root_memcg.
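
To make the reclaim-root point concrete, here is a toy model rather
than the kernel code: the proportional distribution that
effective_protection() performs is simplified to a plain min(), but it
shows how the same memcg gets a different effective low depending on
where the reclaim starts (compare the parent == root branch in the
mem_cgroup_calculate_protection() hunk below):

/* reclaim_root.c: effective low depends on the root of the reclaim */
#include <stdio.h>

struct memcg {
	const char *name;
	unsigned long low, elow;
	struct memcg *parent;
};

static void calculate_protection(struct memcg *root, struct memcg *m)
{
	if (m == root) {
		m->elow = 0;		/* no protection against its own reclaim */
		return;
	}
	if (m->parent == root) {
		m->elow = m->low;	/* taken at face value */
		return;
	}
	calculate_protection(root, m->parent);
	/* simplification: cap by the parent's effective protection */
	m->elow = m->low < m->parent->elow ? m->low : m->parent->elow;
}

int main(void)
{
	struct memcg root_memcg = { "root", 0, 0, NULL };
	struct memcg a = { "A", 100, 0, &root_memcg };
	struct memcg b = { "B", 400, 0, &a };

	calculate_protection(&root_memcg, &b);	/* global reclaim */
	printf("global reclaim:  B.elow=%lu\n", b.elow);	/* capped at 100 */
	calculate_protection(&a, &b);		/* reclaim rooted at A */
	printf("reclaim under A: B.elow=%lu\n", b.elow);	/* full 400 */
	return 0;
}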

[...]
diff mbox series

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c5c403..a510057 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -21,6 +21,9 @@ 
 #include <linux/vmstat.h>
 #include <linux/writeback.h>
 #include <linux/page-flags.h>
+#include <linux/sched/loadavg.h>
+#include <linux/sched/clock.h>
+#include <linux/psi.h>
 
 struct mem_cgroup;
 struct obj_cgroup;
@@ -28,6 +31,8 @@ 
 struct mm_struct;
 struct kmem_cache;
 
+#define MEMCG_INTERVAL	(2*HZ+1)	/* 2 sec intervals */
+
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
 	MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
@@ -340,6 +345,10 @@  struct mem_cgroup {
 	struct deferred_split deferred_split_queue;
 #endif
 
+	u64 wm_dec_fact;
+	u64 avg_next_update;
+	u64 avg_last_update;
+
 	struct mem_cgroup_per_node *nodeinfo[];
 };
 
@@ -608,6 +617,47 @@  static inline bool mem_cgroup_disabled(void)
 	return !cgroup_subsys_enabled(memory_cgrp_subsys);
 }
 
+/*
+ * calculate memory.low based on the historic watermark and memory pressure
+ */
+static inline void calc_protected_low(struct mem_cgroup *group)
+{
+	u64 now, decay_factor;
+	u64 decayed_watermark;
+	u64 delta_time;
+
+	now = sched_clock();
+
+	if (!group->avg_next_update) {
+		group->avg_next_update = now + jiffies_to_nsecs(5*HZ);
+		return;
+	}
+
+	if (time_before((unsigned long)now, (unsigned long)group->avg_next_update))
+		return;
+
+	delta_time = group->avg_last_update ? now - group->avg_last_update : 0;
+	/*
+	 * we take 2048 as "1" and 68s decay 1/2(36bit) by default
+	 * decay_factor = 1024 * delta_time / 68s(0x1000000000)
+	 * 0.5(1024)/68s = decay_factor/delta_time ==> decay_factor = delta_time >> 26
+	 */
+	decay_factor = (2048 - min(2048ULL, delta_time >> (group->wm_dec_fact - 10)));
+	decayed_watermark = group->memory.decayed_watermark * decay_factor / 2048;
+	/* decay_factor: based on average memory pressure over elapsed time */
+	decay_factor = psi_mem_get(delta_time);
+	group->memory.low = decayed_watermark * (100 - decay_factor) / 100;
+
+	/*
+	 * avg_next_update: expected expire time according to current status
+	 */
+	group->memory.decayed_watermark = decayed_watermark;
+	group->avg_last_update = now;
+	group->avg_next_update = now + jiffies_to_nsecs(2*HZ);
+
+	return;
+}
+
 static inline void mem_cgroup_protection(struct mem_cgroup *root,
 					 struct mem_cgroup *memcg,
 					 unsigned long *min,
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 6795913..2720eb9f 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -25,8 +25,12 @@  struct page_counter {
 
 	/* legacy */
 	unsigned long watermark;
+	unsigned long decayed_watermark;
 	unsigned long failcnt;
 
+	/* proportional protection */
+	unsigned long min_prop;
+	unsigned long low_prop;
 	/*
 	 * 'parent' is placed here to be far from 'usage' to reduce
 	 * cache false sharing, as 'usage' is written mostly while
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 65eb147..6c76993 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -25,6 +25,8 @@  void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 
 int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
 
+unsigned long psi_mem_get(unsigned long time);
+
 #ifdef CONFIG_CGROUPS
 int psi_cgroup_alloc(struct cgroup *cgrp);
 void psi_cgroup_free(struct cgroup *cgrp);
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index dd80bd2..8d315e0 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -291,6 +291,24 @@  static void get_recent_times(struct psi_group *group, int cpu,
 	}
 }
 
+unsigned long psi_mem_get(unsigned long time_ns)
+{
+	unsigned long time_sec = time_ns / (1000 * 1000 * 1000);
+	unsigned long some, full;
+	if (time_sec < 10) {
+		some = LOAD_INT(psi_system.avg[PSI_MEM * 2][0]);
+		full = LOAD_INT(psi_system.avg[PSI_MEM * 2 + 1][0]);
+	} else if (time_sec < 60) {
+		some = LOAD_INT(psi_system.avg[PSI_MEM * 2][1]);
+		full = LOAD_INT(psi_system.avg[PSI_MEM * 2 + 1][1]);
+	} else {
+		some = LOAD_INT(psi_system.avg[PSI_MEM * 2][2]);
+		full = LOAD_INT(psi_system.avg[PSI_MEM * 2 + 1][2]);
+	}
+
+	return (some * 768 + full * 256) / 1024;
+}
+
 static void calc_avgs(unsigned long avg[3], int missed_periods,
 		      u64 time, u64 period)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 508bcea..6b579a4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5188,6 +5188,7 @@  static struct mem_cgroup *mem_cgroup_alloc(void)
 	page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
 	memcg->soft_limit = PAGE_COUNTER_MAX;
 	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
+	memcg->wm_dec_fact = 36;
 	if (parent) {
 		memcg->swappiness = mem_cgroup_swappiness(parent);
 		memcg->oom_kill_disable = parent->oom_kill_disable;
@@ -6616,6 +6617,8 @@  void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 {
 	unsigned long usage, parent_usage;
 	struct mem_cgroup *parent;
+	unsigned long memcg_emin, memcg_elow, parent_emin, parent_elow;
+	unsigned long watermark;
 
 	if (mem_cgroup_disabled())
 		return;
@@ -6642,6 +6645,7 @@  void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 	if (!parent)
 		return;
 
+	calc_protected_low(memcg);
 	if (parent == root) {
 		memcg->memory.emin = READ_ONCE(memcg->memory.min);
 		memcg->memory.elow = READ_ONCE(memcg->memory.low);
diff --git a/mm/page_counter.c b/mm/page_counter.c
index 7d83641..18abfdd 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -83,6 +83,8 @@  void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
 		 */
 		if (new > READ_ONCE(c->watermark))
 			WRITE_ONCE(c->watermark, new);
+		if (new > READ_ONCE(c->decayed_watermark))
+			WRITE_ONCE(c->decayed_watermark, new);
 	}
 }
 
@@ -137,6 +139,8 @@  bool page_counter_try_charge(struct page_counter *counter,
 		 */
 		if (new > READ_ONCE(c->watermark))
 			WRITE_ONCE(c->watermark, new);
+		if (new > READ_ONCE(c->decayed_watermark))
+			WRITE_ONCE(c->decayed_watermark, new);
 	}
 	return true;