diff mbox series

[v2,2/3] mm: memcontrol: clean up and document effective low/min calculations

Message ID 20191219200718.15696-3-hannes@cmpxchg.org (mailing list archive)
State New, archived
Headers show
Series mm: memcontrol: recursive memory protection | expand

Commit Message

Johannes Weiner Dec. 19, 2019, 8:07 p.m. UTC
The effective protection of any given cgroup is a somewhat complicated
construct that depends on the ancestor's configuration, siblings'
configurations, as well as current memory utilization in all these
groups. It's done this way to satisfy hierarchical delegation
requirements while also making the configuration semantics flexible
and expressive in complex real life scenarios.

Unfortunately, all the rules and requirements are sparsely documented,
and the code is a little too clever in merging different scenarios
into a single min() expression. This makes it hard to reason about the
implementation and avoid breaking semantics when making changes to it.

This patch documents each semantic rule individually and splits out
the handling of the overcommit case from the regular case.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 190 ++++++++++++++++++++++++++----------------------
 1 file changed, 104 insertions(+), 86 deletions(-)

Comments

Michal Hocko Jan. 30, 2020, 12:54 p.m. UTC | #1
On Thu 19-12-19 15:07:17, Johannes Weiner wrote:
> The effective protection of any given cgroup is a somewhat complicated
> construct that depends on the ancestor's configuration, siblings'
> configurations, as well as current memory utilization in all these
> groups. It's done this way to satisfy hierarchical delegation
> requirements while also making the configuration semantics flexible
> and expressive in complex real life scenarios.
> 
> Unfortunately, all the rules and requirements are sparsely documented,
> and the code is a little too clever in merging different scenarios
> into a single min() expression. This makes it hard to reason about the
> implementation and avoid breaking semantics when making changes to it.
> 
> This patch documents each semantic rule individually and splits out
> the handling of the overcommit case from the regular case.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

It took me a while to double check that the new code is indeed
equivalent but the result is easier to grasp indeed.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 190 ++++++++++++++++++++++++++----------------------
>  1 file changed, 104 insertions(+), 86 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 874a0b00f89b..9c771c4d6339 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6204,65 +6204,55 @@ struct cgroup_subsys memory_cgrp_subsys = {
>  	.early_init = 0,
>  };
>  
> -/**
> - * mem_cgroup_protected - check if memory consumption is in the normal range
> - * @root: the top ancestor of the sub-tree being checked
> - * @memcg: the memory cgroup to check
> - *
> - * WARNING: This function is not stateless! It can only be used as part
> - *          of a top-down tree iteration, not for isolated queries.
> - *
> - * Returns one of the following:
> - *   MEMCG_PROT_NONE: cgroup memory is not protected
> - *   MEMCG_PROT_LOW: cgroup memory is protected as long there is
> - *     an unprotected supply of reclaimable memory from other cgroups.
> - *   MEMCG_PROT_MIN: cgroup memory is protected
> - *
> - * @root is exclusive; it is never protected when looked at directly
> +/*
> + * This function calculates an individual cgroup's effective
> + * protection which is derived from its own memory.min/low, its
> + * parent's and siblings' settings, as well as the actual memory
> + * distribution in the tree.
>   *
> - * To provide a proper hierarchical behavior, effective memory.min/low values
> - * are used. Below is the description of how effective memory.low is calculated.
> - * Effective memory.min values is calculated in the same way.
> + * The following rules apply to the effective protection values:
>   *
> - * Effective memory.low is always equal or less than the original memory.low.
> - * If there is no memory.low overcommittment (which is always true for
> - * top-level memory cgroups), these two values are equal.
> - * Otherwise, it's a part of parent's effective memory.low,
> - * calculated as a cgroup's memory.low usage divided by sum of sibling's
> - * memory.low usages, where memory.low usage is the size of actually
> - * protected memory.
> + * 1. At the first level of reclaim, effective protection is equal to
> + *    the declared protection in memory.min and memory.low.
>   *
> - *                                             low_usage
> - * elow = min( memory.low, parent->elow * ------------------ ),
> - *                                        siblings_low_usage
> + * 2. To enable safe delegation of the protection configuration, at
> + *    subsequent levels the effective protection is capped to the
> + *    parent's effective protection.
>   *
> - * low_usage = min(memory.low, memory.current)
> + * 3. To make complex and dynamic subtrees easier to configure, the
> + *    user is allowed to overcommit the declared protection at a given
> + *    level. If that is the case, the parent's effective protection is
> + *    distributed to the children in proportion to how much protection
> + *    they have declared and how much of it they are utilizing.
>   *
> + *    This makes distribution proportional, but also work-conserving:
> + *    if one cgroup claims much more protection than it uses memory,
> + *    the unused remainder is available to its siblings.
>   *
> - * Such definition of the effective memory.low provides the expected
> - * hierarchical behavior: parent's memory.low value is limiting
> - * children, unprotected memory is reclaimed first and cgroups,
> - * which are not using their guarantee do not affect actual memory
> - * distribution.
> + *    Consider the following example tree:
>   *
> - * For example, if there are memcgs A, A/B, A/C, A/D and A/E:
> + *        A      A/memory.low = 2G, A/memory.current = 6G
> + *       //\\
> + *      BC  DE   B/memory.low = 3G  B/memory.current = 2G
> + *               C/memory.low = 1G  C/memory.current = 2G
> + *               D/memory.low = 0   D/memory.current = 2G
> + *               E/memory.low = 10G E/memory.current = 0
>   *
> - *     A      A/memory.low = 2G, A/memory.current = 6G
> - *    //\\
> - *   BC  DE   B/memory.low = 3G  B/memory.current = 2G
> - *            C/memory.low = 1G  C/memory.current = 2G
> - *            D/memory.low = 0   D/memory.current = 2G
> - *            E/memory.low = 10G E/memory.current = 0
> + *    and memory pressure is applied, the following memory
> + *    distribution is expected (approximately*):
>   *
> - * and the memory pressure is applied, the following memory distribution
> - * is expected (approximately):
> + *      A/memory.current = 2G
> + *      B/memory.current = 1.3G
> + *      C/memory.current = 0.6G
> + *      D/memory.current = 0
> + *      E/memory.current = 0
>   *
> - *     A/memory.current = 2G
> + *    *assuming equal allocation rate and reclaimability
>   *
> - *     B/memory.current = 1.3G
> - *     C/memory.current = 0.6G
> - *     D/memory.current = 0
> - *     E/memory.current = 0
> + * 4. Conversely, when the declared protection is undercommitted at a
> + *    given level, the distribution of the larger parental protection
> + *    budget is NOT proportional. A cgroup's protection from a sibling
> + *    is capped to its own memory.min/low setting.
>   *
>   * These calculations require constant tracking of the actual low usages
>   * (see propagate_protected_usage()), as well as recursive calculation of
> @@ -6272,12 +6262,63 @@ struct cgroup_subsys memory_cgrp_subsys = {
>   * for next usage. This part is intentionally racy, but it's ok,
>   * as memory.low is a best-effort mechanism.
>   */
> +static unsigned long effective_protection(unsigned long usage,
> +					  unsigned long setting,
> +					  unsigned long parent_effective,
> +					  unsigned long siblings_protected)
> +{
> +	unsigned long protected;
> +
> +	protected = min(usage, setting);
> +	/*
> +	 * If all cgroups at this level combined claim and use more
> +	 * protection then what the parent affords them, distribute
> +	 * shares in proportion to utilization.
> +	 *
> +	 * We are using actual utilization rather than the statically
> +	 * claimed protection in order to be work-conserving: claimed
> +	 * but unused protection is available to siblings that would
> +	 * otherwise get a smaller chunk than what they claimed.
> +	 */
> +	if (siblings_protected > parent_effective)
> +		return protected * parent_effective / siblings_protected;
> +
> +	/*
> +	 * Ok, utilized protection of all children is within what the
> +	 * parent affords them, so we know whatever this child claims
> +	 * and utilizes is effectively protected.
> +	 *
> +	 * If there is unprotected usage beyond this value, reclaim
> +	 * will apply pressure in proportion to that amount.
> +	 *
> +	 * If there is unutilized protection, the cgroup will be fully
> +	 * shielded from reclaim, but we do return a smaller value for
> +	 * protection than what the group could enjoy in theory. This
> +	 * is okay. With the overcommit distribution above, effective
> +	 * protection is always dependent on how memory is actually
> +	 * consumed among the siblings anyway.
> +	 */
> +	return protected;
> +}
> +
> +/**
> + * mem_cgroup_protected - check if memory consumption is in the normal range
> + * @root: the top ancestor of the sub-tree being checked
> + * @memcg: the memory cgroup to check
> + *
> + * WARNING: This function is not stateless! It can only be used as part
> + *          of a top-down tree iteration, not for isolated queries.
> + *
> + * Returns one of the following:
> + *   MEMCG_PROT_NONE: cgroup memory is not protected
> + *   MEMCG_PROT_LOW: cgroup memory is protected as long there is
> + *     an unprotected supply of reclaimable memory from other cgroups.
> + *   MEMCG_PROT_MIN: cgroup memory is protected
> + */
>  enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
>  						struct mem_cgroup *memcg)
>  {
>  	struct mem_cgroup *parent;
> -	unsigned long emin, parent_emin;
> -	unsigned long elow, parent_elow;
>  	unsigned long usage;
>  
>  	if (mem_cgroup_disabled())
> @@ -6292,52 +6333,29 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
>  	if (!usage)
>  		return MEMCG_PROT_NONE;
>  
> -	emin = memcg->memory.min;
> -	elow = memcg->memory.low;
> -
>  	parent = parent_mem_cgroup(memcg);
>  	/* No parent means a non-hierarchical mode on v1 memcg */
>  	if (!parent)
>  		return MEMCG_PROT_NONE;
>  
> -	if (parent == root)
> -		goto exit;
> -
> -	parent_emin = READ_ONCE(parent->memory.emin);
> -	emin = min(emin, parent_emin);
> -	if (emin && parent_emin) {
> -		unsigned long min_usage, siblings_min_usage;
> -
> -		min_usage = min(usage, memcg->memory.min);
> -		siblings_min_usage = atomic_long_read(
> -			&parent->memory.children_min_usage);
> -
> -		if (min_usage && siblings_min_usage)
> -			emin = min(emin, parent_emin * min_usage /
> -				   siblings_min_usage);
> +	if (parent == root) {
> +		memcg->memory.emin = memcg->memory.min;
> +		memcg->memory.elow = memcg->memory.low;
> +		goto out;
>  	}
>  
> -	parent_elow = READ_ONCE(parent->memory.elow);
> -	elow = min(elow, parent_elow);
> -	if (elow && parent_elow) {
> -		unsigned long low_usage, siblings_low_usage;
> -
> -		low_usage = min(usage, memcg->memory.low);
> -		siblings_low_usage = atomic_long_read(
> -			&parent->memory.children_low_usage);
> +	memcg->memory.emin = effective_protection(usage,
> +			memcg->memory.min, READ_ONCE(parent->memory.emin),
> +			atomic_long_read(&parent->memory.children_min_usage));
>  
> -		if (low_usage && siblings_low_usage)
> -			elow = min(elow, parent_elow * low_usage /
> -				   siblings_low_usage);
> -	}
> -
> -exit:
> -	memcg->memory.emin = emin;
> -	memcg->memory.elow = elow;
> +	memcg->memory.elow = effective_protection(usage,
> +			memcg->memory.low, READ_ONCE(parent->memory.elow),
> +			atomic_long_read(&parent->memory.children_low_usage));
>  
> -	if (usage <= emin)
> +out:
> +	if (usage <= memcg->memory.emin)
>  		return MEMCG_PROT_MIN;
> -	else if (usage <= elow)
> +	else if (usage <= memcg->memory.elow)
>  		return MEMCG_PROT_LOW;
>  	else
>  		return MEMCG_PROT_NONE;
> -- 
> 2.24.1
>
Michal Koutný Feb. 21, 2020, 5:10 p.m. UTC | #2
On Thu, Dec 19, 2019 at 03:07:17PM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> The effective protection of any given cgroup is a somewhat complicated
> construct that depends on the ancestor's configuration, siblings'
> configurations, as well as current memory utilization in all these
> groups.
I agree with that. It makes it a bit hard to determine the equilibrium
in advance.


> + *    Consider the following example tree:
>   *
> + *        A      A/memory.low = 2G, A/memory.current = 6G
> + *       //\\
> + *      BC  DE   B/memory.low = 3G  B/memory.current = 2G
> + *               C/memory.low = 1G  C/memory.current = 2G
> + *               D/memory.low = 0   D/memory.current = 2G
> + *               E/memory.low = 10G E/memory.current = 0
>   *
> + *    and memory pressure is applied, the following memory
> + *    distribution is expected (approximately*):
>   *
> + *      A/memory.current = 2G
> + *      B/memory.current = 1.3G
> + *      C/memory.current = 0.6G
> + *      D/memory.current = 0
> + *      E/memory.current = 0
>   *
> + *    *assuming equal allocation rate and reclaimability
I think the assumptions for this example don't hold (anymore).
Because reclaim rate depends on the usage above protection, the siblings
won't be reclaimed equally and so the low_usage proportionality will
change over time and the equilibrium distribution is IMO different (I'm
attaching an Octave script to calculate it).

As it depends on the initial usage, I don't think there can be given
such a general example (for overcommit).


> @@ -6272,12 +6262,63 @@ struct cgroup_subsys memory_cgrp_subsys = {
>   * for next usage. This part is intentionally racy, but it's ok,
>   * as memory.low is a best-effort mechanism.
Although it's a different issue but since this updates the docs I'm
mentioning it -- we treat memory.min the same, i.e. it's subject to the
same race, however, it's not meant to be best effort. I didn't look into
outcomes of potential misaccounting but the comment seems to miss impact
on memory.min protection.

> @@ -6292,52 +6333,29 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> [...]
> +	if (parent == root) {
> +		memcg->memory.emin = memcg->memory.min;
> +		memcg->memory.elow = memcg->memory.low;
> +		goto out;
>  	}
Shouldn't this condition be 'if (parent == root_mem_cgroup)'? (I.e. 1st
level takes direct input, but 2nd and further levels redistribute only
what they really got from parent.)


Michal
% run as: octave-cli script
%
% Input configurations
% -------------------
% E parent effective protection
% n nominal protection of siblings set at the givel level
% c current consumption -,,-

% example from effective_protection 3.
E = 2;
n = [3 1 0 10];
c = [2 2 2 0];  % this converges to      [1.16 0.84 0 0]
% c = [6 2 2 0];  % keeps ratio          [1.5 0.5 0 0]
% c = [5 2 2 0];  % mixed ratio          [1.45 0.55 0 0]
% c = [8 2 2 0];  % mixed ratio          [1.53 0.47 0 0]

% example from effective_protection 5.
%E = 2;
%n = [1 0];
%c = [2 1];  % coming from close to equilibrium  -> [1.50 0.50]
%c = [100 100];  % coming from "infinity"        -> [1.50 0.50]
%c = [2 2];   % coming from uniformity            -> [1.33 0.67]

% example of recursion by default
%E = 2;
%n = [0 0];
%c = [2 1];  % coming from disbalance            -> [1.33 0.67]
%c = [100 100];  % coming from "infinity"        -> [1.00 1.00]
%c = [2 2];   % coming from uniformity           -> [1.00 1.00]

% example by using infinities (_without_ recursive protection)
%E = 2;
%n = [1e7 1e7];
%c = [2 1];  % coming from disbalance            -> [1.33 0.67]
%c = [100 100];  % coming from "infinity"        -> [1.00 1.00]
%c = [2 2];   % coming from uniformity           -> [1.00 1.00]

% Reclaim parameters
% ------------------

% Minimal reclaim amount (GB)
cluster = 4e-6;

% Reclaim coefficient (think as 0.5^sc->priority)
alpha = .1

% Simulation parameters
% ---------------------
epsilon = 1e-7;
timeout = 1000;

% Simulation loop
% ---------------------
% Simulation assumes siblings consumed the initial amount of memory (w/out
% reclaim) and then the reclaim starts, all memory is reclaimable, i.e. treated
% same. It simulates only non-low reclaim and assumes all memory.min = 0.

ch = [];
eh = [];
rh = [];

for t = 1:timeout
	% low_usage
	u = min(c, n);
	siblings = sum(u);

	% effective_protection()
	protected = min(n, c);                % start with nominal
	e = protected * min(1, E / siblings); % normalize overcommit

	% recursive protection
	unclaimed = max(0, E - siblings);
	parent_overuse = sum(c) - siblings;
	if (unclaimed > 0 && parent_overuse > 0)
		overuse = max(0, c - protected);
		e += unclaimed * (overuse / parent_overuse);
	endif

	% get_scan_count()
	r = alpha * c;             % assume all memory is in a single LRU list

	% 1bc63fb1272b ("mm, memcg: make scan aggression always exclude protection")
	sz = max(e, c);
	r .*= (1 - (e+epsilon) ./ (sz+epsilon));

	% uncomment to debug prints
	e, c, r
	
	% nothing to reclaim, reached equilibrium
	if max(r) < epsilon
		break;
	endif

	% SWAP_CLUSTER_MAX
	r = max(r, (r > epsilon) .* cluster);
	c = max(c - r, 0);
	
	ch = [ch ; c];
	eh = [eh ; e];
	rh = [rh ; r];
endfor

t
c, e
plot([ch, eh])
pause()
Johannes Weiner Feb. 25, 2020, 6:40 p.m. UTC | #3
Hello Michal,

On Fri, Feb 21, 2020 at 06:10:24PM +0100, Michal Koutný wrote:
> On Thu, Dec 19, 2019 at 03:07:17PM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > + *    Consider the following example tree:
> >   *
> > + *        A      A/memory.low = 2G, A/memory.current = 6G
> > + *       //\\
> > + *      BC  DE   B/memory.low = 3G  B/memory.current = 2G
> > + *               C/memory.low = 1G  C/memory.current = 2G
> > + *               D/memory.low = 0   D/memory.current = 2G
> > + *               E/memory.low = 10G E/memory.current = 0
> >   *
> > + *    and memory pressure is applied, the following memory
> > + *    distribution is expected (approximately*):
> >   *
> > + *      A/memory.current = 2G
> > + *      B/memory.current = 1.3G
> > + *      C/memory.current = 0.6G
> > + *      D/memory.current = 0
> > + *      E/memory.current = 0
> >   *
> > + *    *assuming equal allocation rate and reclaimability
> I think the assumptions for this example don't hold (anymore).
> Because reclaim rate depends on the usage above protection, the siblings
> won't be reclaimed equally and so the low_usage proportionality will
> change over time and the equilibrium distribution is IMO different (I'm
> attaching an Octave script to calculate it).

Hm, this example doesn't change with my patch because there is no
"floating" protection that gets distributed among the siblings.

In my testing with the above parameters, the equilibrium still comes
out to roughly this distribution.

> As it depends on the initial usage, I don't think there can be given
> such a general example (for overcommit).

It's just to illustrate the pressure weight, not to reflect each
factor that can influence the equilibrium.

I think it still has value to gain understanding of how it works, no?

> > @@ -6272,12 +6262,63 @@ struct cgroup_subsys memory_cgrp_subsys = {
> >   * for next usage. This part is intentionally racy, but it's ok,
> >   * as memory.low is a best-effort mechanism.
> Although it's a different issue but since this updates the docs I'm
> mentioning it -- we treat memory.min the same, i.e. it's subject to the
> same race, however, it's not meant to be best effort. I didn't look into
> outcomes of potential misaccounting but the comment seems to miss impact
> on memory.min protection.

Yeah I think we can delete that bit.

> > @@ -6292,52 +6333,29 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> > [...]
> > +	if (parent == root) {
> > +		memcg->memory.emin = memcg->memory.min;
> > +		memcg->memory.elow = memcg->memory.low;
> > +		goto out;
> >  	}
> Shouldn't this condition be 'if (parent == root_mem_cgroup)'? (I.e. 1st
> level takes direct input, but 2nd and further levels redistribute only
> what they really got from parent.)

I believe we cleared this up in the parallel thread, but just in case:
reclaim can happen due to a memory.max set lower in the
tree. memory.low propagation is always relative from the reclaim
scope, not the system-wide root cgroup.
Michal Koutný Feb. 26, 2020, 4:46 p.m. UTC | #4
On Tue, Feb 25, 2020 at 01:40:14PM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Hm, this example doesn't change with my patch because there is no
> "floating" protection that gets distributed among the siblings.
Maybe it had changed even earlier and the example obsoleted.

> In my testing with the above parameters, the equilibrium still comes
> out to roughly this distribution.
I'm attaching my test (10-times smaller) and I'm getting these results:

> /sys/fs/cgroup/test.slice/memory.current:838750208
> /sys/fs/cgroup/test.slice/pressure.service/memory.current:616972288
> /sys/fs/cgroup/test.slice/test-A.slice/memory.current:221782016
> /sys/fs/cgroup/test.slice/test-A.slice/B.service/memory.current:123428864
> /sys/fs/cgroup/test.slice/test-A.slice/C.service/memory.current:93495296
> /sys/fs/cgroup/test.slice/test-A.slice/D.service/memory.current:4702208
> /sys/fs/cgroup/test.slice/test-A.slice/E.service/memory.current:155648

(I'm running that on 5.6.0-rc2 + first two patches of your series.)

That's IMO closer to the my simulation (1.16:0.84)
than the example prediction (1.3:0.6)

> It's just to illustrate the pressure weight, not to reflect each
> factor that can influence the equilibrium.
But it's good to have some idea about the equilibrium when configuring
the values. 

> I think it still has value to gain understanding of how it works, no?
Alas, the example confused me so that I had to write the simulation to
get grasp of it :-)

And even when running actual code now, I'd say the values in the
original example are only one of the equlibriums but definitely not
reachable from the stated initial conditions.

> > > @@ -6272,12 +6262,63 @@ struct cgroup_subsys memory_cgrp_subsys = {
> > >   * for next usage. This part is intentionally racy, but it's ok,
> > >   * as memory.low is a best-effort mechanism.
> > Although it's a different issue but since this updates the docs I'm
> > mentioning it -- we treat memory.min the same, i.e. it's subject to the
> > same race, however, it's not meant to be best effort. I didn't look into
> > outcomes of potential misaccounting but the comment seems to miss impact
> > on memory.min protection.
> 
> Yeah I think we can delete that bit.
Erm, which part?
Make the racy behavior undocumented or that it applies both memory.low
and memory.min?

> I believe we cleared this up in the parallel thread, but just in case:
> reclaim can happen due to a memory.max set lower in the
> tree. memory.low propagation is always relative from the reclaim
> scope, not the system-wide root cgroup.
Clear now.

Michal
Johannes Weiner Feb. 26, 2020, 7:40 p.m. UTC | #5
On Wed, Feb 26, 2020 at 05:46:32PM +0100, Michal Koutný wrote:
> On Tue, Feb 25, 2020 at 01:40:14PM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Hm, this example doesn't change with my patch because there is no
> > "floating" protection that gets distributed among the siblings.
> Maybe it had changed even earlier and the example obsoleted.
> 
> > In my testing with the above parameters, the equilibrium still comes
> > out to roughly this distribution.
> I'm attaching my test (10-times smaller) and I'm getting these results:
> 
> > /sys/fs/cgroup/test.slice/memory.current:838750208
> > /sys/fs/cgroup/test.slice/pressure.service/memory.current:616972288
> > /sys/fs/cgroup/test.slice/test-A.slice/memory.current:221782016
> > /sys/fs/cgroup/test.slice/test-A.slice/B.service/memory.current:123428864
> > /sys/fs/cgroup/test.slice/test-A.slice/C.service/memory.current:93495296
> > /sys/fs/cgroup/test.slice/test-A.slice/D.service/memory.current:4702208
> > /sys/fs/cgroup/test.slice/test-A.slice/E.service/memory.current:155648
> 
> (I'm running that on 5.6.0-rc2 + first two patches of your series.)
> 
> That's IMO closer to the my simulation (1.16:0.84)
> than the example prediction (1.3:0.6)

I think you're correct about the moving points of equilibrium. I'm
going to experiment more with your script. I had written it off as
noise from LRU rotations, reclaim concurrency etc. but your script
shows that these points do shift around as the input parameters
change. This is useful, thanks.

But AFAICS, my patches here do not change or introduce this behavior
so it's a longer-standing issue.

> > It's just to illustrate the pressure weight, not to reflect each
> > factor that can influence the equilibrium.
> But it's good to have some idea about the equilibrium when configuring
> the values. 

Agreed.

> > > > @@ -6272,12 +6262,63 @@ struct cgroup_subsys memory_cgrp_subsys = {
> > > >   * for next usage. This part is intentionally racy, but it's ok,
> > > >   * as memory.low is a best-effort mechanism.
> > > Although it's a different issue but since this updates the docs I'm
> > > mentioning it -- we treat memory.min the same, i.e. it's subject to the
> > > same race, however, it's not meant to be best effort. I didn't look into
> > > outcomes of potential misaccounting but the comment seems to miss impact
> > > on memory.min protection.
> > 
> > Yeah I think we can delete that bit.
> Erm, which part?
> Make the racy behavior undocumented or that it applies both memory.low
> and memory.min?

I'm honestly not sure what it's trying to say. Reclaim and charging
can happen concurrently, so the protection enforcement, whether it's
min or low, has always been racy. Even OOM itself is racy.

Caching the parental value while iterating over a group of siblings
shouldn't fundamentally alter that outcome. We do enough priority
iterations and reclaim loops from the allocator that we shouldn't see
premature OOMs or apply totally incorrect balances because of that.

So IMO the statement that "this is racy, but low is best-effort
anyway, so it's okay" is misleading. I think more accurate would be to
say that reclaim is fundamentally racy, so a bit of additional noise
here doesn't matter.

Either way, I don't find this paragraph all that useful. If you think
it's informative, could you please let me know which important aspect
it communicates? Otherwise, I'm inclined to delete it.
diff mbox series

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 874a0b00f89b..9c771c4d6339 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6204,65 +6204,55 @@  struct cgroup_subsys memory_cgrp_subsys = {
 	.early_init = 0,
 };
 
-/**
- * mem_cgroup_protected - check if memory consumption is in the normal range
- * @root: the top ancestor of the sub-tree being checked
- * @memcg: the memory cgroup to check
- *
- * WARNING: This function is not stateless! It can only be used as part
- *          of a top-down tree iteration, not for isolated queries.
- *
- * Returns one of the following:
- *   MEMCG_PROT_NONE: cgroup memory is not protected
- *   MEMCG_PROT_LOW: cgroup memory is protected as long there is
- *     an unprotected supply of reclaimable memory from other cgroups.
- *   MEMCG_PROT_MIN: cgroup memory is protected
- *
- * @root is exclusive; it is never protected when looked at directly
+/*
+ * This function calculates an individual cgroup's effective
+ * protection which is derived from its own memory.min/low, its
+ * parent's and siblings' settings, as well as the actual memory
+ * distribution in the tree.
  *
- * To provide a proper hierarchical behavior, effective memory.min/low values
- * are used. Below is the description of how effective memory.low is calculated.
- * Effective memory.min values is calculated in the same way.
+ * The following rules apply to the effective protection values:
  *
- * Effective memory.low is always equal or less than the original memory.low.
- * If there is no memory.low overcommittment (which is always true for
- * top-level memory cgroups), these two values are equal.
- * Otherwise, it's a part of parent's effective memory.low,
- * calculated as a cgroup's memory.low usage divided by sum of sibling's
- * memory.low usages, where memory.low usage is the size of actually
- * protected memory.
+ * 1. At the first level of reclaim, effective protection is equal to
+ *    the declared protection in memory.min and memory.low.
  *
- *                                             low_usage
- * elow = min( memory.low, parent->elow * ------------------ ),
- *                                        siblings_low_usage
+ * 2. To enable safe delegation of the protection configuration, at
+ *    subsequent levels the effective protection is capped to the
+ *    parent's effective protection.
  *
- * low_usage = min(memory.low, memory.current)
+ * 3. To make complex and dynamic subtrees easier to configure, the
+ *    user is allowed to overcommit the declared protection at a given
+ *    level. If that is the case, the parent's effective protection is
+ *    distributed to the children in proportion to how much protection
+ *    they have declared and how much of it they are utilizing.
  *
+ *    This makes distribution proportional, but also work-conserving:
+ *    if one cgroup claims much more protection than it uses memory,
+ *    the unused remainder is available to its siblings.
  *
- * Such definition of the effective memory.low provides the expected
- * hierarchical behavior: parent's memory.low value is limiting
- * children, unprotected memory is reclaimed first and cgroups,
- * which are not using their guarantee do not affect actual memory
- * distribution.
+ *    Consider the following example tree:
  *
- * For example, if there are memcgs A, A/B, A/C, A/D and A/E:
+ *        A      A/memory.low = 2G, A/memory.current = 6G
+ *       //\\
+ *      BC  DE   B/memory.low = 3G  B/memory.current = 2G
+ *               C/memory.low = 1G  C/memory.current = 2G
+ *               D/memory.low = 0   D/memory.current = 2G
+ *               E/memory.low = 10G E/memory.current = 0
  *
- *     A      A/memory.low = 2G, A/memory.current = 6G
- *    //\\
- *   BC  DE   B/memory.low = 3G  B/memory.current = 2G
- *            C/memory.low = 1G  C/memory.current = 2G
- *            D/memory.low = 0   D/memory.current = 2G
- *            E/memory.low = 10G E/memory.current = 0
+ *    and memory pressure is applied, the following memory
+ *    distribution is expected (approximately*):
  *
- * and the memory pressure is applied, the following memory distribution
- * is expected (approximately):
+ *      A/memory.current = 2G
+ *      B/memory.current = 1.3G
+ *      C/memory.current = 0.6G
+ *      D/memory.current = 0
+ *      E/memory.current = 0
  *
- *     A/memory.current = 2G
+ *    *assuming equal allocation rate and reclaimability
  *
- *     B/memory.current = 1.3G
- *     C/memory.current = 0.6G
- *     D/memory.current = 0
- *     E/memory.current = 0
+ * 4. Conversely, when the declared protection is undercommitted at a
+ *    given level, the distribution of the larger parental protection
+ *    budget is NOT proportional. A cgroup's protection from a sibling
+ *    is capped to its own memory.min/low setting.
  *
  * These calculations require constant tracking of the actual low usages
  * (see propagate_protected_usage()), as well as recursive calculation of
@@ -6272,12 +6262,63 @@  struct cgroup_subsys memory_cgrp_subsys = {
  * for next usage. This part is intentionally racy, but it's ok,
  * as memory.low is a best-effort mechanism.
  */
+static unsigned long effective_protection(unsigned long usage,
+					  unsigned long setting,
+					  unsigned long parent_effective,
+					  unsigned long siblings_protected)
+{
+	unsigned long protected;
+
+	protected = min(usage, setting);
+	/*
+	 * If all cgroups at this level combined claim and use more
+	 * protection then what the parent affords them, distribute
+	 * shares in proportion to utilization.
+	 *
+	 * We are using actual utilization rather than the statically
+	 * claimed protection in order to be work-conserving: claimed
+	 * but unused protection is available to siblings that would
+	 * otherwise get a smaller chunk than what they claimed.
+	 */
+	if (siblings_protected > parent_effective)
+		return protected * parent_effective / siblings_protected;
+
+	/*
+	 * Ok, utilized protection of all children is within what the
+	 * parent affords them, so we know whatever this child claims
+	 * and utilizes is effectively protected.
+	 *
+	 * If there is unprotected usage beyond this value, reclaim
+	 * will apply pressure in proportion to that amount.
+	 *
+	 * If there is unutilized protection, the cgroup will be fully
+	 * shielded from reclaim, but we do return a smaller value for
+	 * protection than what the group could enjoy in theory. This
+	 * is okay. With the overcommit distribution above, effective
+	 * protection is always dependent on how memory is actually
+	 * consumed among the siblings anyway.
+	 */
+	return protected;
+}
+
+/**
+ * mem_cgroup_protected - check if memory consumption is in the normal range
+ * @root: the top ancestor of the sub-tree being checked
+ * @memcg: the memory cgroup to check
+ *
+ * WARNING: This function is not stateless! It can only be used as part
+ *          of a top-down tree iteration, not for isolated queries.
+ *
+ * Returns one of the following:
+ *   MEMCG_PROT_NONE: cgroup memory is not protected
+ *   MEMCG_PROT_LOW: cgroup memory is protected as long there is
+ *     an unprotected supply of reclaimable memory from other cgroups.
+ *   MEMCG_PROT_MIN: cgroup memory is protected
+ */
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 						struct mem_cgroup *memcg)
 {
 	struct mem_cgroup *parent;
-	unsigned long emin, parent_emin;
-	unsigned long elow, parent_elow;
 	unsigned long usage;
 
 	if (mem_cgroup_disabled())
@@ -6292,52 +6333,29 @@  enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 	if (!usage)
 		return MEMCG_PROT_NONE;
 
-	emin = memcg->memory.min;
-	elow = memcg->memory.low;
-
 	parent = parent_mem_cgroup(memcg);
 	/* No parent means a non-hierarchical mode on v1 memcg */
 	if (!parent)
 		return MEMCG_PROT_NONE;
 
-	if (parent == root)
-		goto exit;
-
-	parent_emin = READ_ONCE(parent->memory.emin);
-	emin = min(emin, parent_emin);
-	if (emin && parent_emin) {
-		unsigned long min_usage, siblings_min_usage;
-
-		min_usage = min(usage, memcg->memory.min);
-		siblings_min_usage = atomic_long_read(
-			&parent->memory.children_min_usage);
-
-		if (min_usage && siblings_min_usage)
-			emin = min(emin, parent_emin * min_usage /
-				   siblings_min_usage);
+	if (parent == root) {
+		memcg->memory.emin = memcg->memory.min;
+		memcg->memory.elow = memcg->memory.low;
+		goto out;
 	}
 
-	parent_elow = READ_ONCE(parent->memory.elow);
-	elow = min(elow, parent_elow);
-	if (elow && parent_elow) {
-		unsigned long low_usage, siblings_low_usage;
-
-		low_usage = min(usage, memcg->memory.low);
-		siblings_low_usage = atomic_long_read(
-			&parent->memory.children_low_usage);
+	memcg->memory.emin = effective_protection(usage,
+			memcg->memory.min, READ_ONCE(parent->memory.emin),
+			atomic_long_read(&parent->memory.children_min_usage));
 
-		if (low_usage && siblings_low_usage)
-			elow = min(elow, parent_elow * low_usage /
-				   siblings_low_usage);
-	}
-
-exit:
-	memcg->memory.emin = emin;
-	memcg->memory.elow = elow;
+	memcg->memory.elow = effective_protection(usage,
+			memcg->memory.low, READ_ONCE(parent->memory.elow),
+			atomic_long_read(&parent->memory.children_low_usage));
 
-	if (usage <= emin)
+out:
+	if (usage <= memcg->memory.emin)
 		return MEMCG_PROT_MIN;
-	else if (usage <= elow)
+	else if (usage <= memcg->memory.elow)
 		return MEMCG_PROT_LOW;
 	else
 		return MEMCG_PROT_NONE;