Message ID | 20190615111704.63901-1-xlpang@linux.alibaba.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | memcg: Ignore unprotected parent in mem_cgroup_protected() | expand |
Hi Xunlei, Xunlei Pang writes: >Currently memory.min|low implementation requires the whole >hierarchy has the settings, otherwise the protection will >be broken. > >Our hierarchy is kind of like(memory.min value in brackets), > > root > | > docker(0) > / \ > c1(max) c2(0) > >Note that "docker" doesn't set memory.min. When kswapd runs, >mem_cgroup_protected() returns "0" emin for "c1" due to "0" >@parent_emin of "docker", as a result "c1" gets reclaimed. > >But it's hard to maintain parent's "memory.min" when there're >uncertain protected children because only some important types >of containers need the protection. Further, control tasks >belonging to parent constantly reproduce trivial memory which >should not be protected at all. It makes sense to ignore >unprotected parent in this scenario to achieve the flexibility. I'm really confused by this, why don't you just set memory.{min,low} in the docker cgroup and only propagate it to the children that want it? If you only want some children to have the protection, only request it in those children, or create an additional intermediate layer of the cgroup hierarchy with protections further limited if you don't trust the task to request the right amount. Breaking the requirement for hierarchical propagation of protections seems like a really questionable API change, not least because it makes it harder to set systemwide policies about the constraints of protections within a subtree.
Hi Chirs, On 2019/6/16 AM 12:08, Chris Down wrote: > Hi Xunlei, > > Xunlei Pang writes: >> Currently memory.min|low implementation requires the whole >> hierarchy has the settings, otherwise the protection will >> be broken. >> >> Our hierarchy is kind of like(memory.min value in brackets), >> >> root >> | >> docker(0) >> / \ >> c1(max) c2(0) >> >> Note that "docker" doesn't set memory.min. When kswapd runs, >> mem_cgroup_protected() returns "0" emin for "c1" due to "0" >> @parent_emin of "docker", as a result "c1" gets reclaimed. >> >> But it's hard to maintain parent's "memory.min" when there're >> uncertain protected children because only some important types >> of containers need the protection. Further, control tasks >> belonging to parent constantly reproduce trivial memory which >> should not be protected at all. It makes sense to ignore >> unprotected parent in this scenario to achieve the flexibility. > > I'm really confused by this, why don't you just set memory.{min,low} in > the docker cgroup and only propagate it to the children that want it? > > If you only want some children to have the protection, only request it > in those children, or create an additional intermediate layer of the > cgroup hierarchy with protections further limited if you don't trust the > task to request the right amount. > > Breaking the requirement for hierarchical propagation of protections > seems like a really questionable API change, not least because it makes > it harder to set systemwide policies about the constraints of > protections within a subtree. docker and various types(different memory capacity) of containers are managed by k8s, it's a burden for k8s to maintain those dynamic figures, simply set "max" to key containers is always welcome. Set "max" to docker also protects docker cgroup memory(as docker itself has tasks) unnecessarily. This patch doesn't take effect on any intermediate layer with positive memory.min set, it requires all the ancestors having 0 memory.min to work. Nothing special change, but more flexible to business deployment...
Hi Xunlei, Xunlei Pang writes: >docker and various types(different memory capacity) of containers >are managed by k8s, it's a burden for k8s to maintain those dynamic >figures, simply set "max" to key containers is always welcome. Right, setting "max" is generally a fine way of going about it. >Set "max" to docker also protects docker cgroup memory(as docker >itself has tasks) unnecessarily. That's not correct -- leaf memcgs have to _explicitly_ request memory protection. From the documentation: memory.low [...] Best-effort memory protection. If the memory usages of a cgroup and all its ancestors are below their low boundaries, the cgroup's memory won't be reclaimed unless memory can be reclaimed from unprotected cgroups. Note the part that the cgroup itself also must be within its low boundary, which is not implied simply by having ancestors that would permit propagation of protections. In this case, Docker just shouldn't request it for those Docker-related tasks, and they won't get any. That seems a lot simpler and more intuitive than special casing "0" in ancestors. >This patch doesn't take effect on any intermediate layer with >positive memory.min set, it requires all the ancestors having >0 memory.min to work. > >Nothing special change, but more flexible to business deployment... Not so, this change is extremely "special". It violates the basic expectation that 0 means no possibility of propagation of protection, and I still don't see a compelling argument why Docker can't just set "max" in the intermediate cgroup and not accept any protection in leaf memcgs that it doesn't want protection for.
Hi Chris, On 2019/6/16 PM 6:37, Chris Down wrote: > Hi Xunlei, > > Xunlei Pang writes: >> docker and various types(different memory capacity) of containers >> are managed by k8s, it's a burden for k8s to maintain those dynamic >> figures, simply set "max" to key containers is always welcome. > > Right, setting "max" is generally a fine way of going about it. > >> Set "max" to docker also protects docker cgroup memory(as docker >> itself has tasks) unnecessarily. > > That's not correct -- leaf memcgs have to _explicitly_ request memory > protection. From the documentation: > > memory.low > > [...] > > Best-effort memory protection. If the memory usages of a > cgroup and all its ancestors are below their low boundaries, > the cgroup's memory won't be reclaimed unless memory can be > reclaimed from unprotected cgroups. > > Note the part that the cgroup itself also must be within its low > boundary, which is not implied simply by having ancestors that would > permit propagation of protections. > > In this case, Docker just shouldn't request it for those Docker-related > tasks, and they won't get any. That seems a lot simpler and more > intuitive than special casing "0" in ancestors. > >> This patch doesn't take effect on any intermediate layer with >> positive memory.min set, it requires all the ancestors having >> 0 memory.min to work. >> >> Nothing special change, but more flexible to business deployment... > > Not so, this change is extremely "special". It violates the basic > expectation that 0 means no possibility of propagation of protection, > and I still don't see a compelling argument why Docker can't just set > "max" in the intermediate cgroup and not accept any protection in leaf > memcgs that it doesn't want protection for. I got the reason, I'm using cgroup v1(with memory.min backport) which permits tasks existent in "docker" cgroup.procs. For cgroup v2, it's not a problem. Thanks, Xunlei
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h index bab7e57f659b..aed7ed28b458 100644 --- a/include/linux/page_counter.h +++ b/include/linux/page_counter.h @@ -55,6 +55,8 @@ bool page_counter_try_charge(struct page_counter *counter, void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages); void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages); void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages); +bool page_counter_has_min(struct page_counter *counter); +bool page_counter_has_low(struct page_counter *counter); int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages); int page_counter_memparse(const char *buf, const char *max, unsigned long *nr_pages); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ca0bc6e6be13..f1dfa651f55d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5917,6 +5917,8 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, if (parent == root) goto exit; + if (!page_counter_has_min(&parent->memory)) + goto elow; parent_emin = READ_ONCE(parent->memory.emin); emin = min(emin, parent_emin); if (emin && parent_emin) { @@ -5931,6 +5933,9 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, siblings_min_usage); } +elow: + if (!page_counter_has_low(&parent->memory)) + goto exit; parent_elow = READ_ONCE(parent->memory.elow); elow = min(elow, parent_elow); if (elow && parent_elow) { diff --git a/mm/page_counter.c b/mm/page_counter.c index de31470655f6..8c668eae2af5 100644 --- a/mm/page_counter.c +++ b/mm/page_counter.c @@ -202,6 +202,30 @@ int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages) } } +bool page_counter_has_min(struct page_counter *counter) +{ + struct page_counter *c; + + for (c = counter; c; c = c->parent) { + if (counter->min) + return true; + } + + return false; +} + +bool page_counter_has_low(struct page_counter *counter) +{ + struct page_counter *c; + + for (c = counter; c; c = c->parent) { + if (counter->low) + return true; + } + + return false; +} + /** * page_counter_set_min - set the amount of protected memory * @counter: counter
Currently memory.min|low implementation requires the whole hierarchy has the settings, otherwise the protection will be broken. Our hierarchy is kind of like(memory.min value in brackets), root | docker(0) / \ c1(max) c2(0) Note that "docker" doesn't set memory.min. When kswapd runs, mem_cgroup_protected() returns "0" emin for "c1" due to "0" @parent_emin of "docker", as a result "c1" gets reclaimed. But it's hard to maintain parent's "memory.min" when there're uncertain protected children because only some important types of containers need the protection. Further, control tasks belonging to parent constantly reproduce trivial memory which should not be protected at all. It makes sense to ignore unprotected parent in this scenario to achieve the flexibility. In order not to break previous hierarchical behaviour, only ignore the parent when there's no protected ancestor upwards the hierarchy. Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com> --- include/linux/page_counter.h | 2 ++ mm/memcontrol.c | 5 +++++ mm/page_counter.c | 24 ++++++++++++++++++++++++ 3 files changed, 31 insertions(+)