From patchwork Fri Dec 13 19:21:58 2019
X-Patchwork-Submitter: Johannes Weiner <hannes@cmpxchg.org>
X-Patchwork-Id: 11291519
From: Johannes Weiner <hannes@cmpxchg.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>, Roman Gushchin <guro@fb.com>,
 Tejun Heo <tj@kernel.org>, linux-mm@kvack.org, cgroups@vger.kernel.org,
 linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 3/3] mm: memcontrol: recursive memory.low protection
Date: Fri, 13 Dec 2019 14:21:58 -0500
Message-Id: <20191213192158.188939-4-hannes@cmpxchg.org>
X-Mailer: git-send-email 2.24.0
In-Reply-To: <20191213192158.188939-1-hannes@cmpxchg.org>
References: <20191213192158.188939-1-hannes@cmpxchg.org>

Right now, the effective protection of any given cgroup is capped by
its own explicit memory.low setting, regardless of what the parent
says. The reasons for this are mostly historical and ease of
implementation: to make delegation of memory.low safe, effective
protection is the min() of all memory.low up the tree.

Unfortunately, this limitation makes it impossible to protect an
entire subtree from another without forcing the user to make explicit
protection allocations all the way to the leaf cgroups - something
that is highly undesirable in real-life scenarios.

Consider memory in a data center node. At the cgroup top level, we
have a distinction between system management software and the actual
workload the system is executing. Both branches are further
subdivided into individual services, job components etc.

We want to protect the workload as a whole from the system management
software, but we don't want to protect individual workload components
from each other! Their memory demand can vary over time, and we want
the VM to simply cache the hottest data within the workload subtree.
Yet, the current memory.low limitations force us to hard-allocate
protection to each workload cgroup in order to get any protection
from system management software. This is basically useless in
practice.

This patch adds the concept of recursive protection to the memory.low
configurable, while retaining the ability to assign fixed protection
in leaf groups as well. That means that if protection is explicitly
allocated among siblings, those configured weights are being followed
during page reclaim just like they are now.

However, if the memory.low set at a higher level is not fully claimed
by the children in that subtree, that "floating" protection is
applied to each cgroup in the tree in proportion to its size. Since
reclaim pressure is applied in proportion to size as well, each child
in that tree gets the same boost, and the effect is neutral among
siblings - with respect to each other, they behave as if no memory
control was enabled at all, and the VM simply balances the memory
demands optimally within the subtree. But collectively those cgroups
enjoy a boost over the cgroups in neighboring trees.

This allows us to recursively protect one subtree (workload) from
another (system management), but let subgroups compete freely among
each other without having to assign fixed weights to each leaf.

This floating protection composes with fixed protection. Consider the
following example tree:

              A             A: low = 2G
             / \            A1: low = 1G
            A1  A2          A2: low = 0G

As outside pressure is applied to this tree, A1 will enjoy a fixed
protection from A2 of 1G, but the remaining, unclaimed 1G from A is
split evenly among A1 and A2. Assuming equal memory demand in both,
memory usage will converge on A1 using 1.5G and A2 using 0.5G.

There is a slight risk of regressing theoretical setups where the
top-level cgroups don't know about the true budgeting and set bogusly
high "bypass" values that are meaningfully allocated down the tree.
Such setups would rely on unclaimed protection to be discarded, and
distributing it would change the intended behavior.

Be safe and hide the new behavior behind a mount option,
'memory_recursiveprot'.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
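For illustration only (not part of the patch): a minimal userspace
model of the arithmetic that effective_protection() performs below,
applied to the A/A1/A2 example above. The helper names and byte-based
units are made up for the example; the kernel operates on
page_counter page counts.

  #include <stdio.h>

  static unsigned long min_ul(unsigned long a, unsigned long b)
  {
  	return a < b ? a : b;
  }

  static unsigned long effective_low(unsigned long usage,
  				   unsigned long parent_usage,
  				   unsigned long setting,
  				   unsigned long parent_effective,
  				   unsigned long siblings_protected)
  {
  	/* Fixed share: explicitly configured and actually used. */
  	unsigned long protected = min_ul(usage, setting);
  	/* Protection the children collectively leave unclaimed... */
  	unsigned long unclaimed = parent_effective - siblings_protected;

  	/* ...is handed out in proportion to each child's unprotected usage. */
  	unclaimed *= usage - protected;
  	unclaimed /= parent_usage - siblings_protected;

  	return protected + unclaimed;
  }

  int main(void)
  {
  	const unsigned long G = 1024UL * 1024 * 1024;
  	/* A: low=2G; children A1 (low=1G) and A2 (low=0G), usage 1.5G/0.5G */
  	unsigned long a1 = effective_low(3 * G / 2, 2 * G, 1 * G, 2 * G, 1 * G);
  	unsigned long a2 = effective_low(G / 2, 2 * G, 0, 2 * G, 1 * G);

  	printf("A1 elow = %.2fG, A2 elow = %.2fG\n",
  	       (double)a1 / G, (double)a2 / G);
  	return 0;
  }

This prints 1.50G and 0.50G: at the converged usage split, each
child's effective low equals its usage, so the siblings behave as if
unconstrained relative to each other while the pair as a whole stays
protected against neighboring subtrees.
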
 Documentation/admin-guide/cgroup-v2.rst | 11 +++++
 include/linux/cgroup-defs.h             |  5 ++
 kernel/cgroup/cgroup.c                  | 17 ++++++-
 mm/memcontrol.c                         | 62 +++++++++++++++++++++++--
 4 files changed, 90 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 0636bcb60b5a..e569d83621da 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -186,6 +186,17 @@ cgroup v2 currently supports the following mount options.
 	modified through remount from the init namespace. The mount
 	option is ignored on non-init namespace mounts.
 
+  memory_recursiveprot
+
+	Recursively apply memory.min and memory.low protection to
+	entire subtrees, without requiring explicit downward
+	propagation into leaf cgroups. This allows protecting entire
+	subtrees from one another, while retaining free competition
+	within those subtrees. This should have been the default
+	behavior but is a mount-option to avoid regressing setups
+	relying on the original semantics (e.g. specifying bogusly
+	high 'bypass' protection values at higher tree levels).
+
 
 Organizing Processes and Threads
 --------------------------------
 
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 63097cb243cb..e1fafed22db1 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -94,6 +94,11 @@ enum {
 	 * Enable legacy local memory.events.
 	 */
 	CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 5),
+
+	/*
+	 * Enable recursive subtree protection
+	 */
+	CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 6),
 };
 
 /* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 735af8f15f95..a2f8d2ab8dec 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1813,12 +1813,14 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
 enum cgroup2_param {
 	Opt_nsdelegate,
 	Opt_memory_localevents,
+	Opt_memory_recursiveprot,
 	nr__cgroup2_params
 };
 
 static const struct fs_parameter_spec cgroup2_param_specs[] = {
 	fsparam_flag("nsdelegate",		Opt_nsdelegate),
 	fsparam_flag("memory_localevents",	Opt_memory_localevents),
+	fsparam_flag("memory_recursiveprot",	Opt_memory_recursiveprot),
 	{}
 };
 
@@ -1844,6 +1846,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	case Opt_memory_localevents:
 		ctx->flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
 		return 0;
+	case Opt_memory_recursiveprot:
+		ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
+		return 0;
 	}
 	return -EINVAL;
 }
@@ -1860,6 +1865,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
 			cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
 		else
 			cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_LOCAL_EVENTS;
+
+		if (root_flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
+			cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
+		else
+			cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT;
 	}
 }
 
@@ -1869,6 +1879,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root)
 		seq_puts(seq, ",nsdelegate");
 	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS)
 		seq_puts(seq, ",memory_localevents");
+	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
+		seq_puts(seq, ",memory_recursiveprot");
 	return 0;
 }
 
@@ -6364,7 +6376,10 @@ static struct kobj_attribute cgroup_delegate_attr = __ATTR_RO(delegate);
 static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
 			     char *buf)
 {
-	return snprintf(buf, PAGE_SIZE, "nsdelegate\nmemory_localevents\n");
+	return snprintf(buf, PAGE_SIZE,
+			"nsdelegate\n"
+			"memory_localevents\n"
+			"memory_recursiveprot\n");
 }
 static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ac9a3a170bec..2e352cd6c38d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6254,6 +6254,32 @@ struct cgroup_subsys memory_cgrp_subsys = {
  *    budget is NOT proportional. A cgroup's protection from a sibling
  *    is capped to its own memory.min/low setting.
  *
+ * 5. However, to allow protecting recursive subtrees from each other
+ *    without having to declare each individual cgroup's fixed share
+ *    of the ancestor's claim to protection, any unutilized -
+ *    "floating" - protection from up the tree is distributed in
+ *    proportion to each cgroup's *usage*. This makes the protection
+ *    neutral wrt sibling cgroups and lets them compete freely over
+ *    the shared parental protection budget, but it protects the
+ *    subtree as a whole from neighboring subtrees.
+ *
+ *    Consider the following example tree:
+ *
+ *                  A            A: low = 2G
+ *                 / \           B: low = 1G
+ *                B   C          C: low = 0G
+ *
+ *    As memory pressure is applied, the following memory distribution
+ *    is expected (approximately):
+ *
+ *      A/memory.current = 2G
+ *      B/memory.current = 1.5G
+ *      C/memory.current = 0.5G
+ *
+ *    Note that 4. and 5. are not in conflict: 4. is about protecting
+ *    against immediate siblings whereas 5. is about protecting against
+ *    neighboring subtrees.
+ *
  * These calculations require constant tracking of the actual low usages
  * (see propagate_protected_usage()), as well as recursive calculation of
  * effective memory.low values. But as we do call mem_cgroup_protected()
@@ -6263,11 +6289,13 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * as memory.low is a best-effort mechanism.
  */
 static unsigned long effective_protection(unsigned long usage,
+					  unsigned long parent_usage,
 					  unsigned long setting,
 					  unsigned long parent_effective,
 					  unsigned long siblings_protected)
 {
 	unsigned long protected;
+	unsigned long ep;
 
 	protected = min(usage, setting);
 	/*
@@ -6298,7 +6326,31 @@ static unsigned long effective_protection(unsigned long usage,
 	 * protection is always dependent on how memory is actually
 	 * consumed among the siblings anyway.
 	 */
-	return protected;
+	ep = protected;
+
+	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT) {
+		unsigned long unclaimed;
+		/*
+		 * If the children aren't claiming (all of) the
+		 * protection afforded to them by the parent,
+		 * distribute the remainder in proportion to the
+		 * (unprotected) size of each cgroup. That way,
+		 * cgroups that aren't explicitly prioritized wrt each
+		 * other compete freely over the allowance, but they
+		 * are collectively protected from neighboring trees.
+		 *
+		 * We're using unprotected size for the weight so that
+		 * if some cgroups DO claim explicit protection, we
+		 * don't protect the same bytes twice.
+		 */
+		unclaimed = parent_effective - siblings_protected;
+		unclaimed *= usage - protected;
+		unclaimed /= parent_usage - siblings_protected;
+
+		ep += unclaimed;
+	}
+
+	return ep;
 }
 
 /**
@@ -6318,9 +6370,9 @@ static unsigned long effective_protection(unsigned long usage,
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 						struct mem_cgroup *memcg)
 {
+	unsigned long usage, parent_usage;
 	struct mem_cgroup *parent;
 	unsigned long emin, elow;
-	unsigned long usage;
 
 	if (mem_cgroup_disabled())
 		return MEMCG_PROT_NONE;
@@ -6345,11 +6397,13 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 		goto out;
 	}
 
-	memcg->memory.emin = effective_protection(usage,
+	parent_usage = page_counter_read(&parent->memory);
+
+	memcg->memory.emin = effective_protection(usage, parent_usage,
 			memcg->memory.min, READ_ONCE(parent->memory.emin),
 			atomic_long_read(&parent->memory.children_min_usage));
 
-	memcg->memory.elow = effective_protection(usage,
+	memcg->memory.elow = effective_protection(usage, parent_usage,
 			memcg->memory.low, READ_ONCE(parent->memory.elow),
 			atomic_long_read(&parent->memory.children_low_usage));
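
Not part of the patch: a hypothetical userspace sketch of opting in
at mount time, equivalent to passing -o memory_recursiveprot to a
cgroup2 mount as enabled by the fsparam_flag() addition above. The
mountpoint path is made up for the example.

  #include <stdio.h>
  #include <sys/mount.h>

  int main(void)
  {
  	/* Hypothetical mountpoint; equivalent to:
  	 *   mount -t cgroup2 -o memory_recursiveprot none /mnt/cgroup2
  	 */
  	if (mount("none", "/mnt/cgroup2", "cgroup2", 0,
  		  "memory_recursiveprot")) {
  		perror("mount");
  		return 1;
  	}
  	/* /proc/mounts now lists memory_recursiveprot for this mount,
  	 * and /sys/kernel/cgroup/features advertises the feature. */
  	return 0;
  }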