From patchwork Wed Jan 29 19:12:04 2025
X-Patchwork-Submitter: Waiman Long
X-Patchwork-Id: 13954045
From: Waiman Long
To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
 Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
 linux-mm@kvack.org, linux-doc@vger.kernel.org, Peter Hunt, Waiman Long
Subject: [RFC PATCH] mm, memcg: introduce memory.high.throttle
Date: Wed, 29 Jan 2025 14:12:04 -0500
Message-ID: <20250129191204.368199-1-longman@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
reclaim over memory.high"), the amount of allocator throttling has
increased substantially. As a result, it can be difficult to get a
misbehaving application that consumes an increasing amount of memory
OOM-killed when memory.high is set.
Instead, the application may just crawl along, holding close to the
memory.high limit of its memory cgroup for a very long time. This is
especially true for applications that do a lot of memcg charging and
uncharging.

This behavior makes the upstream Kubernetes community hesitant to use
memory.high. Instead, they use only memory.max for memory control,
similar to what is done for cgroup v1 [1].

To allow better control of the amount of throttling, and hence of how
quickly a misbehaving task can be OOM-killed, add a new single-value
memory.high.throttle control file. The allowable range is 0-32. The
default value of 0 means maximum throttling, as before. A non-zero
value N right-shifts the computed throttling delay by N bits, i.e.
reduces it by a factor of 2^N, and makes OOM kills more likely.

System administrators can use this parameter to decide how readily OOM
kills should happen for applications that tend to consume a lot of
memory, without having to run a special userspace memory management
tool to monitor memory consumption when memory.high is set.

Below are the test results of a simple program showing how different
values of memory.high.throttle affect its run time (in seconds) until
it gets OOM-killed. The test program allocates pages from the kernel
continuously. There is some run-to-run variation, and the results are
just one possible set of samples.

  # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \
	--wait -t timeout 300 /tmp/mmap-oom

  memory.high.throttle	service runtime (s)
  --------------------	-------------------
	    0		    120.521
	    1		    103.376
	    2		     85.881
	    3		     69.698
	    4		     42.668
	    5		     45.782
	    6		     22.179
	    7		      9.909
	    8		      5.347
	    9		      3.100
	   10		      1.757
	   11		      1.084
	   12		      0.919
	   13		      0.650
	   14		      0.650
	   15		      0.655

[1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0

Signed-off-by: Waiman Long
---
 Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++--
 include/linux/memcontrol.h              |  2 ++
 mm/memcontrol.c                         | 41 +++++++++++++++++++++++++
 3 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index cb1b4e759b7e..df9410ad8b3b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1291,8 +1291,20 @@ PAGE_SIZE multiple when read back.
 	Going over the high limit never invokes the OOM killer and
 	under extreme conditions the limit may be breached. The high
 	limit should be used in scenarios where an external process
-	monitors the limited cgroup to alleviate heavy reclaim
-	pressure.
+	monitors the limited cgroup to alleviate heavy reclaim pressure
+	unless a high enough value is set in "memory.high.throttle".
+
+  memory.high.throttle
+	A read-write single value file which exists on non-root
+	cgroups. The default is 0.
+
+	Memory usage throttle control. This value controls the amount
+	of throttling that will be applied when memory consumption
+	exceeds the "memory.high" limit. The larger the value is,
+	the smaller the amount of throttling will be and the easier an
+	offending application may get OOM killed.
+
+	The valid range of this control file is 0-32.
 
   memory.max
 	A read-write single value file which exists on non-root
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6e74b8254d9b..b184d7b008d4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -199,6 +199,8 @@ struct mem_cgroup {
 	struct list_head swap_peaks;
 	spinlock_t peaks_lock;
 
+	int high_throttle_shift;
+
 	/* Range enforcement for interrupt charges */
 	struct work_struct high_work;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 46f8b372d212..2fa3fd99ebc9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2112,6 +2112,7 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	unsigned long nr_reclaimed;
 	unsigned int nr_pages = current->memcg_nr_pages_over_high;
 	int nr_retries = MAX_RECLAIM_RETRIES;
+	int throttle_shift;
 	struct mem_cgroup *memcg;
 	bool in_retry = false;
 
@@ -2156,6 +2157,13 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	penalty_jiffies += calculate_high_delay(memcg, nr_pages,
 						swap_find_max_overage(memcg));
 
+	/*
+	 * Reduce penalty according to the high_throttle_shift value.
+	 */
+	throttle_shift = READ_ONCE(memcg->high_throttle_shift);
+	if (throttle_shift)
+		penalty_jiffies >>= throttle_shift;
+
 	/*
 	 * Clamp the max delay per usermode return so as to still keep the
 	 * application moving forwards and also permit diagnostics, albeit
@@ -4172,6 +4180,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static u64 memory_high_throttle_read(struct cgroup_subsys_state *css,
+				     struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	return READ_ONCE(memcg->high_throttle_shift);
+}
+
+static ssize_t memory_high_throttle_write(struct kernfs_open_file *of,
+					  char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	u64 val;
+	int err;
+
+	buf = strstrip(buf);
+	err = kstrtoull(buf, 10, &val);
+	if (err)
+		return err;
+
+	if (val > 32)
+		return -EINVAL;
+
+	WRITE_ONCE(memcg->high_throttle_shift, (int)val);
+	return nbytes;
+}
+
 /*
  * Note: don't forget to update the 'samples/cgroup/memcg_event_listener'
  * if any new events become available.
@@ -4396,6 +4431,12 @@ static struct cftype memory_files[] = {
 		.seq_show = memory_high_show,
 		.write = memory_high_write,
 	},
+	{
+		.name = "high.throttle",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = memory_high_throttle_read,
+		.write = memory_high_throttle_write,
+	},
 	{
 		.name = "max",
 		.flags = CFTYPE_NOT_ON_ROOT,