From patchwork Fri Apr 18 19:59:56 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shakeel Butt
X-Patchwork-Id: 14057658
From: Shakeel Butt
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
 Yosry Ahmed, Tejun Heo, Michal Koutný, linux-mm@kvack.org,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
 Meta kernel team
Subject: [PATCH] memcg: introduce non-blocking limit setting interfaces
Date: Fri, 18 Apr 2025 12:59:56 -0700
Message-ID: <20250418195956.64824-1-shakeel.butt@linux.dev>

Setting the max and high limits can trigger synchronous reclaim and/or
oom-kill if the usage is higher than the given limit. This behavior is
fine for newly created cgroups but it can cause issues for the node
controller while setting limits for existing cgroups.

In our production multi-tenant and overcommitted environment, we are
seeing priority inversion when the node controller dynamically adjusts
the limits of running jobs of different priorities. Based on the system
situation, the node controller may reduce the limits of lower priority
jobs and increase the limits of higher priority jobs.
However, we are seeing the node controller getting stuck for long
periods of time while reclaiming from lower priority jobs while setting
their limits, and it also spends a lot of its own CPU.

One of the workarounds we are trying is to fork a new process which sets
the limit of the lower priority job along with setting an alarm to get
itself killed if it gets stuck in the reclaim for the lower priority
job. However, we are finding it very unreliable and costly. Either we
need a good enough time buffer for the alarm to be delivered after
setting the limit and potentially spend a lot of CPU in the reclaim, or
be unreliable in setting the limit for much shorter but cheaper (less
reclaim) alarms.

Let's introduce new limit setting interfaces which do not trigger
reclaim and/or oom-kill and let the processes in the target cgroup
trigger reclaim and/or throttling and/or oom-kill in their next charge
request. This will make the node controller in multi-tenant
overcommitted environments much more reliable.

Signed-off-by: Shakeel Butt
---
 Documentation/admin-guide/cgroup-v2.rst | 16 +++++++++
 mm/memcontrol.c                         | 46 +++++++++++++++++++++++++
 2 files changed, 62 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8fb14ffab7d1..7b459c821afa 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1299,6 +1299,14 @@ PAGE_SIZE multiple when read back.
 	monitors the limited cgroup to alleviate heavy reclaim
 	pressure.
 
+  memory.high.nonblock
+	This is the same limit as memory.high but has different
+	behaviour for the writer of this interface. The program setting
+	the limit will not trigger reclaim synchronously if the
+	usage is higher than the limit and lets the processes in the
+	target cgroup trigger reclaim and/or get throttled on
+	hitting the high limit.
+
   memory.max
 	A read-write single value file which exists on non-root
 	cgroups. The default is "max".
@@ -1316,6 +1324,14 @@ PAGE_SIZE multiple when read back.
 	Caller could retry them differently, return into userspace
 	as -ENOMEM or silently ignore in cases like disk readahead.
 
+  memory.max.nonblock
+	This is the same limit as memory.max but has different
+	behaviour for the writer of this interface. The program setting
+	the limit will not trigger reclaim synchronously and/or trigger
+	the oom-kill if the usage is higher than the limit and lets the
+	processes in the target cgroup trigger reclaim and/or get
+	oom-killed on hitting their max limit.
+
   memory.reclaim
 	A write-only nested-keyed file which exists for all cgroups.
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5e2ea8b8a898..6ad1464b621a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4279,6 +4279,23 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static ssize_t memory_high_nonblock_write(struct kernfs_open_file *of,
+					  char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long high;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "max", &high);
+	if (err)
+		return err;
+
+	page_counter_set_high(&memcg->memory, high);
+	memcg_wb_domain_size_changed(memcg);
+	return nbytes;
+}
+
 static int memory_max_show(struct seq_file *m, void *v)
 {
 	return seq_puts_memcg_tunable(m,
@@ -4333,6 +4350,23 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static ssize_t memory_max_nonblock_write(struct kernfs_open_file *of,
+					 char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long max;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "max", &max);
+	if (err)
+		return err;
+
+	xchg(&memcg->memory.max, max);
+	memcg_wb_domain_size_changed(memcg);
+	return nbytes;
+}
+
 /*
  * Note: don't forget to update the 'samples/cgroup/memcg_event_listener'
  * if any new events become available.
@@ -4557,12 +4591,24 @@ static struct cftype memory_files[] = {
 		.seq_show = memory_high_show,
 		.write = memory_high_write,
 	},
+	{
+		.name = "high.nonblock",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_high_show,
+		.write = memory_high_nonblock_write,
+	},
 	{
 		.name = "max",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = memory_max_show,
 		.write = memory_max_write,
 	},
+	{
+		.name = "max.nonblock",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_max_show,
+		.write = memory_max_nonblock_write,
+	},
 	{
 		.name = "events",
 		.flags = CFTYPE_NOT_ON_ROOT,
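
For illustration only, here is a minimal user-space sketch of how a node
controller might lower a job's limit through the proposed
memory.max.nonblock file, assuming a kernel with this patch applied. The
helper names write_limit() and set_max_nonblock() and the example cgroup
directory are hypothetical and not part of the patch; only the file name
memory.max.nonblock comes from it.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Write a limit string (a byte count or "max") to a cgroup control file.
 * Returns 0 on success or a negative errno value on failure.
 * Hypothetical helper, not part of the patch.
 */
static int write_limit(const char *path, const char *value)
{
	size_t len = strlen(value);
	ssize_t ret;
	int fd;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	ret = write(fd, value, len);
	if (ret < 0) {
		int err = -errno;

		close(fd);
		return err;
	}
	close(fd);

	/* Treat a short write as an I/O error. */
	return (size_t)ret == len ? 0 : -EIO;
}

/*
 * Set the max limit of the cgroup at cgroup_dir (hypothetical example:
 * "/sys/fs/cgroup/low-prio-job") via memory.max.nonblock. With the patch
 * applied, the writer is not expected to reclaim or oom-kill; tasks in
 * the target cgroup hit the new limit on their next charge.
 */
static int set_max_nonblock(const char *cgroup_dir, unsigned long long bytes)
{
	char path[4096];
	char value[32];
	int n;

	n = snprintf(path, sizeof(path), "%s/memory.max.nonblock", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -ENAMETOOLONG;

	snprintf(value, sizeof(value), "%llu", bytes);
	return write_limit(path, value);
}
```

A controller would call set_max_nonblock() for each lower priority job it
wants to shrink, and could use the same write_limit() helper with
memory.high.nonblock for the high limit. On a kernel without the patch the
open() fails and the helper returns -ENOENT, which the caller can use to
fall back to the blocking files.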