From patchwork Wed Nov 30 02:03:27 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13059393 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id E48CBC433FE for ; Wed, 30 Nov 2022 02:03:36 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 4E63B6B0072; Tue, 29 Nov 2022 21:03:36 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 496736B0073; Tue, 29 Nov 2022 21:03:36 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 385486B0074; Tue, 29 Nov 2022 21:03:36 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 29EF36B0072 for ; Tue, 29 Nov 2022 21:03:36 -0500 (EST) Received: from smtpin07.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id CC4DEAAB4D for ; Wed, 30 Nov 2022 02:03:35 +0000 (UTC) X-FDA: 80188461990.07.D6E8DA3 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) by imf26.hostedemail.com (Postfix) with ESMTP id 699CD140011 for ; Wed, 30 Nov 2022 02:03:35 +0000 (UTC) Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-37360a6236fso142414977b3.12 for ; Tue, 29 Nov 2022 18:03:35 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:mime-version:date:from:to:cc:subject :date:message-id:reply-to; bh=qySU/YYSJyV1Gig5PuzcugPI3DQny2PGxjqX/sS0z2A=; b=a7ok51aioMeHNiJSu7WtcVreWg/hIHdE7BTzM7xqt0k47iaIVEw0p+PKilUz+Z5Tsl 2bQOfpVf6meh/wYFWCctnSFxSpCbqFFPHBt5Nx0JBytQZ0/IliRi6rBDoMjhvyJd7AhH ztB6iXeQE6BMroMQu8TBfxxXA2Ppm6c6oqSaNqTXvTK7OFI0mKSXpd38JrYmZBxE0Fct EHaaFE7e+qLy5rV6dHmXgMmqwzhxz5EJdyUk4lQL8I6UEpKrA0X0yTgSHiZiWUr/rorG O94Fpmf4kbvmxVvBMTIu7yKpC7llbZOk/E1EWf8UhfpN5UlttHcJCrkpJLwc8P352EeK IYcw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:mime-version:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=qySU/YYSJyV1Gig5PuzcugPI3DQny2PGxjqX/sS0z2A=; b=ixKwRKg+B1kabsBj3Goea5bB6k7o6eymSIDXWJU3oG981DV5UDABvm3je7kUjZDYJh lIKWnfZw7X1xWk6rQmtQsmHfWKxVvFrxLrNN9LZq3SYomoRjfjX94exqC71MJfYulRgq Y0Qdjb2lMOb2E1bAd8zpFNy1t7C7ZSzHKDBOJn5A1xxmjfAnKImQOzgjjSViT6upK0iP 4ufqWHvJmQsmNGGu5A8WuEzOWEMxAfxUDu+0NQXiFQ4grHGjni2TLbeg51NrYCsyoE7/ KhMHKJ3DYfOStZ5nZZSshTZ42YkN3O1PuxEy7S/wG/M3p58++jNliTnC1FYN6f4I7Bfi 5v7A== X-Gm-Message-State: ANoB5pkriNbDvSR5IGIjVtRw65MMBRaHE5WhqZWJ4yZMoSXDMzIZJrnO OjVgxysPu7Ra1yzhx8XwZGM3q4wjBnLCg0ixNw== X-Google-Smtp-Source: AA0mqf7g8lJGoWb8KEPiq3MhQbh+oBKm+lFB3ObIBN3lGaF8mBPfoAWa/O48jVf8luC/UX/oIap+IknPhLZDdG+NAg== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2d4:203:5ef5:d4df:2c34:14d1]) (user=almasrymina job=sendgmr) by 2002:a25:ba92:0:b0:6d3:66bb:786b with SMTP id s18-20020a25ba92000000b006d366bb786bmr58770981ybg.459.1669773814553; Tue, 29 Nov 2022 18:03:34 -0800 (PST) Date: Tue, 29 Nov 2022 18:03:27 -0800 Mime-Version: 1.0 X-Mailer: git-send-email 2.38.1.584.g0f3c55d4c2-goog Message-ID: <20221130020328.1009347-1-almasrymina@google.com> Subject: [RFC PATCH v2] mm: Add nodes= arg to memory.reclaim From: Mina Almasry To: Huang Ying , Yang Shi , Yosry Ahmed , Tim Chen , weixugc@google.com, shakeelb@google.com, gthelen@google.com, fvdl@google.com, Tejun Heo , Zefan Li , Johannes Weiner , Jonathan Corbet , Michal Hocko , Roman Gushchin , Muchun Song , Andrew Morton Cc: Mina Almasry , cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1669773815; a=rsa-sha256; cv=none; b=NayY0c/hiZBajjmQtwSARVGUHzzryT0YjP9wX9dSPnJ2DALgYV91JqRzbxt6iuAAE8dZgB gSiV+Ur/bRg0BwBiCr2kThBTQnHBJiAVpQyXPpQc8hPuqzB3lauv30EJDlSc6HIaRi4z6K BzUhUGKVxEqUViZHmBi/MjVNadYnKLU= ARC-Authentication-Results: i=1; imf26.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=a7ok51ai; spf=pass (imf26.hostedemail.com: domain of 39rmGYwsKCKcHSTHZYfTPUHNVVNSL.JVTSPUbe-TTRcHJR.VYN@flex--almasrymina.bounces.google.com designates 209.85.128.201 as permitted sender) smtp.mailfrom=39rmGYwsKCKcHSTHZYfTPUHNVVNSL.JVTSPUbe-TTRcHJR.VYN@flex--almasrymina.bounces.google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1669773815; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding:in-reply-to: references:dkim-signature; bh=qySU/YYSJyV1Gig5PuzcugPI3DQny2PGxjqX/sS0z2A=; b=L5jcHNIv44x0hdy3WNAokwIxrCN9zSK2FewTNGwiZYkhF8VBwQU9h4V6FaXXoWLHKBJE9J zo6eRcOYl7WQ+k/kxzBtWf6KDR9GcZdXlIv9knZaxR//CqfNFGUGz3enti9hGKL1+iegr2 mfbgm/S72vyujJrP+B++qboVyrsrmGM= X-Stat-Signature: nubafo7cthtp98qiu8e5dz8rpiu9mj7g X-Rspamd-Queue-Id: 699CD140011 X-Rspam-User: Authentication-Results: imf26.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=a7ok51ai; spf=pass (imf26.hostedemail.com: domain of 39rmGYwsKCKcHSTHZYfTPUHNVVNSL.JVTSPUbe-TTRcHJR.VYN@flex--almasrymina.bounces.google.com designates 209.85.128.201 as permitted sender) smtp.mailfrom=39rmGYwsKCKcHSTHZYfTPUHNVVNSL.JVTSPUbe-TTRcHJR.VYN@flex--almasrymina.bounces.google.com; dmarc=pass (policy=reject) header.from=google.com X-Rspamd-Server: rspam05 X-HE-Tag: 1669773815-580135 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The nodes= arg instructs the kernel to only scan the given nodes for proactive reclaim. For example use cases, consider a 2 tier memory system: nodes 0,1 -> top tier nodes 2,3 -> second tier $ echo "1m nodes=0" > memory.reclaim This instructs the kernel to attempt to reclaim 1m memory from node 0. Since node 0 is a top tier node, demotion will be attempted first. This is useful to direct proactive reclaim to specific nodes that are under pressure. $ echo "1m nodes=2,3" > memory.reclaim This instructs the kernel to attempt to reclaim 1m memory in the second tier, since this tier of memory has no demotion targets the memory will be reclaimed. $ echo "1m nodes=0,1" > memory.reclaim Instructs the kernel to reclaim memory from the top tier nodes, which can be desirable according to the userspace policy if there is pressure on the top tiers. Since these nodes have demotion targets, the kernel will attempt demotion first. Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""), the proactive reclaim interface memory.reclaim does both reclaim and demotion. Reclaim and demotion incur different latency costs to the jobs in the cgroup. Demoted memory would still be addressable by the userspace at a higher latency, but reclaimed memory would need to incur a pagefault. The 'nodes' arg is useful to allow the userspace to control demotion and reclaim independently according to its policy: if the memory.reclaim is called on a node with demotion targets, it will attempt demotion first; if it is called on a node without demotion targets, it will only attempt reclaim. Signed-off-by: Mina Almasry Acked-by: Michal Hocko --- Documentation/admin-guide/cgroup-v2.rst | 15 +++--- include/linux/swap.h | 3 +- mm/memcontrol.c | 67 ++++++++++++++++++++----- mm/vmscan.c | 4 +- 4 files changed, 68 insertions(+), 21 deletions(-) -- 2.38.1.584.g0f3c55d4c2-goog diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 74cec76be9f2..ac5fcbcd5ae6 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1245,17 +1245,13 @@ PAGE_SIZE multiple when read back. This is a simple interface to trigger memory reclaim in the target cgroup. - This file accepts a single key, the number of bytes to reclaim. - No nested keys are currently supported. + This file accepts a string which contains the number of bytes to + reclaim. Example:: echo "1G" > memory.reclaim - The interface can be later extended with nested keys to - configure the reclaim behavior. For example, specify the - type of memory to reclaim from (anon, file, ..). - Please note that the kernel can over or under reclaim from the target cgroup. If less bytes are reclaimed than the specified amount, -EAGAIN is returned. @@ -1267,6 +1263,13 @@ PAGE_SIZE multiple when read back. This means that the networking layer will not adapt based on reclaim induced by memory.reclaim. + This file also allows the user to specify the nodes to reclaim from, + via the 'nodes=' key, example:: + + echo "1G nodes=0,1" > memory.reclaim + + The above instructs the kernel to reclaim memory from nodes 0,1. + memory.peak A read-only single value file which exists on non-root cgroups. diff --git a/include/linux/swap.h b/include/linux/swap.h index b61e2007d156..f542c114dffd 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -419,7 +419,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, - unsigned int reclaim_options); + unsigned int reclaim_options, + nodemask_t nodemask); extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem, gfp_t gfp_mask, bool noswap, pg_data_t *pgdat, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 23750cec0036..a0d7850173a9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -63,6 +63,7 @@ #include #include #include +#include #include "internal.h" #include #include @@ -2392,7 +2393,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg, psi_memstall_enter(&pflags); nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, - MEMCG_RECLAIM_MAY_SWAP); + MEMCG_RECLAIM_MAY_SWAP, + NODE_MASK_ALL); psi_memstall_leave(&pflags); } while ((memcg = parent_mem_cgroup(memcg)) && !mem_cgroup_is_root(memcg)); @@ -2683,7 +2685,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, psi_memstall_enter(&pflags); nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages, - gfp_mask, reclaim_options); + gfp_mask, reclaim_options, + NODE_MASK_ALL); psi_memstall_leave(&pflags); if (mem_cgroup_margin(mem_over_limit) >= nr_pages) @@ -3503,7 +3506,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg, } if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) { + memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, + NODE_MASK_ALL)) { ret = -EBUSY; break; } @@ -3614,7 +3618,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg) return -EINTR; if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, - MEMCG_RECLAIM_MAY_SWAP)) + MEMCG_RECLAIM_MAY_SWAP, + NODE_MASK_ALL)) nr_retries--; } @@ -6407,7 +6412,8 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, } reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP); + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, + NODE_MASK_ALL); if (!reclaimed && !nr_retries--) break; @@ -6456,7 +6462,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, if (nr_reclaims) { if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max, - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP)) + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, + NODE_MASK_ALL)) nr_reclaims--; continue; } @@ -6579,21 +6586,54 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, return nbytes; } +enum { + MEMORY_RECLAIM_NODES = 0, + MEMORY_RECLAIM_NULL, +}; + +static const match_table_t if_tokens = { + { MEMORY_RECLAIM_NODES, "nodes=%s" }, + { MEMORY_RECLAIM_NULL, NULL }, +}; + static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); unsigned int nr_retries = MAX_RECLAIM_RETRIES; unsigned long nr_to_reclaim, nr_reclaimed = 0; - unsigned int reclaim_options; - int err; + unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP | + MEMCG_RECLAIM_PROACTIVE; + char *old_buf, *start; + substring_t args[MAX_OPT_ARGS]; + int token; + char value[256]; + nodemask_t nodemask = NODE_MASK_ALL; buf = strstrip(buf); - err = page_counter_memparse(buf, "", &nr_to_reclaim); - if (err) - return err; - reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE; + old_buf = buf; + nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE; + if (buf == old_buf) + return -EINVAL; + + buf = strstrip(buf); + + while ((start = strsep(&buf, " ")) != NULL) { + if (!strlen(start)) + continue; + token = match_token(start, if_tokens, args); + match_strlcpy(value, args, sizeof(value)); + switch (token) { + case MEMORY_RECLAIM_NODES: + if (nodelist_parse(value, nodemask) < 0) + return -EINVAL; + break; + default: + return -EINVAL; + } + } + while (nr_reclaimed < nr_to_reclaim) { unsigned long reclaimed; @@ -6610,7 +6650,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_to_reclaim - nr_reclaimed, - GFP_KERNEL, reclaim_options); + GFP_KERNEL, reclaim_options, + nodemask); if (!reclaimed && !nr_retries--) return -EAGAIN; diff --git a/mm/vmscan.c b/mm/vmscan.c index 7b8e8e43806b..23fc5b523764 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -6735,7 +6735,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, - unsigned int reclaim_options) + unsigned int reclaim_options, + nodemask_t nodemask) { unsigned long nr_reclaimed; unsigned int noreclaim_flag; @@ -6750,6 +6751,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, .may_unmap = 1, .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP), .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE), + .nodemask = &nodemask, }; /* * Traverse the ZONELIST_FALLBACK zonelist of the current node to put