From patchwork Fri Sep 20 22:11:51 2024
X-Patchwork-Submitter: Kaiyang Zhao
X-Patchwork-Id: 13808714
From: kaiyang2@cs.cmu.edu
To: linux-mm@kvack.org, cgroups@vger.kernel.org
Cc: roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, akpm@linux-foundation.org, mhocko@kernel.org, nehagholkar@meta.com, abhishekd@meta.com, hannes@cmpxchg.org, weixugc@google.com, rientjes@google.com, Kaiyang Zhao
Subject: [RFC PATCH 4/4] reduce NUMA balancing scan size of cgroups over their local memory.low
Date: Fri, 20 Sep 2024 22:11:51 +0000
Message-ID: <20240920221202.1734227-5-kaiyang2@cs.cmu.edu>
In-Reply-To: <20240920221202.1734227-1-kaiyang2@cs.cmu.edu>
References: <20240920221202.1734227-1-kaiyang2@cs.cmu.edu>
MIME-Version: 1.0
From: Kaiyang Zhao

When the top-tier node has less free memory than the promotion
watermark, reduce the NUMA balancing scan size of cgroups that are over
their local memory.low, in proportion to their overage. In this case the
cgroup's top-tier memory usage should shrink, and demotion is already
working towards that goal.
A smaller scan size should slow the rate of promotion for the cgroup so
that it does not work against demotion. A minimum of 1/16th of
sysctl_numa_balancing_scan_size is still allowed for such cgroups,
because identifying hot pages trapped in the slow tier remains a worthy
(if secondary) goal. 16 is arbitrary and may need tuning.

Signed-off-by: Kaiyang Zhao
---
 kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 49 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1b756f927b2..1737b2369f56 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1727,14 +1727,21 @@ static inline bool cpupid_valid(int cpupid)
  * advantage of fast memory capacity, all recently accessed slow
  * memory pages will be migrated to fast memory node without
  * considering hot threshold.
+ * This is also used for detecting memory pressure and deciding whether
+ * limiting the promotion scan size is needed, for which we don't
+ * require more free pages than the promo watermark.
  */
-static bool pgdat_free_space_enough(struct pglist_data *pgdat)
+static bool pgdat_free_space_enough(struct pglist_data *pgdat,
+				    bool require_extra)
 {
 	int z;
 	unsigned long enough_wmark;
 
-	enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
-			   pgdat->node_present_pages >> 4);
+	if (require_extra)
+		enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
+				   pgdat->node_present_pages >> 4);
+	else
+		enough_wmark = 0;
 
 	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
 		struct zone *zone = pgdat->node_zones + z;
@@ -1846,7 +1853,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 	unsigned int latency, th, def_th;
 
 	pgdat = NODE_DATA(dst_nid);
-	if (pgdat_free_space_enough(pgdat)) {
+	if (pgdat_free_space_enough(pgdat, true)) {
 		/* workload changed, reset hot threshold */
 		pgdat->nbp_threshold = 0;
 		return true;
@@ -3214,10 +3221,14 @@ static void task_numa_work(struct callback_head *work)
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
-	long pages, virtpages;
+	long pages, virtpages, min_scan_pages;
 	struct vma_iterator vmi;
 	bool vma_pids_skipped;
 	bool vma_pids_forced = false;
+	struct pglist_data *pgdat = NODE_DATA(0); /* hardcoded node 0 */
+	struct mem_cgroup *memcg;
+	unsigned long cgroup_size, cgroup_locallow;
+	const long min_scan_pages_fraction = 16; /* 1/16th of the scan size */
 
 	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
 
@@ -3262,6 +3273,39 @@ static void task_numa_work(struct callback_head *work)
 
 	pages = sysctl_numa_balancing_scan_size;
 	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+
+	min_scan_pages = pages;
+	min_scan_pages /= min_scan_pages_fraction;
+
+	memcg = get_mem_cgroup_from_current();
+	/*
+	 * Reduce the scan size when the local node is under pressure
+	 * (WMARK_PROMO is not satisfied),
+	 * proportional to a cgroup's overage of its local memory guarantee.
+	 * 10% over: 68% of scan size
+	 * 20% over: 48% of scan size
+	 * 50% over: 20% of scan size
+	 * 100% over: 6% of scan size
+	 */
+	if (likely(memcg)) {
+		if (!pgdat_free_space_enough(pgdat, false)) {
+			cgroup_size = get_cgroup_local_usage(memcg, false);
+			/*
+			 * Protection needs refreshing, but reclaim on the
+			 * cgroup should have refreshed it recently.
+			 */
+			cgroup_locallow = READ_ONCE(memcg->memory.elocallow);
+			if (cgroup_size > cgroup_locallow) {
+				/* 1/x^4 */
+				for (int i = 0; i < 4; i++)
+					pages = pages * cgroup_locallow / (cgroup_size + 1);
+				/* Lower bound to min_scan_pages. */
+				pages = max(pages, min_scan_pages);
+			}
+		}
+		css_put(&memcg->css);
+	}
+
 	virtpages = pages * 8;	 /* Scan up to this much virtual space */
 	if (!pages)
 		return;