From patchwork Thu Mar 20 21:09:19 2025
X-Patchwork-Submitter: Gregory Price <gourry@gourry.net>
X-Patchwork-Id: 14024545
From: Gregory Price <gourry@gourry.net>
To: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@meta.com,
    longman@redhat.com, tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com,
    akpm@linux-foundation.org
Subject: [RFC PATCH] vmscan,cgroup: apply mems_effective to reclaim
Date: Thu, 20 Mar 2025 17:09:19 -0400
Message-ID: <20250320210919.439964-1-gourry@gourry.net>
X-Mailer: git-send-email 2.48.1
It is possible for a reclaimer to cause demotions of an lruvec belonging
to a cgroup with cpuset.mems set to exclude some nodes. Attempt to apply
this limitation based on the lruvec's memcg and prevent demotion.

Notably, this may still allow demotion of shared libraries or any memory
first instantiated in another cgroup. This means cpusets still cannot
guarantee complete isolation when demotion is enabled, and the docs have
been updated to reflect this.

Note: This is a fairly hacked up method that probably overlooks some
cgroup/cpuset controls or designs. RFCing now for some discussion at
LSFMM '25.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 .../ABI/testing/sysfs-kernel-mm-numa | 14 +++++---
 include/linux/cpuset.h               |  2 ++
 kernel/cgroup/cpuset.c               | 10 ++++++
 mm/vmscan.c                          | 32 ++++++++++++-------
 4 files changed, 41 insertions(+), 17 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-numa b/Documentation/ABI/testing/sysfs-kernel-mm-numa
index 77e559d4ed80..27cdcab901f7 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-numa
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-numa
@@ -16,9 +16,13 @@ Description: Enable/disable demoting pages during reclaim
         Allowing page migration during reclaim enables these
         systems to migrate pages from fast tiers to slow tiers
         when the fast tier is under pressure.  This migration
-        is performed before swap.  It may move data to a NUMA
-        node that does not fall into the cpuset of the
-        allocating process which might be construed to violate
-        the guarantees of cpusets.  This should not be enabled
-        on systems which need strict cpuset location
+        is performed before swap if an eligible numa node is
+        present in cpuset.mems for the cgroup. If cpusets.mems
+        changes at runtime, it may move data to a NUMA node that
+        does not fall into the cpuset of the new cpusets.mems,
+        which might be construed to violate the guarantees of
+        cpusets. Shared memory, such as libraries, owned by
+        another cgroup may still be demoted and result in memory
+        use on a node not present in cpusets.mem. This should not
+        be enabled on systems which need strict cpuset location
         guarantees.
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 835e7b793f6a..d4169f1b1719 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -171,6 +171,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
         task_unlock(current);
 }
 
+bool memcg_mems_allowed(struct mem_cgroup *memcg, int nid);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0f910c828973..bb9669cc105d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4296,3 +4296,13 @@ void cpuset_task_status_allowed(struct seq_file *m, struct task_struct *task)
         seq_printf(m, "Mems_allowed_list:\t%*pbl\n",
                    nodemask_pr_args(&task->mems_allowed));
 }
+
+bool memcg_mems_allowed(struct mem_cgroup *memcg, int nid)
+{
+        struct cgroup_subsys_state *css;
+        struct cpuset *cs;
+
+        css = cgroup_get_e_css(memcg->css.cgroup, &cpuset_cgrp_subsys);
+        cs = css ? container_of(css, struct cpuset, css) : NULL;
+        return cs ? node_isset(nid, cs->effective_mems) : true;
+}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2b2ab386cab5..04152ea1c03d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -342,16 +342,22 @@ static void flush_reclaim_state(struct scan_control *sc)
         }
 }
 
-static bool can_demote(int nid, struct scan_control *sc)
+static bool can_demote(int nid, struct scan_control *sc,
+                       struct mem_cgroup *memcg)
 {
+        int demotion_nid;
+
         if (!numa_demotion_enabled)
                 return false;
         if (sc && sc->no_demotion)
                 return false;
-        if (next_demotion_node(nid) == NUMA_NO_NODE)
+
+        demotion_nid = next_demotion_node(nid);
+        if (demotion_nid == NUMA_NO_NODE)
                 return false;
 
-        return true;
+        /* If demotion node isn't in mems_allowed, fall back */
+        return memcg ? memcg_mems_allowed(memcg, demotion_nid) : true;
 }
 
 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
@@ -376,7 +382,7 @@ static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
          *
          * Can it be reclaimed from this node via demotion?
          */
-        return can_demote(nid, sc);
+        return can_demote(nid, sc, NULL);
 }
 
 /*
@@ -1096,7 +1102,8 @@ static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask)
  */
 static unsigned int shrink_folio_list(struct list_head *folio_list,
                 struct pglist_data *pgdat, struct scan_control *sc,
-                struct reclaim_stat *stat, bool ignore_references)
+                struct reclaim_stat *stat, bool ignore_references,
+                struct mem_cgroup *memcg)
 {
         struct folio_batch free_folios;
         LIST_HEAD(ret_folios);
@@ -1109,7 +1116,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
         folio_batch_init(&free_folios);
         memset(stat, 0, sizeof(*stat));
         cond_resched();
-        do_demote_pass = can_demote(pgdat->node_id, sc);
+        do_demote_pass = can_demote(pgdat->node_id, sc, memcg);
 
 retry:
         while (!list_empty(folio_list)) {
@@ -1658,7 +1665,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
          */
         noreclaim_flag = memalloc_noreclaim_save();
         nr_reclaimed = shrink_folio_list(&clean_folios, zone->zone_pgdat, &sc,
-                                        &stat, true);
+                                        &stat, true, NULL);
         memalloc_noreclaim_restore(noreclaim_flag);
 
         list_splice(&clean_folios, folio_list);
@@ -2031,7 +2038,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
         if (nr_taken == 0)
                 return 0;
 
-        nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false);
+        nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false,
+                                         lruvec_memcg(lruvec));
 
         spin_lock_irq(&lruvec->lru_lock);
         move_folios_to_lru(lruvec, &folio_list);
@@ -2214,7 +2222,7 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
                 .no_demotion = 1,
         };
 
-        nr_reclaimed = shrink_folio_list(folio_list, pgdat, &sc, &stat, true);
+        nr_reclaimed = shrink_folio_list(folio_list, pgdat, &sc, &stat, true, NULL);
         while (!list_empty(folio_list)) {
                 folio = lru_to_folio(folio_list);
                 list_del(&folio->lru);
@@ -2654,7 +2662,7 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
                 return true;
 
         /* Also valuable if anon pages can be demoted: */
-        return can_demote(pgdat->node_id, sc);
+        return can_demote(pgdat->node_id, sc, NULL);
 }
 
 #ifdef CONFIG_LRU_GEN
@@ -2732,7 +2740,7 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
         if (!sc->may_swap)
                 return 0;
 
-        if (!can_demote(pgdat->node_id, sc) &&
+        if (!can_demote(pgdat->node_id, sc, NULL) &&
             mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
                 return 0;
 
@@ -4695,7 +4703,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
         if (list_empty(&list))
                 return scanned;
 retry:
-        reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false);
+        reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, NULL);
         sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
         sc->nr_reclaimed += reclaimed;
         trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
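
Not part of the patch above -- for discussion only, a minimal self-contained
userspace sketch of the gating order the new can_demote() applies. All names
here (struct demotion_ctx, demotion_allowed(), NO_NODE) are hypothetical
stand-ins, not kernel APIs; the only thing taken from the diff is the order of
checks: numa demotion enabled, scan_control::no_demotion, next_demotion_node(),
and finally memcg_mems_allowed() when a memcg is supplied.

/*
 * Toy userspace model (not kernel code) of the demotion gating order
 * introduced by this patch.  struct demotion_ctx, demotion_allowed()
 * and NO_NODE are invented for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

#define NO_NODE (-1)

struct demotion_ctx {
	bool numa_demotion_enabled;	/* /sys/kernel/mm/numa/demotion_enabled */
	bool sc_no_demotion;		/* models scan_control::no_demotion */
	int next_demotion_node;		/* models next_demotion_node(nid) */
	bool node_in_effective_mems;	/* models node_isset(nid, cs->effective_mems) */
	bool have_memcg;		/* reclaim path passed a memcg */
};

/* Mirrors the order of checks in the patched can_demote(). */
static bool demotion_allowed(const struct demotion_ctx *ctx)
{
	if (!ctx->numa_demotion_enabled)
		return false;
	if (ctx->sc_no_demotion)
		return false;
	if (ctx->next_demotion_node == NO_NODE)
		return false;
	/* No memcg (e.g. paths passing NULL) keeps the old behavior. */
	if (!ctx->have_memcg)
		return true;
	/* Otherwise the demotion target must be in the cgroup's effective mems. */
	return ctx->node_in_effective_mems;
}

int main(void)
{
	struct demotion_ctx ctx = {
		.numa_demotion_enabled = true,
		.sc_no_demotion = false,
		.next_demotion_node = 2,
		.node_in_effective_mems = false,	/* node 2 excluded by cpuset.mems */
		.have_memcg = true,
	};

	printf("demotion allowed: %s\n", demotion_allowed(&ctx) ? "yes" : "no");
	return 0;
}

With node_in_effective_mems false, the sketch prints "demotion allowed: no",
i.e. reclaim falls back (to swap) rather than demoting to a node the cgroup's
effective cpuset.mems excludes, which is the behavior the patch intends.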