From patchwork Mon Jul 4 07:06:01 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12904767
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen, Michal Hocko,
 Linux Kernel Mailing List, Hesham Almatary, Dave Hansen, Jonathan Cameron,
 Alistair Popple, Dan Williams, Johannes Weiner, jvgediya.oss@gmail.com,
 "Aneesh Kumar K.V", Jagdish Gediya
Subject: [PATCH v8 01/12] mm/demotion: Add support for explicit memory tiers
Date: Mon, 4 Jul 2022 12:36:01 +0530
Message-Id: <20220704070612.299585-2-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.36.1
In-Reply-To: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com>
References: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com>

In the current kernel, memory tiers are defined implicitly via a demotion
path relationship between NUMA nodes, which is created during the kernel
initialization and updated when a NUMA node is hot-added or hot-removed.
The current implementation puts all nodes with CPU into the top tier, and
builds the tier hierarchy tier-by-tier by establishing the per-node
demotion targets based on the distances between nodes.

This current memory tier kernel interface needs to be improved for several
important use cases,

The current tier initialization code always initializes each memory-only
NUMA node into a lower tier. But a memory-only NUMA node may have a high
performance memory device (e.g.
a DRAM device attached via CXL.mem or a DRAM-backed memory-only node on a
virtual machine) and should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top tier. But on
a system with HBM or GPU devices, the memory-only NUMA nodes mapping these
devices should be in the top tier, and DRAM nodes with CPUs are better
placed into the next lower tier.

With the current kernel, pages on a higher-tier node can only be demoted to
selected nodes on the next lower tier as defined by the demotion path, not
to any other node from any lower tier. This strict, hard-coded demotion
order does not work in all use cases (e.g. some use cases may want to allow
cross-socket demotion to another node in the same demotion tier as a
fallback when the preferred demotion node is out of space). This demotion
order is also inconsistent with the page allocation fallback order when all
the nodes in a higher tier are out of space: the page allocation can fall
back to any node from any lower tier, whereas the demotion order doesn't
allow that.

The current kernel also doesn't provide any interfaces for userspace to
learn about the memory tier hierarchy in order to optimize its memory
allocations.

This patch series addresses the above by defining memory tiers explicitly.

This patch introduces explicit memory tiers. The tier ID value of a memory
tier is used to derive the demotion order between NUMA nodes.

For example, if we have 3 memtiers: memtier100, memtier200, memtier300,
then the memory tier order is: memtier300 -> memtier200 -> memtier100,
where memtier300 is the highest tier and memtier100 is the lowest tier.

During reclaim, we migrate pages from fast (higher) tiers to slow (lower)
tiers when the fast (higher) tier is under memory pressure.

This patchset introduces 3 memory tiers (memtier100, memtier200 and
memtier300), which are created by different kernel subsystems. The default
memory tier created by the kernel is memtier200.
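
The ordering rule above can be sketched in plain userspace C. This is only
an illustrative model, not the kernel code from this patch: tier_list,
tier_insert() and tier_demotion_target() are hypothetical names, and
picking the next lower tier as the demotion target is a simplifying
assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model: tier IDs kept sorted, highest ID first. */
#define MAX_TIERS 8

struct tier_list {
	int ids[MAX_TIERS];
	size_t count;
};

/* Insert id so that ids[] stays in descending order (fastest tier first). */
static void tier_insert(struct tier_list *tl, int id)
{
	size_t pos = 0, i;

	assert(tl->count < MAX_TIERS);

	/* Skip every tier with a strictly higher ID. */
	while (pos < tl->count && tl->ids[pos] > id)
		pos++;

	/* Shift the lower tiers down by one slot to make room. */
	for (i = tl->count; i > pos; i--)
		tl->ids[i] = tl->ids[i - 1];

	tl->ids[pos] = id;
	tl->count++;
}

/*
 * Return the ID of the tier directly below id, or -1 when id is the
 * lowest tier or is not registered. Assumption for this sketch only:
 * demotion goes to the next lower tier.
 */
static int tier_demotion_target(const struct tier_list *tl, int id)
{
	size_t i;

	for (i = 0; i + 1 < tl->count; i++)
		if (tl->ids[i] == id)
			return tl->ids[i + 1];
	return -1;
}
```

Registering the IDs 200, 100 and 300 in that order leaves the array as
300, 200, 100, so tier_demotion_target() yields 200 for tier 300 and -1
for tier 100.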
A kernel parameter is provided to override the default memory tier.

Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
Signed-off-by: Jagdish Gediya
Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h | 15 +++++++
 mm/Makefile                  |  1 +
 mm/memory-tiers.c            | 78 ++++++++++++++++++++++++++++++++++++
 3 files changed, 94 insertions(+)
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
new file mode 100644
index 000000000000..a81dbc20e0d1
--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+#ifdef CONFIG_NUMA
+
+#define MEMORY_TIER_HBM_GPU	300
+#define MEMORY_TIER_DRAM	200
+#define MEMORY_TIER_PMEM	100
+
+#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
+#define MAX_MEMORY_TIER_ID	400
+
+#endif /* CONFIG_NUMA */
+#endif /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..d30acebc2164 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST) += memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
new file mode 100644
index 000000000000..69a5d81c0a12
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+#include
+#include
+#include
+
+struct memory_tier {
+	struct list_head list;
+	nodemask_t nodelist;
+	int id;
+};
+
+static DEFINE_MUTEX(memory_tier_lock);
+static LIST_HEAD(memory_tiers);
+
+static void insert_memory_tier(struct memory_tier *memtier)
+{
+	struct list_head *ent;
+	struct memory_tier *tmp_memtier;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	list_for_each(ent, &memory_tiers) {
+		tmp_memtier = list_entry(ent, struct memory_tier, list);
+		if (tmp_memtier->id < memtier->id) {
+			list_add_tail(&memtier->list, ent);
+			return;
+		}
+	}
+	list_add_tail(&memtier->list, &memory_tiers);
+}
+
+static struct memory_tier *register_memory_tier(unsigned int tier)
+{
+	struct memory_tier *memtier;
+
+	if (tier > MAX_MEMORY_TIER_ID)
+		return ERR_PTR(-EINVAL);
+
+	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!memtier)
+		return ERR_PTR(-ENOMEM);
+
+	memtier->id = tier;
+
+	insert_memory_tier(memtier);
+
+	return memtier;
+}
+
+static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
+core_param(default_memory_tier, default_memtier, uint, 0644);
+
+static int __init memory_tier_init(void)
+{
+	struct memory_tier *memtier;
+
+	/*
+	 * Register only default memory tier to hide all empty
+	 * memory tier from sysfs. Since this is early during
+	 * boot, we could avoid holding memory_tier_lock. But
+	 * keep it simple by holding locks. So we can add lock
+	 * held debug checks in other functions.
+	 */
+	mutex_lock(&memory_tier_lock);
+	memtier = register_memory_tier(default_memtier);
+	if (IS_ERR(memtier))
+		panic("%s() failed to register memory tier: %ld\n",
+		      __func__, PTR_ERR(memtier));
+
+	/* CPU only nodes are not part of memory tiers. */
+	memtier->nodelist = node_states[N_MEMORY];
+	mutex_unlock(&memory_tier_lock);
+	return 0;
+}
+subsys_initcall(memory_tier_init);

From patchwork Mon Jul 4 07:06:02 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12904778
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen, Michal Hocko,
 Linux Kernel Mailing List, Hesham Almatary, Dave Hansen, Jonathan Cameron,
 Alistair Popple, Dan Williams, Johannes Weiner, jvgediya.oss@gmail.com,
 "Aneesh Kumar K.V"
Subject: [PATCH v8 02/12] mm/demotion: Move memory demotion related code
Date: Mon, 4 Jul 2022 12:36:02 +0530
Message-Id: <20220704070612.299585-3-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.36.1
In-Reply-To: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com>
References: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com>

This moves the memory demotion related code to mm/memory-tiers.c. No
functional change in this patch.

Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h |  7 ++++
 include/linux/migrate.h      |  2 --
 mm/memory-tiers.c            | 63 ++++++++++++++++++++++++++++++++++++
 mm/migrate.c                 | 60 +---------------------------------
 mm/vmscan.c                  |  1 +
 5 files changed, 72 insertions(+), 61 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index a81dbc20e0d1..c47dbe381089 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -2,6 +2,8 @@
 #ifndef _LINUX_MEMORY_TIERS_H
 #define _LINUX_MEMORY_TIERS_H
 
+#include
+
 #ifdef CONFIG_NUMA
 
 #define MEMORY_TIER_HBM_GPU	300
@@ -11,5 +13,10 @@
 #define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
 #define MAX_MEMORY_TIER_ID	400
 
+extern bool numa_demotion_enabled;
+
+#else
+
+#define numa_demotion_enabled	false
 #endif /* CONFIG_NUMA */
 #endif /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 069a89e847f3..43e737215f33 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -78,7 +78,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
 extern void set_migration_target_nodes(void);
 extern void migrate_on_reclaim_init(void);
-extern bool numa_demotion_enabled;
 extern int next_demotion_node(int node);
 #else
 static inline void set_migration_target_nodes(void) {}
@@ -87,7 +86,6 @@ static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
-#define numa_demotion_enabled	false
 #endif
 
 #ifdef CONFIG_COMPACTION
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 69a5d81c0a12..2dcf70802661 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include
+#include
 #include
 #include
 #include
@@ -76,3 +77,65 @@ static int __init memory_tier_init(void)
 	return 0;
 }
 subsys_initcall(memory_tier_init);
+
+bool numa_demotion_enabled = false;
+
+#ifdef CONFIG_MIGRATION
+#ifdef CONFIG_SYSFS
+static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
+					  struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%s\n",
+			  numa_demotion_enabled ? "true" : "false");
+}
+
+static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = kstrtobool(buf, &numa_demotion_enabled);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute numa_demotion_enabled_attr =
+	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
+	       numa_demotion_enabled_store);
+
+static struct attribute *numa_attrs[] = {
+	&numa_demotion_enabled_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group numa_attr_group = {
+	.attrs = numa_attrs,
+};
+
+static int __init numa_init_sysfs(void)
+{
+	int err;
+	struct kobject *numa_kobj;
+
+	numa_kobj = kobject_create_and_add("numa", mm_kobj);
+	if (!numa_kobj) {
+		pr_err("failed to create numa kobject\n");
+		return -ENOMEM;
+	}
+	err = sysfs_create_group(numa_kobj, &numa_attr_group);
+	if (err) {
+		pr_err("failed to register numa group\n");
+		goto delete_obj;
+	}
+	return 0;
+
+delete_obj:
+	kobject_put(numa_kobj);
+	return err;
+}
+subsys_initcall(numa_init_sysfs);
+#endif /* CONFIG_SYSFS */
+#endif
diff --git a/mm/migrate.c b/mm/migrate.c
index 6c1ea61f39d8..fce7d4a9e940 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2509,64 +2509,6 @@ void __init migrate_on_reclaim_init(void)
 	set_migration_target_nodes();
 	cpus_read_unlock();
 }
+#endif /* CONFIG_NUMA */
 
-bool numa_demotion_enabled = false;
-
-#ifdef CONFIG_SYSFS
-static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
-					  struct kobj_attribute *attr, char *buf)
-{
-	return sysfs_emit(buf, "%s\n",
-			  numa_demotion_enabled ? "true" : "false");
-}
-
-static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
-					   struct kobj_attribute *attr,
-					   const char *buf, size_t count)
-{
-	ssize_t ret;
-
-	ret = kstrtobool(buf, &numa_demotion_enabled);
-	if (ret)
-		return ret;
-
-	return count;
-}
-
-static struct kobj_attribute numa_demotion_enabled_attr =
-	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
-	       numa_demotion_enabled_store);
-
-static struct attribute *numa_attrs[] = {
-	&numa_demotion_enabled_attr.attr,
-	NULL,
-};
-
-static const struct attribute_group numa_attr_group = {
-	.attrs = numa_attrs,
-};
-
-static int __init numa_init_sysfs(void)
-{
-	int err;
-	struct kobject *numa_kobj;
 
-	numa_kobj = kobject_create_and_add("numa", mm_kobj);
-	if (!numa_kobj) {
-		pr_err("failed to create numa kobject\n");
-		return -ENOMEM;
-	}
-	err = sysfs_create_group(numa_kobj, &numa_attr_group);
-	if (err) {
-		pr_err("failed to register numa group\n");
-		goto delete_obj;
-	}
-	return 0;
-
-delete_obj:
-	kobject_put(numa_kobj);
-	return err;
-}
-subsys_initcall(numa_init_sysfs);
-#endif /* CONFIG_SYSFS */
-#endif /* CONFIG_NUMA */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f7d9a683e3a7..3a8f78277f99 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -50,6 +50,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include

From patchwork Mon Jul 4 07:06:03 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12904780
X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id F0A19900002; Mon, 4 Jul 2022 03:17:57 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id E0F4E6B0074 for ; Mon, 4 Jul 2022 03:17:57 -0400 (EDT) Received: from smtpin20.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id 34E8A21EC7 for ; Mon, 4 Jul 2022 07:08:11 +0000 (UTC) X-FDA: 79648538424.20.BDA81FE Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf01.hostedemail.com (Postfix) with ESMTP id 8FDC54002D for ; Mon, 4 Jul 2022 07:08:10 +0000 (UTC) Received: from pps.filterd (m0098409.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 2646kYtU001473; Mon, 4 Jul 2022 07:08:01 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=+PDy9N/grOL5ksV+lMfzpMQWbub5D2w8o7M+aMpzQxk=; b=cqwrk2Vvz+6nUSVlGDTyP/3jqigp0CDGynvTYhn0oGKg6RXzw6gOaqghvqTUib+6ZAxU 18NuxFBwXaMFscKGZTqG4k0pSFLkJnBqbFO6MUYmqxxXeWKHYR8+gujZeaHP1lfYQXZM 7KiTOuCrBeJmdN1Wi2xyfClE7vF36tMGbCeypLATx7cwpJxQJgo/8tllB1OUG+7vYtnP ilkjQNby/Ud6ZfCevwVH08RZdc6A7nI5NM2nFUg4py0eJpuAdiFyFlN0FaYBKV8OSkeK 6JLIJJJKkB8q1cMyxwvxZ8E5+ithCMMbSAiOzxpJ3DpPJqbMLo8m+gFlxpWTnC8jp5MI qw== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3h3u7qree9-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:08:01 +0000 Received: from m0098409.ppops.net (m0098409.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 2646qUkP020406; Mon, 4 Jul 2022 07:08:00 GMT Received: from ppma05wdc.us.ibm.com (1b.90.2fa9.ip4.static.sl-reverse.com 
[169.47.144.27]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3h3u7qredk-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:08:00 +0000 Received: from pps.filterd (ppma05wdc.us.ibm.com [127.0.0.1]) by ppma05wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 26476Hfb002745; Mon, 4 Jul 2022 07:07:59 GMT Received: from b01cxnp22034.gho.pok.ibm.com (b01cxnp22034.gho.pok.ibm.com [9.57.198.24]) by ppma05wdc.us.ibm.com with ESMTP id 3h2dn9349g-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:07:59 +0000 Received: from b01ledav002.gho.pok.ibm.com (b01ledav002.gho.pok.ibm.com [9.57.199.107]) by b01cxnp22034.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 26477wWA67043818 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 4 Jul 2022 07:07:58 GMT Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 6D4E8124055; Mon, 4 Jul 2022 07:07:58 +0000 (GMT) Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 308F0124053; Mon, 4 Jul 2022 07:07:52 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.74.198]) by b01ledav002.gho.pok.ibm.com (Postfix) with ESMTP; Mon, 4 Jul 2022 07:07:51 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V" , Jagdish Gediya Subject: [PATCH v8 03/12] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM Date: Mon, 4 Jul 2022 12:36:03 +0530 Message-Id: <20220704070612.299585-4-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com> References: 
<20220704070612.299585-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: o695j5USkOx_mP4-gyYvhj2XvaV2wf94 X-Proofpoint-ORIG-GUID: dCyUrRb0EQK5RCTltTXMcU6xO75jXvlG X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-07-04_05,2022-06-28_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1015 lowpriorityscore=0 malwarescore=0 bulkscore=0 mlxscore=0 phishscore=0 mlxlogscore=999 adultscore=0 suspectscore=0 impostorscore=0 priorityscore=1501 spamscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2207040030 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1656918490; a=rsa-sha256; cv=none; b=DXYhREZetKQRZ5L6Drr6Zd4K7qNRV9QRbxRwEUhN7q7VKPR+QfwQmoLnkJFabmEPW6DmJg 5cgUqADRNViYMaVQhfdmCg6rNLceyXJyo5pAMDtV3CVCkAcn7Ey5wadzQpjPtwbdnFkkX9 pgtD4SZs5Bae2/94dJzQVeVwT7vn9/w= ARC-Authentication-Results: i=1; imf01.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=cqwrk2Vv; spf=pass (imf01.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1656918490; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=+PDy9N/grOL5ksV+lMfzpMQWbub5D2w8o7M+aMpzQxk=; b=CHWzxSEa7U/5Ixv7rKlYCKfCO7IYBJmBzrmeS9BxVCVFneztZpVd8i4M7U3en9m6jGXwAe GgOI96FyYtW5ob9wdixGm80ZyP8ZetLIecfQFeLXvz7syIWv35uUwZjWAdHj0KbFWPaRZe 2w/7F8fIf3HZbZub/te0v5Owx9y8e1M= X-Stat-Signature: ox4uhgzf3cnhgj35ig8e66dehb7j4kkp X-Rspamd-Queue-Id: 8FDC54002D Authentication-Results: 
imf01.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=cqwrk2Vv; spf=pass (imf01.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com X-Rspam-User: X-Rspamd-Server: rspam12 X-HE-Tag: 1656918490-7349 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: By default, all nodes are assigned to DEFAULT_MEMORY_TIER, which is the memory tier designated for nodes with DRAM. Set the dax kmem device node's tier to MEMORY_TIER_PMEM. MEMORY_TIER_PMEM appears below DEFAULT_MEMORY_TIER in demotion order. Signed-off-by: Jagdish Gediya Signed-off-by: Aneesh Kumar K.V --- drivers/dax/kmem.c | 6 ++- include/linux/memory-tiers.h | 5 +++ mm/memory-tiers.c | 79 ++++++++++++++++++++++++++++++++++++ 3 files changed, 89 insertions(+), 1 deletion(-) diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index a37622060fff..0c03889286ac 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -11,6 +11,7 @@ #include #include #include +#include #include "dax-private.h" #include "bus.h" @@ -41,6 +42,9 @@ struct dax_kmem_data { struct resource *res[]; }; +static unsigned int dax_kmem_memtier = MEMORY_TIER_PMEM; +module_param(dax_kmem_memtier, uint, 0644); + static int dev_dax_kmem_probe(struct dev_dax *dev_dax) { struct device *dev = &dev_dax->dev; @@ -146,7 +150,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) } dev_set_drvdata(dev, data); - + node_create_and_set_memory_tier(numa_node, dax_kmem_memtier); return 0; err_request_mem: diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index c47dbe381089..9d36ff13c954 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -14,9 +14,14 @@ #define MAX_MEMORY_TIER_ID 400 extern bool numa_demotion_enabled; +int
node_create_and_set_memory_tier(int node, int tier); #else #define numa_demotion_enabled false +static inline int node_create_and_set_memory_tier(int node, int tier) +{ + return 0; +} #endif /* CONFIG_NUMA */ #endif /* _LINUX_MEMORY_TIERS_H */ diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 2dcf70802661..fc404fcff7ff 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -51,6 +51,85 @@ static struct memory_tier *register_memory_tier(unsigned int tier) return memtier; } +static void unregister_memory_tier(struct memory_tier *memtier) +{ + list_del(&memtier->list); + kfree(memtier); +} + +static struct memory_tier *__node_get_memory_tier(int node) +{ + struct memory_tier *memtier; + + list_for_each_entry(memtier, &memory_tiers, list) { + if (node_isset(node, memtier->nodelist)) + return memtier; + } + return NULL; +} + +static struct memory_tier *__get_memory_tier_from_id(int id) +{ + struct memory_tier *memtier; + + list_for_each_entry(memtier, &memory_tiers, list) { + if (memtier->id == id) + return memtier; + } + return NULL; +} + +static int __node_create_and_set_memory_tier(int node, int tier) +{ + int ret = 0; + struct memory_tier *memtier; + + memtier = __get_memory_tier_from_id(tier); + if (!memtier) { + memtier = register_memory_tier(tier); + if (IS_ERR(memtier)) { + ret = -EINVAL; + goto out; + } + } + node_set(node, memtier->nodelist); +out: + return ret; +} + +int node_create_and_set_memory_tier(int node, int tier) +{ + struct memory_tier *current_tier; + int ret = 0; + + mutex_lock(&memory_tier_lock); + + current_tier = __node_get_memory_tier(node); + if (!current_tier) { + ret = __node_create_and_set_memory_tier(node, tier); + goto out; + } + + if (current_tier->id == tier) + goto out; + + node_clear(node, current_tier->nodelist); + + ret = __node_create_and_set_memory_tier(node, tier); + if (ret) { + /* reset it back to older tier */ + node_set(node, current_tier->nodelist); + goto out; + } + if (nodes_empty(current_tier->nodelist)) + 
unregister_memory_tier(current_tier); +out: + mutex_unlock(&memory_tier_lock); + + return ret; +} +EXPORT_SYMBOL_GPL(node_create_and_set_memory_tier); + static unsigned int default_memtier = DEFAULT_MEMORY_TIER; core_param(default_memory_tier, default_memtier, uint, 0644); From patchwork Mon Jul 4 07:06:04 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12904779 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 96C55C433EF for ; Mon, 4 Jul 2022 07:15:38 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 3C226900002; Mon, 4 Jul 2022 03:15:38 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 370E86B0078; Mon, 4 Jul 2022 03:15:38 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 21182900002; Mon, 4 Jul 2022 03:15:38 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id 1103B6B0074 for ; Mon, 4 Jul 2022 03:15:38 -0400 (EDT) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay13.hostedemail.com (Postfix) with ESMTP id 1E85A617AA for ; Mon, 4 Jul 2022 07:08:22 +0000 (UTC) X-FDA: 79648538844.05.BB4184C Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by imf02.hostedemail.com (Postfix) with ESMTP id 902FC800C3 for ; Mon, 4 Jul 2022 07:08:21 +0000 (UTC) Received: from pps.filterd (m0098416.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 2645ok7M020145; Mon, 4 Jul 2022 07:08:07 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : 
date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=VynPHaB44iwLNm1NI/jimLhyT7ZD9Z8XYUiK5QEqD2Y=; b=arAGugfnK+sZGy+OTe08BErvi5QURoAsTLl/lxWzH+lpZ375eT1SerSAvMdDU9Vx8V3J AC6T8HX4hpAP0mqkeNIA4ZvrMDwUvj/4sXZC7iqJFoIzq2EHhPzn6dyK92efVjUUhaRT 3PSBh+Kqm8MPz6tu6szC6Ec8GIro+VvorrA9rVp82z9vuwWd3cAT5cuJOTwQEIC6w20X V5/SaafFik448tVhdYA96TFpXdDcRTYMLP6WIKSvstEp2MkGrnYc4KmX7jtIALykKIU2 g9zJt8QZVjStnc7/xvIklI9bri7tNvd5QXFqHkkHcI7A0n+AHND6QkosPC47sesWhkBe 6Q== Received: from pps.reinject (localhost [127.0.0.1]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3h3tdw1jqw-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:08:07 +0000 Received: from m0098416.ppops.net (m0098416.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 2646TDaT018292; Mon, 4 Jul 2022 07:08:07 GMT Received: from ppma04wdc.us.ibm.com (1a.90.2fa9.ip4.static.sl-reverse.com [169.47.144.26]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3h3tdw1jqg-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:08:06 +0000 Received: from pps.filterd (ppma04wdc.us.ibm.com [127.0.0.1]) by ppma04wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 26475TmZ030738; Mon, 4 Jul 2022 07:08:06 GMT Received: from b01cxnp22034.gho.pok.ibm.com (b01cxnp22034.gho.pok.ibm.com [9.57.198.24]) by ppma04wdc.us.ibm.com with ESMTP id 3h2dn9b4es-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:08:06 +0000 Received: from b01ledav002.gho.pok.ibm.com (b01ledav002.gho.pok.ibm.com [9.57.199.107]) by b01cxnp22034.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 264785rR64684344 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 4 Jul 2022 07:08:05 GMT Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 701A312405A; Mon, 4 Jul 2022 07:08:05 +0000 (GMT) Received: from 
b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 0EBD0124052; Mon, 4 Jul 2022 07:07:59 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.74.198]) by b01ledav002.gho.pok.ibm.com (Postfix) with ESMTP; Mon, 4 Jul 2022 07:07:58 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V" Subject: [PATCH v8 04/12] mm/demotion: Add hotplug callbacks to handle new numa node onlined Date: Mon, 4 Jul 2022 12:36:04 +0530 Message-Id: <20220704070612.299585-5-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com> References: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: u4SUYerLbGMNatFRZwC6xBmkxPhOHZMb X-Proofpoint-ORIG-GUID: opr2ipzNiIAs3eU8bJubdp00sqfbHjGo X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-07-04_05,2022-06-28_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1015 priorityscore=1501 mlxscore=0 adultscore=0 impostorscore=0 lowpriorityscore=0 suspectscore=0 phishscore=0 malwarescore=0 spamscore=0 bulkscore=0 mlxlogscore=999 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2207040030 ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1656918501; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; 
bh=VynPHaB44iwLNm1NI/jimLhyT7ZD9Z8XYUiK5QEqD2Y=; b=r3n3zSYzKbwNh1dH6aTQPySIJxJmiHOYhT8CbCh6lMdybYUPFSz3O4Kzdw5TI8hm5q/DUh ix1ST97V3EJO89XUpc0glgUc6BkWsUh6dmpiHHLCXdn9h2609+3J8AHYIEEczMtV6rW2JS tcXY3eZhiES+dOCavCYsMojFRtadORU= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1656918501; a=rsa-sha256; cv=none; b=3JvZCy748suOEb+N+3RYxNy+3K9YxgK4XpBXtGSTVz+W4bdL1EbiI4EgwShf1/qf/10lT/ BpWQq0NCc2YROJmBOOOtl7bopEWKmIQNSuuRuAEzkZ1uHtlzUYns9wXO5re2Fe/55iRYu8 mD/nxTBhN4yZUKUhbDiGg0PkWfZO4dA= ARC-Authentication-Results: i=1; imf02.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=arAGugfn; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf02.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Stat-Signature: 4harhm8ofsoaypyqs57ctixp3r5who9o X-Rspamd-Queue-Id: 902FC800C3 X-Rspam-User: Authentication-Results: imf02.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=arAGugfn; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf02.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspamd-Server: rspam06 X-HE-Tag: 1656918501-984300 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: If a newly onlined NUMA node doesn't have a memory tier assigned, the kernel adds the node to the default memory tier.
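The rule above (assign the default tier only when the node has no tier yet) can be sketched in userspace C. This is only an illustrative model, not the kernel code: struct tier_map, tier_map_init() and node_online() are hypothetical names, and a plain array stands in for the kernel's per-tier nodemasks.

```c
#define MAP_NODES 8      /* hypothetical node count for the sketch */
#define NO_TIER   (-1)   /* marks a node without a tier assignment */

/* Hypothetical stand-in for the kernel's tier bookkeeping. */
struct tier_map {
	int tier[MAP_NODES];
};

/* Start with every node unassigned. */
static void tier_map_init(struct tier_map *m)
{
	for (int n = 0; n < MAP_NODES; n++)
		m->tier[n] = NO_TIER;
}

/*
 * Model of the MEM_ONLINE handling: a node that already has a tier
 * (for example one set earlier by a driver) keeps it; otherwise the
 * node joins the default tier. Returns the tier the node ends up in,
 * or NO_TIER for an out-of-range node.
 */
static int node_online(struct tier_map *m, int node, int default_tier)
{
	if (node < 0 || node >= MAP_NODES)
		return NO_TIER;
	if (m->tier[node] == NO_TIER)
		m->tier[node] = default_tier;
	return m->tier[node];
}
```

In this model, onlining a node that a driver already placed in tier 100 leaves it in tier 100, while an unassigned node lands in whatever default is passed in.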
Signed-off-by: Aneesh Kumar K.V --- mm/memory-tiers.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 68 insertions(+) diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index fc404fcff7ff..2147112981a6 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -5,6 +5,7 @@ #include #include #include +#include #include struct memory_tier { @@ -130,8 +131,73 @@ int node_create_and_set_memory_tier(int node, int tier) } EXPORT_SYMBOL_GPL(node_create_and_set_memory_tier); +static int __node_set_memory_tier(int node, int tier) +{ + int ret = 0; + struct memory_tier *memtier; + + memtier = __get_memory_tier_from_id(tier); + if (!memtier) { + ret = -EINVAL; + goto out; + } + node_set(node, memtier->nodelist); +out: + return ret; +} + +static int node_set_memory_tier(int node, int tier) +{ + struct memory_tier *memtier; + int ret = 0; + + mutex_lock(&memory_tier_lock); + memtier = __node_get_memory_tier(node); + if (!memtier) + ret = __node_set_memory_tier(node, tier); + + mutex_unlock(&memory_tier_lock); + + return ret; +} + static unsigned int default_memtier = DEFAULT_MEMORY_TIER; core_param(default_memory_tier, default_memtier, uint, 0644); +/* + * This runs whether reclaim-based migration is enabled or not, + * which ensures that the user can turn on reclaim-based migration + * at any time without needing to recalculate migration targets. + */ +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, + unsigned long action, void *_arg) +{ + struct memory_notify *arg = _arg; + + /* + * Only update the node migration order when a node is + * changing status, like online->offline. + */ + if (arg->status_change_nid < 0) + return notifier_from_errno(0); + + switch (action) { + case MEM_ONLINE: + /* + * We ignore the error here. If the node already has a tier + * registered, we continue to use it for the new memory + * we are adding here.
+ */ + node_set_memory_tier(arg->status_change_nid, default_memtier); + break; + } + + return notifier_from_errno(0); +} + +static void __init migrate_on_reclaim_init(void) +{ + hotplug_memory_notifier(migrate_on_reclaim_callback, 100); +} static int __init memory_tier_init(void) { @@ -153,6 +219,8 @@ static int __init memory_tier_init(void) /* CPU only nodes are not part of memory tiers. */ memtier->nodelist = node_states[N_MEMORY]; mutex_unlock(&memory_tier_lock); + + migrate_on_reclaim_init(); return 0; } subsys_initcall(memory_tier_init); From patchwork Mon Jul 4 07:06:05 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12904770 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id DD713C43334 for ; Mon, 4 Jul 2022 07:10:24 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 7DD00900002; Mon, 4 Jul 2022 03:10:24 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 78C756B0078; Mon, 4 Jul 2022 03:10:24 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 607A0900002; Mon, 4 Jul 2022 03:10:24 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 50C466B0074 for ; Mon, 4 Jul 2022 03:10:24 -0400 (EDT) Received: from smtpin02.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id 144CD35266 for ; Mon, 4 Jul 2022 07:08:28 +0000 (UTC) X-FDA: 79648539138.02.DE8A172 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf23.hostedemail.com (Postfix) with ESMTP id D8EAA140060 for ; Mon, 4 Jul 2022 07:08:26 
+0000 (UTC) Received: from pps.filterd (m0098409.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 2646kSUe001328; Mon, 4 Jul 2022 07:08:16 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=WlQGVcmeYXcsCBindSpYWJ/8YYdVVchAfYmhxEHP9/Y=; b=Rx+g8MvTDAhEOEDT8glQTHZIpNAivY+5cPIPQvg54cGD16bfMbMeYkJT+V+QdZG0fS6w Nlv6j58Csj8p14BcGPl3YKrD8OJv2vH20PUOpr/e+pkOxpdCSnac3qeDCKAzQsm249fs YuQsgOIl9VbOKUKnuMu+jbgxsQrDxSQFBrb1IuvwTHGo8RRwFZZ1Bq09T6Tccwu4Haz/ Dh6GCgoOyWsm9aUDJ6NJvlFrUlFHvXyb+4Um1UuoNr7CoP/6ppbnQlC7d4+jKxHUK0g8 NQAcPxi38mq9HAnfR9HGVwHYAs/4vm45HMTLi42oJzHSJfOcTRwhMA2/lpDnCj7ux5RA Rg== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3h3u7qrek0-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:08:15 +0000 Received: from m0098409.ppops.net (m0098409.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 2646kqti001710; Mon, 4 Jul 2022 07:08:15 GMT Received: from ppma01dal.us.ibm.com (83.d6.3fa9.ip4.static.sl-reverse.com [169.63.214.131]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3h3u7qrejh-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:08:14 +0000 Received: from pps.filterd (ppma01dal.us.ibm.com [127.0.0.1]) by ppma01dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 26475ZKc029069; Mon, 4 Jul 2022 07:08:13 GMT Received: from b01cxnp22035.gho.pok.ibm.com (b01cxnp22035.gho.pok.ibm.com [9.57.198.25]) by ppma01dal.us.ibm.com with ESMTP id 3h2dn9kmpk-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:08:13 +0000 Received: from b01ledav002.gho.pok.ibm.com (b01ledav002.gho.pok.ibm.com [9.57.199.107]) by b01cxnp22035.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 
26478D8V37945744 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 4 Jul 2022 07:08:13 GMT Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id F1C0C124053; Mon, 4 Jul 2022 07:08:12 +0000 (GMT) Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 38D5A124052; Mon, 4 Jul 2022 07:08:06 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.74.198]) by b01ledav002.gho.pok.ibm.com (Postfix) with ESMTP; Mon, 4 Jul 2022 07:08:05 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V" Subject: [PATCH v8 05/12] mm/demotion: Build demotion targets based on explicit memory tiers Date: Mon, 4 Jul 2022 12:36:05 +0530 Message-Id: <20220704070612.299585-6-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com> References: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: n16ucIB14ZTOukrfZLZFu2dFAmI9Smss X-Proofpoint-ORIG-GUID: B3oekHMv4inFku6HcqBRWCLHlW8D3vql X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-07-04_05,2022-06-28_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1015 lowpriorityscore=0 malwarescore=0 bulkscore=0 mlxscore=0 phishscore=0 mlxlogscore=999 adultscore=0 suspectscore=0 impostorscore=0 priorityscore=1501 spamscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2207040030 ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; 
s=arc-20220608; t=1656918507; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=WlQGVcmeYXcsCBindSpYWJ/8YYdVVchAfYmhxEHP9/Y=; b=ZgSC3BTFQUBKqnrAoKMX+SCR/j9bepmq/vJ84rgQgEa4CklOKkyP9LcTR/T75Js9Qg94fN NlQaXewb49YFMytRZAFAOhp7sufbKP7NFMVNNSAKKX7uj8XeBXEpFHUEJDhrhnIyUxA8OY tGMbagNlERWb5+/PXNKu/tjwbG0jucE= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1656918507; a=rsa-sha256; cv=none; b=kCRmdEthdNOxb/NaJYz9I1zOlXL285XyrzxbvdBkLgjYa/akUCB1dYMV1mn/LLPsjK9ccQ vWPMDvw7bepayqqqL3Hrg9nh6JIPDZiR32WyWad6CnMs9UVTb5evZTvFwFQkp41A9+d4xG Lvrx/OpQ8+E0xT4Z/TdEBLxSRMU1mmU= ARC-Authentication-Results: i=1; imf23.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=Rx+g8MvT; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf23.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Stat-Signature: 8wqk5z4bh94yefrxf5tpyicz3poq75pb X-Rspamd-Queue-Id: D8EAA140060 X-Rspam-User: Authentication-Results: imf23.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=Rx+g8MvT; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf23.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspamd-Server: rspam06 X-HE-Tag: 1656918506-437964 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This patch switches the demotion target building logic to use memory tiers instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the default tier 200, and additional memory tiers will be added by drivers like dax kmem.
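As a rough illustration of the tier-based target selection this commit message describes, here is a small userspace C model. It is a sketch under stated assumptions, not the kernel implementation: struct topo, next_lower_tier() and preferred_targets() are hypothetical names, a higher tier ID is assumed to mean a faster tier, tier IDs are assumed non-negative, and an unsigned bitmask stands in for nodemask_t.

```c
#include <limits.h>

#define NR_NODES 4	/* hypothetical fixed node count for the sketch */

/* Hypothetical model: per-node tier ID plus a NUMA distance matrix. */
struct topo {
	int tier_of[NR_NODES];
	int distance[NR_NODES][NR_NODES];
};

/* Tier ID of the nearest populated tier strictly below 'tier', or -1. */
static int next_lower_tier(const struct topo *t, int tier)
{
	int best = -1;

	for (int n = 0; n < NR_NODES; n++)
		if (t->tier_of[n] < tier && t->tier_of[n] > best)
			best = t->tier_of[n];
	return best;
}

/*
 * Bitmask of preferred demotion targets for 'node': every node in the
 * next lower tier that sits at the minimum distance from 'node'.
 */
static unsigned int preferred_targets(const struct topo *t, int node)
{
	int lower = next_lower_tier(t, t->tier_of[node]);
	int best = INT_MAX;
	unsigned int mask = 0;

	if (lower < 0)
		return 0;	/* lowest tier: nothing to demote to */

	for (int n = 0; n < NR_NODES; n++) {
		int d;

		if (t->tier_of[n] != lower)
			continue;
		d = t->distance[node][n];
		if (d < best) {
			best = d;		/* strictly closer: restart the mask */
			mask = 1u << n;
		} else if (d == best) {
			mask |= 1u << n;	/* tie: keep both as candidates */
		}
	}
	return mask;
}
```

With nodes 0-1 in tier 200, nodes 2-3 in tier 100, and the distances from Example 1 in the node_demotion[] comment, this model picks node 2 for node 0 and node 3 for node 1, and returns an empty mask for the tier-100 nodes.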
This patch builds the demotion target for a NUMA node by looking at all memory tiers below the tier to which the NUMA node belongs. The closest node in the immediately following memory tier is used as a demotion target. Since we now build demotion targets only for N_MEMORY NUMA nodes, the CPU hotplug calls are removed in this patch. A new memory tier can be inserted into the tier hierarchy for a new set of nodes without affecting the node assignment of any existing memtier, provided that there is enough gap in the tier ID values for the new memtier. The absolute value of a memtier's tier ID doesn't necessarily carry any meaning; its value relative to other memtiers decides the level of the memtier in the tier hierarchy. For now, this patch supports only the hardcoded tier ID values 300, 200 and 100 for memory tiers. Suggested-by: Wei Xu Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 13 ++ include/linux/migrate.h | 13 -- mm/memory-tiers.c | 227 ++++++++++++++++++++ mm/migrate.c | 394 ----------------------------------- mm/vmstat.c | 4 - 5 files changed, 240 insertions(+), 411 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 9d36ff13c954..3234301c2537 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -15,6 +15,14 @@ extern bool numa_demotion_enabled; int node_create_and_set_memory_tier(int node, int tier); +#ifdef CONFIG_MIGRATION +int next_demotion_node(int node); +#else +static inline int next_demotion_node(int node) +{ + return NUMA_NO_NODE; +} +#endif #else @@ -23,5 +31,10 @@ static inline int node_create_and_set_memory_tier(int node, int tier) { return 0; } + +static inline int next_demotion_node(int node) +{ + return NUMA_NO_NODE; +} #endif /* CONFIG_NUMA */ #endif /* _LINUX_MEMORY_TIERS_H */ diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 43e737215f33..93fab62e6548 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -75,19
+75,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #endif /* CONFIG_MIGRATION */ -#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA) -extern void set_migration_target_nodes(void); -extern void migrate_on_reclaim_init(void); -extern int next_demotion_node(int node); -#else -static inline void set_migration_target_nodes(void) {} -static inline void migrate_on_reclaim_init(void) {} -static inline int next_demotion_node(int node) -{ - return NUMA_NO_NODE; -} -#endif - #ifdef CONFIG_COMPACTION extern int PageMovable(struct page *page); extern void __SetPageMovable(struct page *page, struct address_space *mapping); diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 2147112981a6..0596f0b11065 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -6,16 +6,85 @@ #include #include #include +#include #include +#include "internal.h" + struct memory_tier { struct list_head list; nodemask_t nodelist; int id; }; +struct demotion_nodes { + nodemask_t preferred; +}; + +static void establish_migration_targets(void); static DEFINE_MUTEX(memory_tier_lock); static LIST_HEAD(memory_tiers); +/* + * node_demotion[] examples: + * + * Example 1: + * + * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes. + * + * node distances: + * node 0 1 2 3 + * 0 10 20 30 40 + * 1 20 10 40 30 + * 2 30 40 10 40 + * 3 40 30 40 10 + * + * memory_tiers[0] = + * memory_tiers[1] = 0-1 + * memory_tiers[2] = 2-3 + * + * node_demotion[0].preferred = 2 + * node_demotion[1].preferred = 3 + * node_demotion[2].preferred = + * node_demotion[3].preferred = + * + * Example 2: + * + * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node. 
+ * + * node distances: + * node 0 1 2 + * 0 10 20 30 + * 1 20 10 30 + * 2 30 30 10 + * + * memory_tiers[0] = + * memory_tiers[1] = 0-2 + * memory_tiers[2] = + * + * node_demotion[0].preferred = + * node_demotion[1].preferred = + * node_demotion[2].preferred = + * + * Example 3: + * + * Node 0 is a CPU + DRAM node, node 1 is an HBM node, node 2 is a PMEM node. + * + * node distances: + * node 0 1 2 + * 0 10 20 30 + * 1 20 10 40 + * 2 30 40 10 + * + * memory_tiers[0] = 1 + * memory_tiers[1] = 0 + * memory_tiers[2] = 2 + * + * node_demotion[0].preferred = 2 + * node_demotion[1].preferred = 0 + * node_demotion[2].preferred = + * + */ +static struct demotion_nodes *node_demotion __read_mostly; static void insert_memory_tier(struct memory_tier *memtier) { @@ -108,6 +177,7 @@ int node_create_and_set_memory_tier(int node, int tier) current_tier = __node_get_memory_tier(node); if (!current_tier) { ret = __node_create_and_set_memory_tier(node, tier); + establish_migration_targets(); goto out; } @@ -124,6 +194,8 @@ int node_create_and_set_memory_tier(int node, int tier) } if (nodes_empty(current_tier->nodelist)) unregister_memory_tier(current_tier); + + establish_migration_targets(); out: mutex_unlock(&memory_tier_lock); @@ -153,14 +225,152 @@ static int node_set_memory_tier(int node, int tier) mutex_lock(&memory_tier_lock); memtier = __node_get_memory_tier(node); + /* + * If the node is already part of a tier, proceed with the + * current tier value, because we might want to establish + * new migration paths now. The node might be added to a tier + * before it was made part of N_MEMORY, hence establish_migration_targets() + * will have skipped this node.
+ */ if (!memtier) ret = __node_set_memory_tier(node, tier); + establish_migration_targets(); mutex_unlock(&memory_tier_lock); return ret; } +#ifdef CONFIG_MIGRATION +/** + * next_demotion_node() - Get the next node in the demotion path + * @node: The starting node to lookup the next node + * + * Return: node id for next memory node in the demotion path hierarchy + * from @node; NUMA_NO_NODE if @node is terminal. This does not keep + * @node online or guarantee that it *continues* to be the next demotion + * target. + */ +int next_demotion_node(int node) +{ + struct demotion_nodes *nd; + int target; + + if (!node_demotion) + return NUMA_NO_NODE; + + nd = &node_demotion[node]; + + /* + * node_demotion[] is updated without excluding this + * function from running. + * + * Make sure to use RCU over entire code blocks if + * node_demotion[] reads need to be consistent. + */ + rcu_read_lock(); + /* + * If there are multiple target nodes, just select one + * target node randomly. + * + * In addition, we can also use round-robin to select + * target node, but we should introduce another variable + * for node_demotion[] to record last selected target node, + * that may cause cache ping-pong due to the changing of + * last target node. Or introducing per-cpu data to avoid + * caching issue, which seems more complicated. So selecting + * target node randomly seems better until now. + */ + target = node_random(&nd->preferred); + rcu_read_unlock(); + + return target; +} + +/* Disable reclaim-based migration. */ +static void __disable_all_migrate_targets(void) +{ + int node; + + for_each_node_state(node, N_MEMORY) + node_demotion[node].preferred = NODE_MASK_NONE; +} + +static void disable_all_migrate_targets(void) +{ + __disable_all_migrate_targets(); + + /* + * Ensure that the "disable" is visible across the system. + * Readers will see either a combination of before+disable + * state or disable+after. They will never see before and + * after state together. 
+ */ + synchronize_rcu(); +} +#else +static void disable_all_migrate_targets(void) {} +#endif + +/* + * Find an automatic demotion target for all memory + * nodes. Failing here is OK. It might just indicate + * being at the end of a chain. + */ +static void establish_migration_targets(void) +{ + struct memory_tier *memtier; + struct demotion_nodes *nd; + int target = NUMA_NO_NODE, node; + int distance, best_distance; + nodemask_t used; + + if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION)) + return; + + disable_all_migrate_targets(); + + for_each_node_state(node, N_MEMORY) { + best_distance = -1; + nd = &node_demotion[node]; + + memtier = __node_get_memory_tier(node); + if (!memtier || list_is_last(&memtier->list, &memory_tiers)) + continue; + /* + * Get the next memtier to find the demotion node list. + */ + memtier = list_next_entry(memtier, list); + + /* + * find_next_best_node, use 'used' nodemask as a skip list. + * Add all memory nodes except the selected memory tier + * nodelist to skip list so that we find the best node from the + * memtier nodelist. + */ + nodes_andnot(used, node_states[N_MEMORY], memtier->nodelist); + + /* + * Find all the nodes in the memory tier node list of same best distance. + * add them to the preferred mask. We randomly select between nodes + * in the preferred mask when allocating pages during demotion. 
+ */ + do { + target = find_next_best_node(node, &used); + if (target == NUMA_NO_NODE) + break; + + distance = node_distance(node, target); + if (distance == best_distance || best_distance == -1) { + best_distance = distance; + node_set(target, nd->preferred); + } else { + break; + } + } while (1); + } +} + static unsigned int default_memtier = DEFAULT_MEMORY_TIER; core_param(default_memory_tier, default_memtier, uint, 0644); /* @@ -181,6 +391,17 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, return notifier_from_errno(0); switch (action) { + case MEM_OFFLINE: + /* + * In case we are moving out of N_MEMORY. Keep the node + * in the memory tier so that when we bring memory online, + * they appear in the right memory tier. We still need + * to rebuild the demotion order. + */ + mutex_lock(&memory_tier_lock); + establish_migration_targets(); + mutex_unlock(&memory_tier_lock); + break; case MEM_ONLINE: /* * We ignore the error here, if the node already have the tier @@ -196,6 +417,12 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, static void __init migrate_on_reclaim_init(void) { + + if (IS_ENABLED(CONFIG_MIGRATION)) { + node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes), + GFP_KERNEL); + WARN_ON(!node_demotion); + } hotplug_memory_notifier(migrate_on_reclaim_callback, 100); } diff --git a/mm/migrate.c b/mm/migrate.c index fce7d4a9e940..c758c9c21d7d 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2117,398 +2117,4 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, return 0; } #endif /* CONFIG_NUMA_BALANCING */ - -/* - * node_demotion[] example: - * - * Consider a system with two sockets. Each socket has - * three classes of memory attached: fast, medium and slow. - * Each memory class is placed in its own NUMA node. The - * CPUs are placed in the node with the "fast" memory. 
The - * 6 NUMA nodes (0-5) might be split among the sockets like - * this: - * - * Socket A: 0, 1, 2 - * Socket B: 3, 4, 5 - * - * When Node 0 fills up, its memory should be migrated to - * Node 1. When Node 1 fills up, it should be migrated to - * Node 2. The migration path start on the nodes with the - * processors (since allocations default to this node) and - * fast memory, progress through medium and end with the - * slow memory: - * - * 0 -> 1 -> 2 -> stop - * 3 -> 4 -> 5 -> stop - * - * This is represented in the node_demotion[] like this: - * - * { nr=1, nodes[0]=1 }, // Node 0 migrates to 1 - * { nr=1, nodes[0]=2 }, // Node 1 migrates to 2 - * { nr=0, nodes[0]=-1 }, // Node 2 does not migrate - * { nr=1, nodes[0]=4 }, // Node 3 migrates to 4 - * { nr=1, nodes[0]=5 }, // Node 4 migrates to 5 - * { nr=0, nodes[0]=-1 }, // Node 5 does not migrate - * - * Moreover some systems may have multiple slow memory nodes. - * Suppose a system has one socket with 3 memory nodes, node 0 - * is fast memory type, and node 1/2 both are slow memory - * type, and the distance between fast memory node and slow - * memory node is same. So the migration path should be: - * - * 0 -> 1/2 -> stop - * - * This is represented in the node_demotion[] like this: - * { nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2 - * { nr=0, nodes[0]=-1, }, // Node 1 dose not migrate - * { nr=0, nodes[0]=-1, }, // Node 2 does not migrate - */ - -/* - * Writes to this array occur without locking. Cycles are - * not allowed: Node X demotes to Y which demotes to X... - * - * If multiple reads are performed, a single rcu_read_lock() - * must be held over all reads to ensure that no cycles are - * observed. 
- */ -#define DEFAULT_DEMOTION_TARGET_NODES 15 - -#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES -#define DEMOTION_TARGET_NODES (MAX_NUMNODES - 1) -#else -#define DEMOTION_TARGET_NODES DEFAULT_DEMOTION_TARGET_NODES -#endif - -struct demotion_nodes { - unsigned short nr; - short nodes[DEMOTION_TARGET_NODES]; -}; - -static struct demotion_nodes *node_demotion __read_mostly; - -/** - * next_demotion_node() - Get the next node in the demotion path - * @node: The starting node to lookup the next node - * - * Return: node id for next memory node in the demotion path hierarchy - * from @node; NUMA_NO_NODE if @node is terminal. This does not keep - * @node online or guarantee that it *continues* to be the next demotion - * target. - */ -int next_demotion_node(int node) -{ - struct demotion_nodes *nd; - unsigned short target_nr, index; - int target; - - if (!node_demotion) - return NUMA_NO_NODE; - - nd = &node_demotion[node]; - - /* - * node_demotion[] is updated without excluding this - * function from running. RCU doesn't provide any - * compiler barriers, so the READ_ONCE() is required - * to avoid compiler reordering or read merging. - * - * Make sure to use RCU over entire code blocks if - * node_demotion[] reads need to be consistent. - */ - rcu_read_lock(); - target_nr = READ_ONCE(nd->nr); - - switch (target_nr) { - case 0: - target = NUMA_NO_NODE; - goto out; - case 1: - index = 0; - break; - default: - /* - * If there are multiple target nodes, just select one - * target node randomly. - * - * In addition, we can also use round-robin to select - * target node, but we should introduce another variable - * for node_demotion[] to record last selected target node, - * that may cause cache ping-pong due to the changing of - * last target node. Or introducing per-cpu data to avoid - * caching issue, which seems more complicated. So selecting - * target node randomly seems better until now. 
- */ - index = get_random_int() % target_nr; - break; - } - - target = READ_ONCE(nd->nodes[index]); - -out: - rcu_read_unlock(); - return target; -} - -/* Disable reclaim-based migration. */ -static void __disable_all_migrate_targets(void) -{ - int node, i; - - if (!node_demotion) - return; - - for_each_online_node(node) { - node_demotion[node].nr = 0; - for (i = 0; i < DEMOTION_TARGET_NODES; i++) - node_demotion[node].nodes[i] = NUMA_NO_NODE; - } -} - -static void disable_all_migrate_targets(void) -{ - __disable_all_migrate_targets(); - - /* - * Ensure that the "disable" is visible across the system. - * Readers will see either a combination of before+disable - * state or disable+after. They will never see before and - * after state together. - * - * The before+after state together might have cycles and - * could cause readers to do things like loop until this - * function finishes. This ensures they can only see a - * single "bad" read and would, for instance, only loop - * once. - */ - synchronize_rcu(); -} - -/* - * Find an automatic demotion target for 'node'. - * Failing here is OK. It might just indicate - * being at the end of a chain. - */ -static int establish_migrate_target(int node, nodemask_t *used, - int best_distance) -{ - int migration_target, index, val; - struct demotion_nodes *nd; - - if (!node_demotion) - return NUMA_NO_NODE; - - nd = &node_demotion[node]; - - migration_target = find_next_best_node(node, used); - if (migration_target == NUMA_NO_NODE) - return NUMA_NO_NODE; - - /* - * If the node has been set a migration target node before, - * which means it's the best distance between them. Still - * check if this node can be demoted to other target nodes - * if they have a same best distance. 
- */ - if (best_distance != -1) { - val = node_distance(node, migration_target); - if (val > best_distance) - goto out_clear; - } - - index = nd->nr; - if (WARN_ONCE(index >= DEMOTION_TARGET_NODES, - "Exceeds maximum demotion target nodes\n")) - goto out_clear; - - nd->nodes[index] = migration_target; - nd->nr++; - - return migration_target; -out_clear: - node_clear(migration_target, *used); - return NUMA_NO_NODE; -} - -/* - * When memory fills up on a node, memory contents can be - * automatically migrated to another node instead of - * discarded at reclaim. - * - * Establish a "migration path" which will start at nodes - * with CPUs and will follow the priorities used to build the - * page allocator zonelists. - * - * The difference here is that cycles must be avoided. If - * node0 migrates to node1, then neither node1, nor anything - * node1 migrates to can migrate to node0. Also one node can - * be migrated to multiple nodes if the target nodes all have - * a same best-distance against the source node. - * - * This function can run simultaneously with readers of - * node_demotion[]. However, it can not run simultaneously - * with itself. Exclusion is provided by memory hotplug events - * being single-threaded. - */ -static void __set_migration_target_nodes(void) -{ - nodemask_t next_pass; - nodemask_t this_pass; - nodemask_t used_targets = NODE_MASK_NONE; - int node, best_distance; - - /* - * Avoid any oddities like cycles that could occur - * from changes in the topology. This will leave - * a momentary gap when migration is disabled. - */ - disable_all_migrate_targets(); - - /* - * Allocations go close to CPUs, first. Assume that - * the migration path starts at the nodes with CPUs. - */ - next_pass = node_states[N_CPU]; -again: - this_pass = next_pass; - next_pass = NODE_MASK_NONE; - /* - * To avoid cycles in the migration "graph", ensure - * that migration sources are not future targets by - * setting them in 'used_targets'. 
Do this only - * once per pass so that multiple source nodes can - * share a target node. - * - * 'used_targets' will become unavailable in future - * passes. This limits some opportunities for - * multiple source nodes to share a destination. - */ - nodes_or(used_targets, used_targets, this_pass); - - for_each_node_mask(node, this_pass) { - best_distance = -1; - - /* - * Try to set up the migration path for the node, and the target - * migration nodes can be multiple, so doing a loop to find all - * the target nodes if they all have a best node distance. - */ - do { - int target_node = - establish_migrate_target(node, &used_targets, - best_distance); - - if (target_node == NUMA_NO_NODE) - break; - - if (best_distance == -1) - best_distance = node_distance(node, target_node); - - /* - * Visit targets from this pass in the next pass. - * Eventually, every node will have been part of - * a pass, and will become set in 'used_targets'. - */ - node_set(target_node, next_pass); - } while (1); - } - /* - * 'next_pass' contains nodes which became migration - * targets in this pass. Make additional passes until - * no more migrations targets are available. - */ - if (!nodes_empty(next_pass)) - goto again; -} - -/* - * For callers that do not hold get_online_mems() already. - */ -void set_migration_target_nodes(void) -{ - get_online_mems(); - __set_migration_target_nodes(); - put_online_mems(); -} - -/* - * This leaves migrate-on-reclaim transiently disabled between - * the MEM_GOING_OFFLINE and MEM_OFFLINE events. This runs - * whether reclaim-based migration is enabled or not, which - * ensures that the user can turn reclaim-based migration at - * any time without needing to recalculate migration targets. - * - * These callbacks already hold get_online_mems(). That is why - * __set_migration_target_nodes() can be used as opposed to - * set_migration_target_nodes(). 
- */ -#ifdef CONFIG_MEMORY_HOTPLUG -static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, - unsigned long action, void *_arg) -{ - struct memory_notify *arg = _arg; - - /* - * Only update the node migration order when a node is - * changing status, like online->offline. This avoids - * the overhead of synchronize_rcu() in most cases. - */ - if (arg->status_change_nid < 0) - return notifier_from_errno(0); - - switch (action) { - case MEM_GOING_OFFLINE: - /* - * Make sure there are not transient states where - * an offline node is a migration target. This - * will leave migration disabled until the offline - * completes and the MEM_OFFLINE case below runs. - */ - disable_all_migrate_targets(); - break; - case MEM_OFFLINE: - case MEM_ONLINE: - /* - * Recalculate the target nodes once the node - * reaches its final state (online or offline). - */ - __set_migration_target_nodes(); - break; - case MEM_CANCEL_OFFLINE: - /* - * MEM_GOING_OFFLINE disabled all the migration - * targets. Reenable them. - */ - __set_migration_target_nodes(); - break; - case MEM_GOING_ONLINE: - case MEM_CANCEL_ONLINE: - break; - } - - return notifier_from_errno(0); -} -#endif - -void __init migrate_on_reclaim_init(void) -{ - node_demotion = kcalloc(nr_node_ids, - sizeof(struct demotion_nodes), - GFP_KERNEL); - WARN_ON(!node_demotion); -#ifdef CONFIG_MEMORY_HOTPLUG - hotplug_memory_notifier(migrate_on_reclaim_callback, 100); -#endif - /* - * At this point, all numa nodes with memory/CPus have their state - * properly set, so we can build the demotion order now. - * Let us hold the cpu_hotplug lock just, as we could possibily have - * CPU hotplug events during boot. 
- */ - cpus_read_lock(); - set_migration_target_nodes(); - cpus_read_unlock(); -} #endif /* CONFIG_NUMA */ - - diff --git a/mm/vmstat.c b/mm/vmstat.c index 373d2730fcf2..35c6ff97cf29 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -28,7 +28,6 @@ #include #include #include -#include #include "internal.h" @@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu) if (!node_state(cpu_to_node(cpu), N_CPU)) { node_set_state(cpu_to_node(cpu), N_CPU); - set_migration_target_nodes(); } return 0; @@ -2085,7 +2083,6 @@ static int vmstat_cpu_dead(unsigned int cpu) return 0; node_clear_state(node, N_CPU); - set_migration_target_nodes(); return 0; } @@ -2118,7 +2115,6 @@ void __init init_mm_internals(void) start_shepherd_timer(); #endif - migrate_on_reclaim_init(); #ifdef CONFIG_PROC_FS proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op); proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op); From patchwork Mon Jul 4 07:06:06 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12904788 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9C3C2C43334 for ; Mon, 4 Jul 2022 07:41:48 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 049436B0072; Mon, 4 Jul 2022 03:41:48 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id F3B5C6B0073; Mon, 4 Jul 2022 03:41:47 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id DDBA58E0001; Mon, 4 Jul 2022 03:41:47 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id CAECF6B0072 for ; Mon, 4 Jul 2022 03:41:47 -0400 (EDT) Received: from 
smtpin30.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id A6FD21DFD for ; Mon, 4 Jul 2022 07:08:29 +0000 (UTC) X-FDA: 79648539180.30.9159DE1 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf12.hostedemail.com (Postfix) with ESMTP id 028004006B for ; Mon, 4 Jul 2022 07:08:28 +0000 (UTC) Received: from pps.filterd (m0187473.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 26476HC6010034; Mon, 4 Jul 2022 07:08:23 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=yVE3XSpytCkYQ52S13vmXw5vKPuMgBybUJqrbPR+cPc=; b=CSa0Z4xm4/PK45aUXNYZ4BYCNrLUmrHpebghlKM3YAI9/PWi4Iy/UTB1uwV6Sx3Fv7Z7 PrXYfCaTSIFXhh+cDs1MxaHEGXDPmo2bLcKy253OaPEqG1KlrQfcOVqBs4ILkt7miaX5 Xcxq2+NhR85Kx9i585XNei8TRPrZJIs3/z/dIUA47R5n72/VvHhA9n5UNmhAQ4Ahfa8D aY7cLzjJ2Z22n0Zv6Pf6VRGgOdYagyjkUBJ5CUvSvvYIBj9Li10jZh/g+5u7ETfhtO5f s6qcGMk2pVQcce6HSNNrCCPxYXPOOtj4iopXSBlIbMO7r3l97h7oxgUM5uX3PM/sI5/o 5Q== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3h3n2kyhwu-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:08:22 +0000 Received: from m0187473.ppops.net (m0187473.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 26478MAK027899; Mon, 4 Jul 2022 07:08:22 GMT Received: from ppma01wdc.us.ibm.com (fd.55.37a9.ip4.static.sl-reverse.com [169.55.85.253]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3h3n2kyhw6-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:08:22 +0000 Received: from pps.filterd (ppma01wdc.us.ibm.com [127.0.0.1]) by ppma01wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 26475Htp031799; Mon, 4 Jul 2022 07:08:20 GMT Received: from 
b01cxnp22035.gho.pok.ibm.com (b01cxnp22035.gho.pok.ibm.com [9.57.198.25]) by ppma01wdc.us.ibm.com with ESMTP id 3h2dn8u52m-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:08:20 +0000 Received: from b01ledav002.gho.pok.ibm.com (b01ledav002.gho.pok.ibm.com [9.57.199.107]) by b01cxnp22035.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 26478KKW36045160 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 4 Jul 2022 07:08:20 GMT Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 5C5BE124054; Mon, 4 Jul 2022 07:08:20 +0000 (GMT) Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id C4873124053; Mon, 4 Jul 2022 07:08:13 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.74.198]) by b01ledav002.gho.pok.ibm.com (Postfix) with ESMTP; Mon, 4 Jul 2022 07:08:13 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V" Subject: [PATCH v8 06/12] mm/demotion: Expose memory tier details via sysfs Date: Mon, 4 Jul 2022 12:36:06 +0530 Message-Id: <20220704070612.299585-7-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com> References: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: t15EEGhTStporIZ46pRCd8UUhn2OP7ey X-Proofpoint-ORIG-GUID: 5jJhGHHY6PJ8rPVZzOO5qwyitj-mfw0t X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-07-04_05,2022-06-28_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam 
policy=outbound score=0 clxscore=1015 impostorscore=0 bulkscore=0 lowpriorityscore=0 suspectscore=0 spamscore=0 mlxscore=0 priorityscore=1501 phishscore=0 mlxlogscore=999 adultscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2207040030 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1656918509; a=rsa-sha256; cv=none; b=QyGiilUBx5La/K2tRvoMvRkppsEE5W6mPdvYXh1VSj1Kh9OHwSFNBf4iC2uED2eoEbXVuN kTRMBfZdvZW0To7V3wHJoQ0flCtI05SunuuO/PFL0+yeTs5GR+7T3K37AqYp6j5K9lAsl2 J83CUzk8xvZ1hhUJSElSNak8ue1uEFQ= ARC-Authentication-Results: i=1; imf12.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=CSa0Z4xm; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf12.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1656918509; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=yVE3XSpytCkYQ52S13vmXw5vKPuMgBybUJqrbPR+cPc=; b=bOekZh8zrBi+Ic/K3rAzabJ0ICCgW0Z5pKPdot18OtGOoNvFoab3RrYV5IArpyaI3bLZXQ JPfJFcka/CRZjvkbWO2EfMDNW8xLu1SrVsbFXArsZxOuwFoEJ1O9XYIgqanpxiqRu/JARu bjF7QOqr3W6chcR51NgqGHxyfP/+VUc= Authentication-Results: imf12.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=CSa0Z4xm; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf12.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: 028004006B X-Rspam-User: X-Stat-Signature: cdw6pihohj6xxk3aipb75f6mxx4giksu X-HE-Tag: 1656918508-378185 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: 
owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This patch adds /sys/devices/system/memtier/ where all memory tier related details can be found. All created memory tiers will be listed there as /sys/devices/system/memtier/memtierN/ The nodes which are part of a specific memory tier can be listed via /sys/devices/system/memtier/memtierN/nodelist /sys/devices/system/memtier/max_tier shows the max tier ID value supported. /sys/devices/system/memtier/default_tier shows the memory tier to which NUMA nodes get added by default if not assigned a specific memory tier. Signed-off-by: Aneesh Kumar K.V --- mm/memory-tiers.c | 93 ++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 87 insertions(+), 6 deletions(-) diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 0596f0b11065..4acf7570ae1b 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -13,14 +13,15 @@ struct memory_tier { struct list_head list; + struct device dev; nodemask_t nodelist; - int id; }; struct demotion_nodes { nodemask_t preferred; }; +#define to_memory_tier(device) container_of(device, struct memory_tier, dev) static void establish_migration_targets(void); static DEFINE_MUTEX(memory_tier_lock); static LIST_HEAD(memory_tiers); @@ -86,6 +87,42 @@ static LIST_HEAD(memory_tiers); */ static struct demotion_nodes *node_demotion __read_mostly; +static struct bus_type memory_tier_subsys = { + .name = "memtier", + .dev_name = "memtier", +}; + +static ssize_t nodelist_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct memory_tier *memtier = to_memory_tier(dev); + + return sysfs_emit(buf, "%*pbl\n", + nodemask_pr_args(&memtier->nodelist)); +} +static DEVICE_ATTR_RO(nodelist); + +static struct attribute *memory_tier_dev_attrs[] = { + &dev_attr_nodelist.attr, + NULL +}; + +static const struct attribute_group memory_tier_dev_group = { + .attrs = memory_tier_dev_attrs, +}; + +static const struct attribute_group 
*memory_tier_dev_groups[] = { + &memory_tier_dev_group, + NULL +}; + +static void memory_tier_device_release(struct device *dev) +{ + struct memory_tier *tier = to_memory_tier(dev); + + kfree(tier); +} + static void insert_memory_tier(struct memory_tier *memtier) { struct list_head *ent; @@ -95,7 +132,7 @@ static void insert_memory_tier(struct memory_tier *memtier) list_for_each(ent, &memory_tiers) { tmp_memtier = list_entry(ent, struct memory_tier, list); - if (tmp_memtier->id < memtier->id) { + if (tmp_memtier->dev.id < memtier->dev.id) { list_add_tail(&memtier->list, ent); return; } @@ -105,6 +142,7 @@ static void insert_memory_tier(struct memory_tier *memtier) static struct memory_tier *register_memory_tier(unsigned int tier) { + int error; struct memory_tier *memtier; if (tier > MAX_MEMORY_TIER_ID) @@ -114,17 +152,26 @@ static struct memory_tier *register_memory_tier(unsigned int tier) if (!memtier) return ERR_PTR(-ENOMEM); - memtier->id = tier; + memtier->dev.id = tier; + memtier->dev.bus = &memory_tier_subsys; + memtier->dev.release = memory_tier_device_release; + memtier->dev.groups = memory_tier_dev_groups; insert_memory_tier(memtier); + error = device_register(&memtier->dev); + if (error) { + list_del(&memtier->list); + put_device(&memtier->dev); + return ERR_PTR(error); + } return memtier; } static void unregister_memory_tier(struct memory_tier *memtier) { list_del(&memtier->list); - kfree(memtier); + device_unregister(&memtier->dev); } static struct memory_tier *__node_get_memory_tier(int node) @@ -143,7 +190,7 @@ static struct memory_tier *__get_memory_tier_from_id(int id) struct memory_tier *memtier; list_for_each_entry(memtier, &memory_tiers, list) { - if (memtier->id == id) + if (memtier->dev.id == id) return memtier; } return NULL; @@ -181,7 +228,7 @@ int node_create_and_set_memory_tier(int node, int tier) goto out; } - if (current_tier->id == tier) + if (current_tier->dev.id == tier) goto out; node_clear(node, current_tier->nodelist); @@ -426,10 
+473,44 @@ static void __init migrate_on_reclaim_init(void) hotplug_memory_notifier(migrate_on_reclaim_callback, 100); } +static ssize_t +max_tier_show(struct device *dev, struct device_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIER_ID); +} +static DEVICE_ATTR_RO(max_tier); + +static ssize_t +default_tier_show(struct device *dev, struct device_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "memtier%d\n", default_memtier); +} +static DEVICE_ATTR_RO(default_tier); + +static struct attribute *memory_tier_attrs[] = { + &dev_attr_max_tier.attr, + &dev_attr_default_tier.attr, + NULL +}; + +static const struct attribute_group memory_tier_attr_group = { + .attrs = memory_tier_attrs, +}; + +static const struct attribute_group *memory_tier_attr_groups[] = { + &memory_tier_attr_group, + NULL, +}; + static int __init memory_tier_init(void) { + int ret; struct memory_tier *memtier; + ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups); + if (ret) + pr_err("%s() failed to register subsystem: %d\n", __func__, ret); + /* * Register only default memory tier to hide all empty * memory tier from sysfs. 
Since this is early during From patchwork Mon Jul 4 07:06:07 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12904772 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C5C53C43334 for ; Mon, 4 Jul 2022 07:11:09 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 67E5D6B0075; Mon, 4 Jul 2022 03:11:09 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 62E2C900003; Mon, 4 Jul 2022 03:11:09 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 4F5CC900002; Mon, 4 Jul 2022 03:11:09 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 3E0A76B0075 for ; Mon, 4 Jul 2022 03:11:09 -0400 (EDT) Received: from smtpin28.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 3727721FF6 for ; Mon, 4 Jul 2022 07:08:42 +0000 (UTC) X-FDA: 79648539726.28.BE7A1ED Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf13.hostedemail.com (Postfix) with ESMTP id C5AA120006 for ; Mon, 4 Jul 2022 07:08:39 +0000 (UTC) Received: from pps.filterd (m0098399.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 2646Lv1w021158; Mon, 4 Jul 2022 07:08:31 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=bUKtfnF5lp5n6saH/VnpP793cWjuWytUL0WlkyT0+FI=; b=KPrDzA4I9odNsEc1Xai0mj5c+XdP9lCGFvxtQDtSGfokHBAsdR/mOgOD7hMbPS5wjiuk 
a9GdtaecKLb2vWbPc92h+vyYDhEc55EBjAxc4zQfcg0UjhGnwFb+BWPaYbx64w3RzVBu bbqhF3sWIUMZx5g/MnqPulA2wM9nbg6BcOZAHAgZ2hMYYB8xk84IA+wMK85vukZO+FWk JQD4KpnIORLJN5TmIjWk4pgMH0Foic4sIoC/jXgzC2mugVY+JNMFDq386YtKNl9JWtcw mvfEMQ2OQut0utvsNkGFiR3KW5+5FgTpPc5ScmMLGA4jHDtKfo+791GukZLSfTUcB7+v HQ== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3h3tva10xp-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:08:30 +0000 Received: from m0098399.ppops.net (m0098399.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 2646jZ3I028460; Mon, 4 Jul 2022 07:08:30 GMT Received: from ppma02dal.us.ibm.com (a.bd.3ea9.ip4.static.sl-reverse.com [169.62.189.10]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3h3tva10xa-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:08:30 +0000 Received: from pps.filterd (ppma02dal.us.ibm.com [127.0.0.1]) by ppma02dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 26476Vpd001017; Mon, 4 Jul 2022 07:08:29 GMT Received: from b01cxnp22033.gho.pok.ibm.com (b01cxnp22033.gho.pok.ibm.com [9.57.198.23]) by ppma02dal.us.ibm.com with ESMTP id 3h2dn9haam-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:08:29 +0000 Received: from b01ledav002.gho.pok.ibm.com (b01ledav002.gho.pok.ibm.com [9.57.199.107]) by b01cxnp22033.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 26478SwS25493806 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 4 Jul 2022 07:08:28 GMT Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id E84AC124066; Mon, 4 Jul 2022 07:08:27 +0000 (GMT) Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 12592124054; Mon, 4 Jul 2022 07:08:21 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.74.198]) by 
b01ledav002.gho.pok.ibm.com (Postfix) with ESMTP; Mon, 4 Jul 2022 07:08:20 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V" , Jagdish Gediya Subject: [PATCH v8 07/12] mm/demotion: Add per node memory tier attribute to sysfs Date: Mon, 4 Jul 2022 12:36:07 +0530 Message-Id: <20220704070612.299585-8-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com> References: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: aOEW1KYLC8uCLmkP_4ZdQzfdNl6ASTLn X-Proofpoint-ORIG-GUID: U3SypYztW-2nITouAoURAKEE9FyUUbUc X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-07-04_05,2022-06-28_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 mlxscore=0 suspectscore=0 lowpriorityscore=0 adultscore=0 mlxlogscore=999 bulkscore=0 spamscore=0 impostorscore=0 malwarescore=0 clxscore=1015 phishscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2207040030 ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1656918520; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=bUKtfnF5lp5n6saH/VnpP793cWjuWytUL0WlkyT0+FI=; b=ilm56v7xvhuLXC5xBAxnm6LgVnplfl0AO5OGfwajeJXsRyhhLzTjV0Qcb9NSRnPpiNoxoq 0p+O9EM6vhMLsQDqMdoiqCpk2aPXQT/auHM1YIbR+P7XaWn4xmI8djzBe5nNnuntVdshIY 
/nkaKLJPMRm3IsVmRWETcQu3vkDd3oM= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1656918520; a=rsa-sha256; cv=none; b=ehhncfgS9/8J9PmKgE1bDRf0pf1sOKmY8OAh02bgmcF+qs9B0ujUo3E4e+s0zQ7bUI3Xrx wk3vYM2x0pCupSDO2eZq2mTl7+egzuVAQrIKQrQ4WPhuPbdFIwe+hEb+UFrADWWCvUwAj7 YmR93hzASv6+kGTiTspDlKYInr7qq8I= ARC-Authentication-Results: i=1; imf13.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=KPrDzA4I; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf13.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Stat-Signature: fmrw3e768qmdwh69zu3aenr39ha4f58f X-Rspamd-Queue-Id: C5AA120006 X-Rspam-User: Authentication-Results: imf13.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=KPrDzA4I; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf13.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspamd-Server: rspam06 X-HE-Tag: 1656918519-9553 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Add support to modify the memory tier for a NUMA node. /sys/devices/system/node/nodeN/memtier where N = node id When read, it lists the memory tier that the node belongs to. When written, the kernel moves the node into the specified memory tier; the tier assignments of all other nodes are not affected. If the memory tier does not exist, it is created.
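The read and write behaviour described above could be exercised from user space roughly as follows. This is only a sketch and is not part of the patch: the helper names memtier_path(), memtier_read() and memtier_write() are hypothetical, and the sysfs root is passed in as a parameter (normally "/sys") purely for illustration. It assumes the attribute reads back as a plain decimal tier ID, and as nothing at all for a CPU-only node.

```c
#include <stdio.h>

/*
 * Hypothetical helper: build the per-node attribute path under
 * sysfs_root (normally "/sys"). Returns the path length, or -1 if
 * the buffer is too small.
 */
static int memtier_path(char *buf, size_t len, const char *sysfs_root, int node)
{
	int n = snprintf(buf, len, "%s/devices/system/node/node%d/memtier",
			 sysfs_root, node);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Hypothetical helper: read the tier ID of a node.
 * Returns 0 on success, 1 if the attribute yielded no data (assumed
 * here to be the CPU-only node case), -1 on any other failure.
 */
static int memtier_read(const char *path, int *tier)
{
	FILE *f = fopen(path, "r");
	int rc;

	if (!f)
		return -1;
	rc = fscanf(f, "%d", tier);
	fclose(f);
	if (rc == 1)
		return 0;
	return rc == EOF ? 1 : -1;
}

/*
 * Hypothetical helper: ask the kernel to move the node into 'tier'
 * by writing a decimal number to the attribute.
 * Returns 0 on success, -1 on failure.
 */
static int memtier_write(const char *path, int tier)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", tier) > 0;
	/* Buffered data is flushed at fclose(), so a rejected write shows up here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would typically build the path with memtier_path(buf, sizeof(buf), "/sys", node), call memtier_read() to see the current tier, and call memtier_write() with root privileges to move the node. Any error the kernel returns from the store is reported only as -1 here.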
Suggested-by: Wei Xu Signed-off-by: Jagdish Gediya Signed-off-by: Aneesh Kumar K.V --- drivers/base/node.c | 42 ++++++++++++++++++++++++++++++++++++ include/linux/memory-tiers.h | 2 ++ mm/memory-tiers.c | 42 ++++++++++++++++++++++++++++++++++++ 3 files changed, 86 insertions(+) diff --git a/drivers/base/node.c b/drivers/base/node.c index 0ac6376ef7a1..667f37eecf3a 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -20,6 +20,7 @@ #include #include #include +#include static struct bus_type node_subsys = { .name = "node", @@ -560,11 +561,52 @@ static ssize_t node_read_distance(struct device *dev, } static DEVICE_ATTR(distance, 0444, node_read_distance, NULL); +#ifdef CONFIG_NUMA +static ssize_t memtier_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + int node = dev->id; + int tier_index = node_get_memory_tier_id(node); + + /* + * CPU only NUMA node is not part of memory tiers. + */ + if (tier_index != -1) + return sysfs_emit(buf, "%d\n", tier_index); + return 0; +} + +static ssize_t memtier_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + unsigned long tier; + int node = dev->id; + int ret; + + ret = kstrtoul(buf, 10, &tier); + if (ret) + return ret; + + ret = node_update_memory_tier(node, tier); + if (ret) + return ret; + + return count; +} + +static DEVICE_ATTR_RW(memtier); +#endif + static struct attribute *node_dev_attrs[] = { &dev_attr_meminfo.attr, &dev_attr_numastat.attr, &dev_attr_distance.attr, &dev_attr_vmstat.attr, +#ifdef CONFIG_NUMA + &dev_attr_memtier.attr, +#endif NULL }; diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 3234301c2537..453f6e5d357c 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -23,6 +23,8 @@ static inline int next_demotion_node(int node) return NUMA_NO_NODE; } #endif +int node_get_memory_tier_id(int node); +int node_update_memory_tier(int node, int tier); #else diff --git a/mm/memory-tiers.c 
b/mm/memory-tiers.c index 4acf7570ae1b..b7cb368cb9c0 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -288,6 +288,48 @@ static int node_set_memory_tier(int node, int tier) return ret; } +int node_get_memory_tier_id(int node) +{ + int tier = -1; + struct memory_tier *memtier; + /* + * Make sure memory tier is not unregistered + * while it is being read. + */ + mutex_lock(&memory_tier_lock); + memtier = __node_get_memory_tier(node); + if (memtier) + tier = memtier->dev.id; + mutex_unlock(&memory_tier_lock); + + return tier; +} + +int node_update_memory_tier(int node, int tier) +{ + struct memory_tier *current_tier; + int ret = 0; + + mutex_lock(&memory_tier_lock); + + current_tier = __node_get_memory_tier(node); + if (!current_tier || current_tier->dev.id == tier) + goto out; + + node_clear(node, current_tier->nodelist); + + ret = __node_create_and_set_memory_tier(node, tier); + + if (nodes_empty(current_tier->nodelist)) + unregister_memory_tier(current_tier); + + establish_migration_targets(); +out: + mutex_unlock(&memory_tier_lock); + + return ret; +} + #ifdef CONFIG_MIGRATION /** * next_demotion_node() - Get the next node in the demotion path From patchwork Mon Jul 4 07:06:08 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12904773 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id E8126C43334 for ; Mon, 4 Jul 2022 07:11:24 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 8A0B66B0075; Mon, 4 Jul 2022 03:11:24 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 850B2900003; Mon, 4 Jul 2022 03:11:24 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 71A5E900002; Mon, 4 Jul 2022 03:11:24 
-0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 63C2C6B0075 for ; Mon, 4 Jul 2022 03:11:24 -0400 (EDT) Received: from smtpin01.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay13.hostedemail.com (Postfix) with ESMTP id 7057561B37 for ; Mon, 4 Jul 2022 07:08:50 +0000 (UTC) X-FDA: 79648540020.01.2876372 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf29.hostedemail.com (Postfix) with ESMTP id A45CE1200B3 for ; Mon, 4 Jul 2022 07:08:49 +0000 (UTC) Received: from pps.filterd (m0098404.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 2646bBEX016228; Mon, 4 Jul 2022 07:08:38 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=ihEfIiqN6xcfrPmWYtWVjteEcOUWklsx8zGbP+bZdjI=; b=obf7n+ROewIxczMjeDUXXeI91387JlbgUBH6VD0tdiQHkaNXpPhEQib5qaI8mk/tjYXC tk2bFj+H4ULxkz1DhQMjHaYh/rjw4VJ4ecTW1h5Oho1zUUc1n87xHeMlc9yxFOtZbwle 2ejP+Gtv00tHYNw+06KTQQKZYNdVib0aAU58G+PiX0D8vMVI5YTqa6TmamwhdiFruJHM E0wM9Xfz+jErrLtvBzKbhvaSjMjNj27Df+EauVgI4ET9Q/Hda8CrOmNnD6KcfSeVCgtR LgBUZoqDZO0l1mM164pU1iMiVi9UMaOEhoIiFr+lD1qkSGEjU7gkhyOaKrjojxaNCx6G Zw== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3h3tvc0u6j-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:08:37 +0000 Received: from m0098404.ppops.net (m0098404.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 2646uhPP022888; Mon, 4 Jul 2022 07:08:37 GMT Received: from ppma03dal.us.ibm.com (b.bd.3ea9.ip4.static.sl-reverse.com [169.62.189.11]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3h3tvc0u62-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 
verify=NOT); Mon, 04 Jul 2022 07:08:37 +0000 Received: from pps.filterd (ppma03dal.us.ibm.com [127.0.0.1]) by ppma03dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 26476A1P007094; Mon, 4 Jul 2022 07:08:36 GMT Received: from b01cxnp22033.gho.pok.ibm.com (b01cxnp22033.gho.pok.ibm.com [9.57.198.23]) by ppma03dal.us.ibm.com with ESMTP id 3h2dn9hau7-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:08:36 +0000 Received: from b01ledav002.gho.pok.ibm.com (b01ledav002.gho.pok.ibm.com [9.57.199.107]) by b01cxnp22033.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 26478ZhW34865504 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 4 Jul 2022 07:08:35 GMT Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 348C312405C; Mon, 4 Jul 2022 07:08:35 +0000 (GMT) Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 85078124058; Mon, 4 Jul 2022 07:08:28 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.74.198]) by b01ledav002.gho.pok.ibm.com (Postfix) with ESMTP; Mon, 4 Jul 2022 07:08:28 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V" Subject: [PATCH v8 08/12] mm/demotion: Add pg_data_t member to track node memory tier details Date: Mon, 4 Jul 2022 12:36:08 +0530 Message-Id: <20220704070612.299585-9-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com> References: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: Jtk3tZHs4dXwEYmPJaxChmM0tzgqVuTJ X-Proofpoint-GUID: 
6ijjkrJYh81HQ6HrLt_jriuGP5eUgHSO X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-07-04_05,2022-06-28_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1015 spamscore=0 impostorscore=0 lowpriorityscore=0 priorityscore=1501 suspectscore=0 bulkscore=0 phishscore=0 mlxscore=0 mlxlogscore=999 malwarescore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2207040030 ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1656918529; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=ihEfIiqN6xcfrPmWYtWVjteEcOUWklsx8zGbP+bZdjI=; b=4/j6qjL8tMltoGAO8FT13fzFARS7nX5ykgT3madbXx1jhAst2RmY1pls7qXABNLTwiYKPQ TkQG/Y/AUMwEkcjFkylcR4vMihZOQlPK6d+2cET0f1qh0Z3dnqCUqWAJk3Ss/7hIcAe/is k1B0lxcLsnvhf2ZMsP7aKTz5mSrBNmI= ARC-Authentication-Results: i=1; imf29.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=obf7n+RO; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf29.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1656918529; a=rsa-sha256; cv=none; b=5da2Wv80sgfRSrN3gUPPcOMhE1GFcTwrQQ0ZkACHYJnQrqlwPrRTfEkGwem39WR+5k6IXJ HaDHTnuPVyU8VCVnkD5g3xWWDZR6SGXK2CY4J5DpUUGViCll9ls4k0673TTKpftPOQnMcJ LZXUWnojU7HWurpJMae4YNrLi+diYew= Authentication-Results: imf29.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=obf7n+RO; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf29.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) 
smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: A45CE1200B3 X-Stat-Signature: t6gox7a6aj8fqu617na88pce15ynrxpc X-Rspam-User: X-HE-Tag: 1656918529-271996 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Also update the helpers to use NODE_DATA()->memtier. Since the node-specific memtier can change when a NUMA node is reassigned to a different memory tier, accessing NODE_DATA()->memtier needs to happen under an RCU read lock or memory_tier_lock. Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 11 ++++ include/linux/mmzone.h | 3 + mm/memory-tiers.c | 104 +++++++++++++++++++++++++---------- 3 files changed, 89 insertions(+), 29 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 453f6e5d357c..705b63ee31d5 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -6,6 +6,9 @@ #ifdef CONFIG_NUMA +#include +#include + #define MEMORY_TIER_HBM_GPU 300 #define MEMORY_TIER_DRAM 200 #define MEMORY_TIER_PMEM 100 @@ -13,6 +16,12 @@ #define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM #define MAX_MEMORY_TIER_ID 400 +struct memory_tier { + struct list_head list; + struct device dev; + nodemask_t nodelist; +}; + extern bool numa_demotion_enabled; int node_create_and_set_memory_tier(int node, int tier); #ifdef CONFIG_MIGRATION @@ -25,6 +34,8 @@ static inline int next_demotion_node(int node) #endif int node_get_memory_tier_id(int node); int node_update_memory_tier(int node, int tier); +struct memory_tier *node_get_memory_tier(int node); +void node_put_memory_tier(struct memory_tier *memtier); #else diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index aab70355d64f..353812495a70 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -928,6 +928,9 @@ typedef struct pglist_data { /* Per-node vmstats */ struct per_cpu_nodestat
__percpu *per_cpu_nodestats; atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS]; +#ifdef CONFIG_NUMA + struct memory_tier __rcu *memtier; +#endif } pg_data_t; #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages) diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index b7cb368cb9c0..6a2476faf13a 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -1,22 +1,15 @@ // SPDX-License-Identifier: GPL-2.0 #include -#include -#include #include #include #include #include #include +#include #include #include "internal.h" -struct memory_tier { - struct list_head list; - struct device dev; - nodemask_t nodelist; -}; - struct demotion_nodes { nodemask_t preferred; }; @@ -120,7 +113,7 @@ static void memory_tier_device_release(struct device *dev) { struct memory_tier *tier = to_memory_tier(dev); - kfree(tier); + kfree_rcu(tier); } static void insert_memory_tier(struct memory_tier *memtier) @@ -176,13 +169,18 @@ static void unregister_memory_tier(struct memory_tier *memtier) static struct memory_tier *__node_get_memory_tier(int node) { - struct memory_tier *memtier; + pg_data_t *pgdat; - list_for_each_entry(memtier, &memory_tiers, list) { - if (node_isset(node, memtier->nodelist)) - return memtier; - } - return NULL; + pgdat = NODE_DATA(node); + if (!pgdat) + return NULL; + /* + * Since we hold memory_tier_lock, we can avoid + * RCU read locks when accessing the details. No + * parallel updates are possible here. + */ + return rcu_dereference_check(pgdat->memtier, + lockdep_is_held(&memory_tier_lock)); } static struct memory_tier *__get_memory_tier_from_id(int id) @@ -196,6 +194,33 @@ static struct memory_tier *__get_memory_tier_from_id(int id) return NULL; } +/* + * Called with memory_tier_lock. Hence the device references cannot + * be dropped during this function. 
+ */ +static void memtier_node_set(int node, struct memory_tier *memtier) +{ + pg_data_t *pgdat; + struct memory_tier *current_memtier; + + pgdat = NODE_DATA(node); + if (!pgdat) + return; + /* + * Make sure we mark the memtier NULL before we assign the new memory tier + * to the NUMA node. This makes sure that anybody looking at NODE_DATA + * finds a NULL memtier or the one which is still valid. + */ + current_memtier = rcu_dereference_check(pgdat->memtier, + lockdep_is_held(&memory_tier_lock)); + rcu_assign_pointer(pgdat->memtier, NULL); + if (current_memtier) + node_clear(node, current_memtier->nodelist); + synchronize_rcu(); + node_set(node, memtier->nodelist); + rcu_assign_pointer(pgdat->memtier, memtier); +} + static int __node_create_and_set_memory_tier(int node, int tier) { int ret = 0; @@ -209,7 +234,7 @@ static int __node_create_and_set_memory_tier(int node, int tier) goto out; } } - node_set(node, memtier->nodelist); + memtier_node_set(node, memtier); out: return ret; } @@ -231,14 +256,7 @@ int node_create_and_set_memory_tier(int node, int tier) if (current_tier->dev.id == tier) goto out; - node_clear(node, current_tier->nodelist); - ret = __node_create_and_set_memory_tier(node, tier); - if (ret) { - /* reset it back to older tier */ - node_set(node, current_tier->nodelist); - goto out; - } if (nodes_empty(current_tier->nodelist)) unregister_memory_tier(current_tier); @@ -260,7 +278,7 @@ static int __node_set_memory_tier(int node, int tier) ret = -EINVAL; goto out; } - node_set(node, memtier->nodelist); + memtier_node_set(node, memtier); out: return ret; } @@ -316,10 +334,7 @@ int node_update_memory_tier(int node, int tier) if (!current_tier || current_tier->dev.id == tier) goto out; - node_clear(node, current_tier->nodelist); - ret = __node_create_and_set_memory_tier(node, tier); - if (nodes_empty(current_tier->nodelist)) unregister_memory_tier(current_tier); @@ -330,6 +345,34 @@ int node_update_memory_tier(int node, int tier) return ret; } +/* + *
lockless access to memory tier of a NUMA node. + */ +struct memory_tier *node_get_memory_tier(int node) +{ + pg_data_t *pgdat; + struct memory_tier *memtier; + + pgdat = NODE_DATA(node); + if (!pgdat) + return NULL; + + rcu_read_lock(); + memtier = rcu_dereference(pgdat->memtier); + if (!memtier) + goto out; + + get_device(&memtier->dev); +out: + rcu_read_unlock(); + return memtier; +} + +void node_put_memory_tier(struct memory_tier *memtier) +{ + put_device(&memtier->dev); +} + #ifdef CONFIG_MIGRATION /** * next_demotion_node() - Get the next node in the demotion path @@ -546,7 +589,7 @@ static const struct attribute_group *memory_tier_attr_groups[] = { static int __init memory_tier_init(void) { - int ret; + int ret, node; struct memory_tier *memtier; ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups); @@ -567,7 +610,10 @@ static int __init memory_tier_init(void) __func__, PTR_ERR(memtier)); /* CPU only nodes are not part of memory tiers. */ - memtier->nodelist = node_states[N_MEMORY]; + for_each_node_state(node, N_MEMORY) { + rcu_assign_pointer(NODE_DATA(node)->memtier, memtier); + node_set(node, memtier->nodelist); + } mutex_unlock(&memory_tier_lock); migrate_on_reclaim_init(); From patchwork Mon Jul 4 07:06:09 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12904771 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6157DC43334 for ; Mon, 4 Jul 2022 07:10:42 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id F30026B0074; Mon, 4 Jul 2022 03:10:41 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id EDF81900003; Mon, 4 Jul 2022 03:10:41 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from 
userid 63042) id DA83C900002; Mon, 4 Jul 2022 03:10:41 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id CBF756B0074 for ; Mon, 4 Jul 2022 03:10:41 -0400 (EDT) Received: from smtpin26.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id DF0AD3612F for ; Mon, 4 Jul 2022 07:08:55 +0000 (UTC) X-FDA: 79648540272.26.2591F29 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf20.hostedemail.com (Postfix) with ESMTP id C12DF1C00AC for ; Mon, 4 Jul 2022 07:08:54 +0000 (UTC) Received: from pps.filterd (m0098409.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 2646kRnu001325; Mon, 4 Jul 2022 07:08:46 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=A961FvOMvs+LQoQGAxd2mt9Of4qC3gvsVjeVM6KH7CY=; b=MkhwIOhHC9xe28yJHMfJ03w9gJhzXkGynHzhhSc3gBn+ajkJn3fmxkBE2icjaPb1BiiN x3HsGrNr/Z3BNnHVukXA0lIJtDFFxrSZy4yMWvPbxBofWKZNxXYgX2wI4pw1oKvYgcn1 OFc0s6h6+eDcIn9uO/FEBU0oAzrqTBJgYf/xupYwl9nYQ+bny3OTyx5EduKd7fGaJNdR CsCDOuYLlXq7qokOIPCtubzET4Xzun1pT47UvYwAuxES/T6XigBJ+P/e+5ZrNIqPRuY/ Or9h8wIFvRvUR7+E2ZLptV3QeL2JElxSbxNclNe9GV6L4GgKckwFEX19H/lDM2DJ07Lq Hg== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3h3u7qrf0w-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:08:46 +0000 Received: from m0098409.ppops.net (m0098409.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 26475IrS014588; Mon, 4 Jul 2022 07:08:45 GMT Received: from ppma04wdc.us.ibm.com (1a.90.2fa9.ip4.static.sl-reverse.com [169.47.144.26]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3h3u7qrf0b-1 
(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:08:45 +0000 Received: from pps.filterd (ppma04wdc.us.ibm.com [127.0.0.1]) by ppma04wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 26475RJd030714; Mon, 4 Jul 2022 07:08:44 GMT Received: from b01cxnp22036.gho.pok.ibm.com (b01cxnp22036.gho.pok.ibm.com [9.57.198.26]) by ppma04wdc.us.ibm.com with ESMTP id 3h2dn9b4j2-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 Jul 2022 07:08:44 +0000 Received: from b01ledav002.gho.pok.ibm.com (b01ledav002.gho.pok.ibm.com [9.57.199.107]) by b01cxnp22036.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 26478hJF7799590 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 4 Jul 2022 07:08:43 GMT Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 4A042124052; Mon, 4 Jul 2022 07:08:43 +0000 (GMT) Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id D017C12405C; Mon, 4 Jul 2022 07:08:35 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.74.198]) by b01ledav002.gho.pok.ibm.com (Postfix) with ESMTP; Mon, 4 Jul 2022 07:08:35 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, Jagdish Gediya , "Aneesh Kumar K . 
V" Subject: [PATCH v8 09/12] mm/demotion: Demote pages according to allocation fallback order Date: Mon, 4 Jul 2022 12:36:09 +0530 Message-Id: <20220704070612.299585-10-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com> References: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: up_fBrZDVoaAp-fedvOsdfaLCBcS-zMO X-Proofpoint-ORIG-GUID: s2_ld_JrrYcYAGNY7E1zvuOvjLhGBvdy X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-07-04_05,2022-06-28_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1015 lowpriorityscore=0 malwarescore=0 bulkscore=0 mlxscore=0 phishscore=0 mlxlogscore=999 adultscore=0 suspectscore=0 impostorscore=0 priorityscore=1501 spamscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2207040030 ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1656918535; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=A961FvOMvs+LQoQGAxd2mt9Of4qC3gvsVjeVM6KH7CY=; b=2AvYkWHblEzLDv6BTNKW4h+mbUg+mKo8c1SGQ/sVaC9ewg3W4VJU82bEroLOeVOWOJKCt8 m4P+NNfJSM8aDZnvO5OZCCuy/wL4FIRmPwOKxc+fTz+NN6qmjz07qhzPscPTkGsZr0uiwG czFsi8fX+8c6f1K1s34aPI2Sd92PdII= ARC-Authentication-Results: i=1; imf20.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=MkhwIOhH; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf20.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1656918535; a=rsa-sha256; 
cv=none; b=CnjE73V2hqU3a1PvJEVmBg6x79kHcvKkAN8fdLGbARuRyjbD1OMb49xNqbgvM+D69FJ+Uq tgGntUFhN8g3eNcpnc/FAVVc/IcCdKV5vRLN4y7suDwoBXsbsKyy0E0GyJnL5lqqGpInlg stsNwdLYOidSocvxqi4m9kj6//hJFds= Authentication-Results: imf20.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=MkhwIOhH; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf20.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: C12DF1C00AC X-Stat-Signature: wkkjg58egsgsi653qzfyswn18tj15xnd X-Rspam-User: X-HE-Tag: 1656918534-95788 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Jagdish Gediya Currently, a higher tier node can only be demoted to selected nodes on the next lower tier as defined by the demotion path. This strict, hard-coded demotion order does not work in all use cases (e.g. some use cases may want to allow cross-socket demotion to another node in the same demotion tier as a fallback when the preferred demotion node is out of space). This demotion order is also inconsistent with the page allocation fallback order when all the nodes in a higher tier are out of space: The page allocation can fall back to any node from any lower tier, whereas the demotion order doesn't allow that currently. This patch adds support to get all the allowed demotion targets for a memory tier. demote_page_list() function is now modified to utilize this allowed node mask as the fallback allocation mask. 
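As a rough illustration of what "all the allowed demotion targets for a memory tier" means, here is a minimal user-space sketch of the mask computation. It is not kernel code: struct tier_sketch, build_lower_tier_masks() and the use of a 64-bit word in place of nodemask_t are hypothetical stand-ins, and the array is assumed to be ordered from the highest tier to the lowest, which is the order the list walk appears to rely on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for struct memory_tier. */
struct tier_sketch {
	uint64_t nodelist;        /* bit n set: node n belongs to this tier */
	uint64_t lower_tier_mask; /* filled in by build_lower_tier_masks() */
};

/*
 * tiers[] is assumed to be ordered from the highest tier to the lowest.
 * n_memory plays the role of node_states[N_MEMORY]: the nodes that
 * currently have memory.
 */
static void build_lower_tier_masks(struct tier_sketch *tiers, size_t count,
				   uint64_t n_memory)
{
	uint64_t lower = 0;
	size_t i;

	assert(tiers != NULL || count == 0);

	/* Start from every node that belongs to any tier ... */
	for (i = 0; i < count; i++)
		lower |= tiers[i].nodelist;
	/* ... and drop nodes that have no memory yet. */
	lower &= n_memory;

	/*
	 * Walking from the top tier down, strip the current tier's nodes.
	 * What is left are the nodes in strictly lower tiers.
	 */
	for (i = 0; i < count; i++) {
		lower &= ~tiers[i].nodelist;
		tiers[i].lower_tier_mask = lower;
	}
}
```

With three tiers holding nodes {0,1}, {2,3} and {4,5} and all six nodes having memory, the top tier's mask covers nodes 2-5, the middle tier's mask covers nodes 4-5, and the lowest tier's mask is empty, so its pages have no demotion target.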
Signed-off-by: Jagdish Gediya Signed-off-by: Aneesh Kumar K.V move allowed mask to memory tier --- include/linux/memory-tiers.h | 17 +++++++- mm/memory-tiers.c | 76 +++++++++++++++++++++++++++++++++--- mm/vmscan.c | 58 ++++++++++++++++++++------- 3 files changed, 129 insertions(+), 22 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 705b63ee31d5..335d21a30b2c 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -3,11 +3,12 @@ #define _LINUX_MEMORY_TIERS_H #include +#include +#include #ifdef CONFIG_NUMA #include -#include #define MEMORY_TIER_HBM_GPU 300 #define MEMORY_TIER_DRAM 200 @@ -20,18 +21,25 @@ struct memory_tier { struct list_head list; struct device dev; nodemask_t nodelist; + nodemask_t lower_tier_mask; }; extern bool numa_demotion_enabled; int node_create_and_set_memory_tier(int node, int tier); #ifdef CONFIG_MIGRATION int next_demotion_node(int node); +void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets); #else static inline int next_demotion_node(int node) { return NUMA_NO_NODE; } -#endif + +static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) +{ + *targets = NODE_MASK_NONE; +} +#endif /* CONFIG_MIGRATION */ int node_get_memory_tier_id(int node); int node_update_memory_tier(int node, int tier); struct memory_tier *node_get_memory_tier(int node); @@ -49,5 +57,10 @@ static inline int next_demotion_node(int node) { return NUMA_NO_NODE; } + +static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) +{ + *targets = NODE_MASK_NONE; +} #endif /* CONFIG_NUMA */ #endif /* _LINUX_MEMORY_TIERS_H */ diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 6a2476faf13a..aecce987df7c 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -374,6 +374,24 @@ void node_put_memory_tier(struct memory_tier *memtier) } #ifdef CONFIG_MIGRATION +void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) +{ + struct 
memory_tier *memtier; + + /* + * pg_data_t.memtier updates include a synchronize_rcu() + * which ensures that we either find NULL or a valid memtier + * in NODE_DATA. Protect the access via rcu_read_lock(). + */ + rcu_read_lock(); + memtier = rcu_dereference(pgdat->memtier); + if (memtier) + *targets = memtier->lower_tier_mask; + else + *targets = NODE_MASK_NONE; + rcu_read_unlock(); +} + /** * next_demotion_node() - Get the next node in the demotion path * @node: The starting node to lookup the next node @@ -422,10 +440,19 @@ int next_demotion_node(int node) /* Disable reclaim-based migration. */ static void __disable_all_migrate_targets(void) { + struct memory_tier *memtier; int node; - for_each_node_state(node, N_MEMORY) + for_each_node_state(node, N_MEMORY) { node_demotion[node].preferred = NODE_MASK_NONE; + /* + * We are holding memory_tier_lock, so it is safe + * to access pgdat->memtier. + */ + memtier = rcu_dereference_check(NODE_DATA(node)->memtier, + lockdep_is_held(&memory_tier_lock)); + memtier->lower_tier_mask = NODE_MASK_NONE; + } } static void disable_all_migrate_targets(void) @@ -455,10 +482,26 @@ static void establish_migration_targets(void) struct demotion_nodes *nd; int target = NUMA_NO_NODE, node; int distance, best_distance; - nodemask_t used; - - if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION)) - return; + nodemask_t used, lower_tier = NODE_MASK_NONE; + + if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION)) { + + for_each_node_state(node, N_MEMORY) { + /* + * We are holding memory_tier_lock, so it is safe + * to access pgdat->memtier.
+ */ + memtier = rcu_dereference_check(NODE_DATA(node)->memtier, + lockdep_is_held(&memory_tier_lock)); + memtier->lower_tier_mask = NODE_MASK_NONE; + } + /* + * Wait for read side to work with old values + * or see the updated NODE_MASK_NONE; + */ + synchronize_rcu(); + goto build_lower_tier_mask; + } disable_all_migrate_targets(); @@ -501,6 +544,29 @@ static void establish_migration_targets(void) } } while (1); } +build_lower_tier_mask: + /* + * Now build the lower_tier mask for each node, collecting the node mask + * from all memory tiers below it. This allows us to fall back demotion page + * allocation to a set of nodes that is closer to the above selected + * preferred node. + */ + list_for_each_entry(memtier, &memory_tiers, list) + nodes_or(lower_tier, lower_tier, memtier->nodelist); + /* + * Remove nodes not yet in N_MEMORY. + */ + nodes_and(lower_tier, node_states[N_MEMORY], lower_tier); + + list_for_each_entry(memtier, &memory_tiers, list) { + /* + * Keep removing the current tier from lower_tier nodes. + * This will remove all nodes in the current and above + * memory tiers from the lower_tier mask. + */ + nodes_andnot(lower_tier, lower_tier, memtier->nodelist); + memtier->lower_tier_mask = lower_tier; + } } static unsigned int default_memtier = DEFAULT_MEMORY_TIER; diff --git a/mm/vmscan.c b/mm/vmscan.c index 3a8f78277f99..60a5235dd639 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1460,21 +1460,34 @@ static void folio_check_dirty_writeback(struct folio *folio, mapping->a_ops->is_dirty_writeback(folio, dirty, writeback); } -static struct page *alloc_demote_page(struct page *page, unsigned long node) +static struct page *alloc_demote_page(struct page *page, unsigned long private) { - struct migration_target_control mtc = { - /* - * Allocate from 'node', or fail quickly and quietly. - * When this happens, 'page' will likely just be discarded - * instead of migrated.
- */ - .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | - __GFP_THISNODE | __GFP_NOWARN | - __GFP_NOMEMALLOC | GFP_NOWAIT, - .nid = node - }; + struct page *target_page; + nodemask_t *allowed_mask; + struct migration_target_control *mtc; + + mtc = (struct migration_target_control *)private; + + allowed_mask = mtc->nmask; + /* + * Make sure we allocate from the target node first, also trying to + * reclaim pages from the target node via kswapd if we are low on + * free memory on the target node. If we don't do this and if we have low + * free memory on the target memtier, we would start allocating pages + * from higher memory tiers without even forcing a demotion of cold + * pages from the target memtier. This can result in the kernel placing + * hot pages in higher memory tiers. + */ + mtc->nmask = NULL; + mtc->gfp_mask |= __GFP_THISNODE; + target_page = alloc_migration_target(page, (unsigned long)mtc); + if (target_page) + return target_page; - return alloc_migration_target(page, (unsigned long)&mtc); + mtc->gfp_mask &= ~__GFP_THISNODE; + mtc->nmask = allowed_mask; + + return alloc_migration_target(page, (unsigned long)mtc); } /* @@ -1487,6 +1500,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages, { int target_nid = next_demotion_node(pgdat->node_id); unsigned int nr_succeeded; + nodemask_t allowed_mask; + + struct migration_target_control mtc = { + /* + * Allocate from 'node', or fail quickly and quietly. + * When this happens, 'page' will likely just be discarded + * instead of migrated.
+ */ + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | + __GFP_NOMEMALLOC | GFP_NOWAIT, + .nid = target_nid, + .nmask = &allowed_mask + }; if (list_empty(demote_pages)) return 0; @@ -1494,10 +1520,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages, if (target_nid == NUMA_NO_NODE) return 0; + node_get_allowed_targets(pgdat, &allowed_mask); + /* Demotion ignores all cpuset and mempolicy settings */ migrate_pages(demote_pages, alloc_demote_page, NULL, - target_nid, MIGRATE_ASYNC, MR_DEMOTION, - &nr_succeeded); + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, + &nr_succeeded); if (current_is_kswapd()) __count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded); From patchwork Mon Jul 4 07:06:10 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12904784 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 16949C433EF for ; Mon, 4 Jul 2022 07:22:27 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 8F3456B0072; Mon, 4 Jul 2022 03:22:27 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 8A3008E0003; Mon, 4 Jul 2022 03:22:27 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 76A998E0001; Mon, 4 Jul 2022 03:22:27 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 675266B0072 for ; Mon, 4 Jul 2022 03:22:27 -0400 (EDT) Received: from smtpin16.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 2F5C12081 for ; Mon, 4 Jul 2022 07:09:04 +0000 (UTC) X-FDA: 79648540650.16.785D9E1 Received: from 
mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by imf25.hostedemail.com (Postfix) with ESMTP id B9A5FA00B8 for ; Mon, 4 Jul 2022 07:09:03 +0000 (UTC)
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V"
Subject: [PATCH v8 10/12] mm/demotion: Update node_is_toptier to work with memory tiers
Date: Mon, 4 Jul 2022 12:36:10 +0530
Message-Id: <20220704070612.299585-11-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.36.1
In-Reply-To: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com>
References: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com>
MIME-Version: 1.0
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID:

With memory tiers support we can have memory only NUMA nodes in the top tier from which we want to avoid promotion tracking NUMA faults.
Update node_is_toptier to work with memory tiers. All NUMA nodes are by
default top tier nodes. With lower memory tiers added we consider all
memory tiers above a memory tier having CPU NUMA nodes as a top memory tier.

Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h |  6 ++++++
 include/linux/node.h         |  5 -----
 mm/huge_memory.c             |  1 +
 mm/memory-tiers.c            | 41 ++++++++++++++++++++++++++++++++++++
 mm/migrate.c                 |  1 +
 mm/mprotect.c                |  1 +
 6 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 335d21a30b2c..ff1a08933575 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -44,6 +44,7 @@ int node_get_memory_tier_id(int node);
 int node_update_memory_tier(int node, int tier);
 struct memory_tier *node_get_memory_tier(int node);
 void node_put_memory_tier(struct memory_tier *memtier);
+bool node_is_toptier(int node);
 
 #else
 
@@ -62,5 +63,10 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
 {
 	*targets = NODE_MASK_NONE;
 }
+
+static inline bool node_is_toptier(int node)
+{
+	return true;
+}
 #endif /* CONFIG_NUMA */
 #endif /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/node.h b/include/linux/node.h
index 40d641a8bfb0..9ec680dd607f 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -185,9 +185,4 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
 
 #define to_node(device) container_of(device, struct node, dev)
 
-static inline bool node_is_toptier(int node)
-{
-	return node_state(node, N_CPU);
-}
-
 #endif /* _LINUX_NODE_H_ */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 834f288b3769..8405662646e9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -35,6 +35,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index aecce987df7c..7204f7381a15 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -18,6 +18,7 @@ struct demotion_nodes {
 static void establish_migration_targets(void);
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
+static int top_tier_id;
 /*
  * node_demotion[] examples:
  *
@@ -373,6 +374,31 @@ void node_put_memory_tier(struct memory_tier *memtier)
 	put_device(&memtier->dev);
 }
 
+bool node_is_toptier(int node)
+{
+	bool toptier;
+	pg_data_t *pgdat;
+	struct memory_tier *memtier;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return false;
+
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (!memtier) {
+		toptier = true;
+		goto out;
+	}
+	if (memtier->dev.id >= top_tier_id)
+		toptier = true;
+	else
+		toptier = false;
+out:
+	rcu_read_unlock();
+	return toptier;
+}
+
 #ifdef CONFIG_MIGRATION
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
 {
@@ -545,6 +571,21 @@ static void establish_migration_targets(void)
 		} while (1);
 	}
 build_lower_tier_mask:
+	/*
+	 * Promotion is allowed from a memory tier to higher
+	 * memory tier only if the memory tier doesn't include
+	 * compute. We want to skip promotion from a memory tier,
+	 * if any node that is part of the memory tier has CPUs.
+	 * Once we detect such a memory tier, we consider that tier
+	 * as top tier from which promotion is not allowed.
+	 */
+	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
+		nodes_and(used, node_states[N_CPU], memtier->nodelist);
+		if (!nodes_empty(used)) {
+			top_tier_id = memtier->dev.id;
+			break;
+		}
+	}
 	/*
 	 * Now build the lower_tier mask for each node collecting node mask from
 	 * all memory tier below it. This allows us to fallback demotion page
diff --git a/mm/migrate.c b/mm/migrate.c
index c758c9c21d7d..1da81136eaaa 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -50,6 +50,7 @@
 #include
 #include
 #include
+#include
 
 #include
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ba5592655ee3..92a2fc0fa88b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -31,6 +31,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include

From patchwork Mon Jul 4 07:06:11 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12904808
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 19A01C433EF for ; Mon, 4 Jul 2022 07:50:55 +0000 (UTC)
Received: by kanga.kvack.org (Postfix) id 925156B0072; Mon, 4 Jul 2022 03:50:54 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40) id 8D4D96B0073; Mon, 4 Jul 2022 03:50:54 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042) id 774938E0001; Mon, 4 Jul 2022 03:50:54 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id 649086B0072 for ; Mon, 4 Jul 2022 03:50:54 -0400 (EDT)
Received: from smtpin14.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id 77FDD22196 for ; Mon, 4 Jul 2022 07:09:10 +0000 (UTC)
X-FDA: 79648540860.14.0850AC9
Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf11.hostedemail.com (Postfix) with ESMTP id D8806400C8 for ; Mon, 4 Jul 2022 07:09:09 +0000 (UTC)
Received: from pps.filterd (m0098399.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5)
with ESMTP id 2646LW6C020522; Mon, 4 Jul 2022 07:09:01 GMT
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, Jagdish Gediya , "Aneesh Kumar K . V"
Subject: [PATCH v8 11/12] mm/demotion: Add documentation for memory tiering
Date: Mon, 4 Jul 2022 12:36:11 +0530
Message-Id: <20220704070612.299585-12-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.36.1
In-Reply-To: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com>
References: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com>
MIME-Version: 1.0
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID:

From: Jagdish Gediya

All N_MEMORY nodes are divided into 3 memory tiers with tier ID value MEMORY_TIER_HBM_GPU, MEMORY_TIER_DRAM and MEMORY_TIER_PMEM. By default, all nodes are assigned to default memory tier (MEMORY_TIER_DRAM). Demotion path for all N_MEMORY nodes is prepared based on the tier ID value of memory tiers.
This patch adds documentation introducing memory tiering, its sysfs
interfaces and how demotion is performed based on memory tiers.

[update doc format by Bagas Sanjaya ]

Suggested-by: Wei Xu
Signed-off-by: Jagdish Gediya
Signed-off-by: Aneesh Kumar K.V
---
 Documentation/admin-guide/mm/index.rst        |   1 +
 .../admin-guide/mm/memory-tiering.rst         | 192 ++++++++++++++++++
 2 files changed, 193 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/memory-tiering.rst

diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index c21b5823f126..3f211cbca8c3 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -32,6 +32,7 @@ the Linux memory management.
    idle_page_tracking
    ksm
    memory-hotplug
+   memory-tiering
    nommu-mmap
    numa_memory_policy
    numaperf
diff --git a/Documentation/admin-guide/mm/memory-tiering.rst b/Documentation/admin-guide/mm/memory-tiering.rst
new file mode 100644
index 000000000000..107599dbc952
--- /dev/null
+++ b/Documentation/admin-guide/mm/memory-tiering.rst
@@ -0,0 +1,192 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _admin_guide_memory_tiering:
+
+============
+Memory tiers
+============
+
+This document describes explicit memory tiering support along with
+demotion based on memory tiers.
+
+Introduction
+============
+
+Many systems have multiple types of memory devices, e.g. GPU, DRAM and
+PMEM. The memory subsystem of these systems can be called a memory
+tiering system because the performance of each type of
+memory is different. Memory tiers are defined based on the hardware
+capabilities of memory nodes. Each memory tier is assigned a tier ID
+value that determines the memory tier position in demotion order.
+
+The memory tier assignment of each node is independent of each
+other. Moving a node from one tier to another doesn't affect
+the tier assignment of any other node.
+
+Memory tiers are used to build the demotion targets for nodes. A node
+can demote its pages to any node in any lower tier.
+
+Memory tier ID
+==============
+
+Memory nodes are divided into 3 types of memory tiers, with the tier ID
+values shown below, based on their hardware characteristics.
+
+
+  * MEMORY_TIER_HBM_GPU
+  * MEMORY_TIER_DRAM
+  * MEMORY_TIER_PMEM
+
+Memory tiers initialization and (re)assignments
+===============================================
+
+By default, all nodes are assigned to the memory tier with the default tier ID
+DEFAULT_MEMORY_TIER, which is 200 (MEMORY_TIER_DRAM). The memory tier of
+a memory node can be modified either through sysfs or from the driver. On
+hotplug, the memory tier with the default tier ID is assigned to the memory node.
+
+
+Sysfs interfaces
+================
+
+Nodes belonging to a specific tier can be read from
+/sys/devices/system/memtier/memtierN/nodelist (read-only)
+
+Examples:
+
+1. On a system where node 0 is a CPU + DRAM node, node 1 is an HBM node and
+   node 2 is a PMEM node, an ideal tier layout will be:
+
+   .. code-block:: sh
+
+     $ cat /sys/devices/system/memtier/memtier0/nodelist
+     1
+     $ cat /sys/devices/system/memtier/memtier1/nodelist
+     0
+     $ cat /sys/devices/system/memtier/memtier2/nodelist
+     2
+
+2. On a system where nodes 0 & 1 are CPU + DRAM nodes and nodes 2 & 3 are PMEM
+   nodes:
+
+   .. code-block:: sh
+
+     $ cat /sys/devices/system/memtier/memtier0/nodelist
+     cat: /sys/devices/system/memtier/memtier0/nodelist: No such file or directory
+     $ cat /sys/devices/system/memtier/memtier1/nodelist
+     0-1
+     $ cat /sys/devices/system/memtier/memtier2/nodelist
+     2-3
+
+The default memory tier can be read from
+/sys/devices/system/memtier/default_tier (read-only)
+
+  .. code-block:: sh
+
+    $ cat /sys/devices/system/memtier/default_tier
+    memtier200
+
+The maximum supported memory tier ID can be read from
+/sys/devices/system/memtier/max_tier (read-only)
+
+  .. code-block:: sh
+
+    $ cat /sys/devices/system/memtier/max_tier
+    400
+
+An individual node's memory tier can be read or set using
+/sys/devices/system/node/nodeN/memtier (read-write), where N = node id
+
+When this interface is written, the node is moved from the old memory tier
+to the new memory tier and demotion targets for all N_MEMORY nodes are
+rebuilt.
+
+For example 1 mentioned above,
+  .. code-block:: sh
+
+    $ cat /sys/devices/system/node/node0/memtier
+    1
+    $ cat /sys/devices/system/node/node1/memtier
+    0
+    $ cat /sys/devices/system/node/node2/memtier
+    2
+
+Additional memory tiers can be created by writing a tier ID value to this file.
+This creates a new memory tier and moves the specified NUMA node to
+that memory tier.
+
+Demotion
+========
+
+In a system with DRAM and persistent memory, once DRAM
+fills up, reclaim will start and some of the DRAM contents will be
+thrown out even if there is space in persistent memory.
+Consequently, allocations will, at some point, start falling over to the slower
+persistent memory.
+
+That has two nasty properties. First, the newer allocations can end up in
+the slower persistent memory. Second, reclaimed data in DRAM are just
+discarded even if there are gobs of space in persistent memory that could
+be used.
+
+Instead of a page being discarded during reclaim, it can be moved to
+persistent memory. Allowing page migration during reclaim enables
+these systems to migrate pages from fast (higher) tiers to slow (lower)
+tiers when the fast (higher) tier is under pressure.
+
+
+Enable/Disable demotion
+-----------------------
+
+By default, demotion is disabled. It can be enabled or disabled using
+the sysfs interface below:
+
+  .. code-block:: sh
+
+    $ echo 1 > /sys/kernel/mm/numa/demotion_enabled  # accepts 0/1 or false/true
+
+Preferred and allowed demotion nodes
+------------------------------------
+
+Preferred nodes for a specific N_MEMORY node are the best nodes
+from the next possible lower memory tier. Allowed nodes for any
+node are all the nodes available in all possible lower memory
+tiers.
+
+For example, on a system where nodes 0 & 1 are CPU + DRAM nodes and
+nodes 2 & 3 are PMEM nodes,
+
+  * node distances:
+
+    ==== == == == ==
+    node 0  1  2  3
+    ==== == == == ==
+    0    10 20 30 40
+    1    20 10 40 30
+    2    30 40 10 40
+    3    40 30 40 10
+    ==== == == == ==
+
+
+  .. code-block:: none
+
+    memory_tiers[0] =
+    memory_tiers[1] = 0-1
+    memory_tiers[2] = 2-3
+
+    node_demotion[0].preferred = 2
+    node_demotion[0].allowed = 2, 3
+    node_demotion[1].preferred = 3
+    node_demotion[1].allowed = 3, 2
+    node_demotion[2].preferred =
+    node_demotion[2].allowed =
+    node_demotion[3].preferred =
+    node_demotion[3].allowed =
+
+Memory allocation for demotion
+------------------------------
+
+If a page needs to be demoted from any node, the kernel first tries
+to allocate a new page from the node's preferred node and falls back to
+the node's allowed targets in allocation fallback order.
+

From patchwork Mon Jul 4 07:06:12 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12904768
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id CDB5BC433EF for ; Mon, 4 Jul 2022 07:09:20 +0000 (UTC)
Received: by kanga.kvack.org (Postfix) id 5E0B26B0074; Mon, 4 Jul 2022 03:09:20 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40) id 590B0900004; Mon, 4 Jul 2022 03:09:20 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042) id 45888900003; Mon, 4 Jul 2022 03:09:20 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 358F36B0074 for ; Mon, 4 Jul 2022
03:09:20 -0400 (EDT)
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V"
Subject: [PATCH v8 12/12] mm/demotion: Add sysfs ABI documentation
Date: Mon, 4 Jul 2022 12:36:12 +0530
Message-Id: <20220704070612.299585-13-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.36.1
In-Reply-To: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com>
References: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com>
MIME-Version: 1.0
Sender:
owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID:

Add sysfs ABI documentation.

Signed-off-by: Wei Xu
Signed-off-by: Aneesh Kumar K.V
---
 .../ABI/testing/sysfs-kernel-mm-memory-tiers  | 61 +++++++++++++++++++
 1 file changed, 61 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers b/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers
new file mode 100644
index 000000000000..843fb59d2f3d
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers
@@ -0,0 +1,61 @@
+What:		/sys/devices/system/memtier/
+Date:		June 2022
+Contact:	Linux memory management mailing list
+Description:	Interface for tiered memory
+
+		This is the directory containing the information about memory tiers.
+
+		Each memory tier has its own subdirectory.
+
+		The order of memory tiers is determined by their tier ID value.
+		A higher tier ID value means a higher tier. memtier300 is a higher
+		memory tier than memtier100.
+
+What:		/sys/devices/system/memtier/default_tier
+Date:		June 2022
+Contact:	Linux memory management mailing list
+Description:	Default memory tier
+
+		The default memory tier to which memory would get added via hotplug
+		if the NUMA node is not part of any memory tier.
+
+What:		/sys/devices/system/memtier/max_tier
+Date:		June 2022
+Contact:	Linux memory management mailing list
+Description:	Maximum memory tier ID supported
+
+		The max memory tier device ID we can create. Users can create memory
+		tiers in the range [0 - max_tier].
+
+What:		/sys/devices/system/memtier/memtierN/
+Date:		June 2022
+Contact:	Linux memory management mailing list
+Description:	Directory with details of a specific memory tier
+
+		This is the directory containing the information about a particular
+		memory tier, memtierN, where N is the memtier device ID (e.g. 0, 1).
+
+		The memtier device ID number itself is just an identifier and has no
+		special meaning. Its value relative to other memtiers decides the level
+		of this memtier in the tier hierarchy.
+
+
+What:		/sys/devices/system/memtier/memtierN/nodelist
+Date:		June 2022
+Contact:	Linux memory management mailing list
+Description:	Memory tier nodelist
+
+
+		When read, lists the memory nodes in the specified tier.
+
+What:		/sys/devices/system/node/nodeN/memtier
+Date:		June 2022
+Contact:	Linux memory management mailing list
+Description:	Memory tier details for node N
+
+		When read, lists the device ID of the memory tier that the node belongs
+		to. Its value is empty for a CPU-only NUMA node.
+
+		When written, the kernel moves the node into the specified memory
+		tier if the move is allowed. The tier assignments of all other
+		nodes are not affected.