From patchwork Fri Jun 10 13:52:17 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12877621
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
 Tim C Chen, Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
 Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
 Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes,
 "Aneesh Kumar K.V"
Subject: [PATCH v6 01/13] mm/demotion: Add support for explicit memory tiers
Date: Fri, 10 Jun 2022 19:22:17 +0530
Message-Id: <20220610135229.182859-2-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.36.1
In-Reply-To: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com>
References: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com>

In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created during
kernel initialization and updated when a NUMA node is hot-added or
hot-removed. The current implementation puts all nodes with CPUs into
the top tier, and builds the tier hierarchy tier-by-tier by establishing
the per-node demotion targets based on the distances between nodes.

This memory tier kernel interface needs to be improved for several
important use cases:

The current tier initialization code always initializes each
memory-only NUMA node into a lower tier.
But a memory-only NUMA node may have a high performance memory device
(e.g. a DRAM device attached via CXL.mem or a DRAM-backed memory-only
node on a virtual machine) and should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top tier. But
on a system with HBM or GPU devices, the memory-only NUMA nodes mapping
these devices should be in the top tier, and DRAM nodes with CPUs are
better placed in the next lower tier.

With the current kernel, pages on a higher tier node can only be demoted
to selected nodes on the next lower tier as defined by the demotion
path, not to any other node from any lower tier. This strict, hard-coded
demotion order does not work in all use cases (e.g. some use cases may
want to allow cross-socket demotion to another node in the same demotion
tier as a fallback when the preferred demotion node is out of space).
This demotion order is also inconsistent with the page allocation
fallback order when all the nodes in a higher tier are out of space:
the page allocation can fall back to any node from any lower tier,
whereas the demotion order doesn't allow that.

The current kernel also doesn't provide any interface for userspace to
learn about the memory tier hierarchy in order to optimize its memory
allocations.

This patch series addresses the above by defining memory tiers
explicitly.

This patch introduces explicit memory tiers with ranks. The rank value
of a memory tier is used to derive the demotion order between NUMA
nodes. The memory tiers present in a system can be found at
/sys/devices/system/memtier/memtierN/.

"Rank" is an opaque value. Its absolute value doesn't have any special
meaning. But the rank values of different memtiers can be compared with
each other to determine the memory tier order.

For example, if we have 3 memtiers: memtier0, memtier1, memtier2, and
their rank values are 300, 200, 100, then the memory tier order is:
memtier0 -> memtier1 -> memtier2, where memtier0 is the highest tier
and memtier2 is the lowest tier.
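
As a rough illustration of the rank ordering described above, the
following userspace sketch sorts a small array of tiers by rank and
looks up the tier immediately below a given one. It is not part of this
patch: struct tier_model, sort_tiers_by_rank() and next_lower_tier() are
hypothetical names used only for this example, and the in-kernel code
keeps a rank-sorted linked list instead of an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model of a memory tier, not the kernel struct. */
struct tier_model {
	int id;		/* the N in memtierN */
	int rank;	/* opaque, only meaningful when compared to other ranks */
};

/* qsort() comparator: a higher rank sorts first (top tier first). */
static int cmp_rank_desc(const void *a, const void *b)
{
	const struct tier_model *ta = a;
	const struct tier_model *tb = b;

	return (tb->rank > ta->rank) - (tb->rank < ta->rank);
}

/* Put the tiers into memory tier order: highest rank first. */
static void sort_tiers_by_rank(struct tier_model *tiers, size_t n)
{
	assert(tiers != NULL || n == 0);
	qsort(tiers, n, sizeof(*tiers), cmp_rank_desc);
}

/*
 * Return the id of the tier directly below tier `id`, or -1 if `id` is
 * the lowest tier or is not present. `tiers` must already be sorted.
 */
static int next_lower_tier(const struct tier_model *tiers, size_t n, int id)
{
	for (size_t i = 0; i + 1 < n; i++)
		if (tiers[i].id == id)
			return tiers[i + 1].id;
	return -1;
}
```

With the ranks from the example above (300, 200, 100), sorting yields
memtier0, memtier1, memtier2 regardless of the input order, and
next_lower_tier() reports memtier1 as the tier below memtier0.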
The rank value of each memtier should be unique.

A higher rank memory tier appears earlier in the demotion order than a
lower rank memory tier, i.e. during reclaim a node in a higher rank
memory tier is preferred as the demotion target over a node in a lower
rank memory tier.

This patchset introduces 3 memory tiers (memtier0, memtier1 and
memtier2) which are created by different kernel subsystems. The default
memory tier created by the kernel is memtier1. Once created, these
memory tiers are not destroyed even if they don't have any NUMA nodes
assigned to them.

The memory tiers are exposed under

  /sys/devices/system/memtier/memtierN/

and the nodes which are part of a specific memory tier can be listed via

  /sys/devices/system/memtier/memtierN/nodelist

This patch is based on the proposal sent by Wei Xu at [1].

[1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com

Suggested-by: Wei Xu
Signed-off-by: Jagdish Gediya
Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h | 20 ++++++++
 mm/Kconfig                   |  3 ++
 mm/Makefile                  |  1 +
 mm/memory-tiers.c            | 89 ++++++++++++++++++++++++++++++++++++
 4 files changed, 113 insertions(+)
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
new file mode 100644
index 000000000000..e17f6b4ee177
--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+#ifdef CONFIG_TIERED_MEMORY
+
+#define MEMORY_TIER_HBM_GPU	0
+#define MEMORY_TIER_DRAM	1
+#define MEMORY_TIER_PMEM	2
+
+#define MEMORY_RANK_HBM_GPU	300
+#define MEMORY_RANK_DRAM	200
+#define MEMORY_RANK_PMEM	100
+
+#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
+#define MAX_MEMORY_TIERS  3
+
+#endif	/* CONFIG_TIERED_MEMORY */
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 169e64192e48..bb5aa585ab41 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -614,6 +614,9 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
 config ARCH_ENABLE_THP_MIGRATION
 	bool
 
+config TIERED_MEMORY
+	def_bool NUMA
+
 config HUGETLB_PAGE_SIZE_VARIABLE
 	def_bool n
 	help
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..482557fbc9d1 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST) += memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
new file mode 100644
index 000000000000..d9fa955f208e
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,89 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+#include
+
+struct memory_tier {
+	struct list_head list;
+	nodemask_t nodelist;
+	int id;
+	int rank;
+};
+
+static DEFINE_MUTEX(memory_tier_lock);
+static LIST_HEAD(memory_tiers);
+
+/*
+ * Keep it simple by having direct mapping between
+ * tier index and rank value.
+ */
+static inline int get_rank_from_tier(unsigned int tier)
+{
+	switch (tier) {
+	case MEMORY_TIER_HBM_GPU:
+		return MEMORY_RANK_HBM_GPU;
+	case MEMORY_TIER_DRAM:
+		return MEMORY_RANK_DRAM;
+	case MEMORY_TIER_PMEM:
+		return MEMORY_RANK_PMEM;
+	}
+	return -1;
+}
+
+static void insert_memory_tier(struct memory_tier *memtier)
+{
+	struct list_head *ent;
+	struct memory_tier *tmp_memtier;
+
+	list_for_each(ent, &memory_tiers) {
+		tmp_memtier = list_entry(ent, struct memory_tier, list);
+		if (tmp_memtier->rank < memtier->rank) {
+			list_add_tail(&memtier->list, ent);
+			return;
+		}
+	}
+	list_add_tail(&memtier->list, &memory_tiers);
+}
+
+static struct memory_tier *register_memory_tier(unsigned int tier,
+						unsigned int rank)
+{
+	struct memory_tier *memtier;
+
+	if (tier >= MAX_MEMORY_TIERS)
+		return ERR_PTR(-EINVAL);
+
+	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!memtier)
+		return ERR_PTR(-ENOMEM);
+
+	memtier->id = tier;
+	memtier->rank = rank;
+
+	insert_memory_tier(memtier);
+
+	return memtier;
+}
+
+static int __init memory_tier_init(void)
+{
+	struct memory_tier *memtier;
+
+	/*
+	 * Register only default memory tier to hide all empty
+	 * memory tier from sysfs.
+	 */
+	memtier = register_memory_tier(DEFAULT_MEMORY_TIER,
+				       get_rank_from_tier(DEFAULT_MEMORY_TIER));
+
+	if (IS_ERR(memtier))
+		panic("%s() failed to register memory tier: %ld\n",
+		      __func__, PTR_ERR(memtier));
+
+	/* CPU only nodes are not part of memory tiers. */
+	memtier->nodelist = node_states[N_MEMORY];
+
+	return 0;
+}
+subsys_initcall(memory_tier_init);

From patchwork Fri Jun 10 13:52:18 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12877622
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
 Tim C Chen, Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
 Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
 Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes,
 "Aneesh Kumar K.V"
Subject: [PATCH v6 02/13] mm/demotion: Move memory demotion related code
Date: Fri, 10 Jun 2022 19:22:18 +0530
Message-Id: <20220610135229.182859-3-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.36.1
In-Reply-To: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com>
References: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com>

This patch moves the memory demotion related code to mm/memory-tiers.c.
No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h |  6 ++++
 include/linux/migrate.h      |  2 --
 mm/memory-tiers.c            | 61 ++++++++++++++++++++++++++++++++++++
 mm/migrate.c                 | 60 +----------------------------------
 mm/vmscan.c                  |  1 +
 5 files changed, 69 insertions(+), 61 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index e17f6b4ee177..44c3c3b16a36 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -2,6 +2,8 @@
 #ifndef _LINUX_MEMORY_TIERS_H
 #define _LINUX_MEMORY_TIERS_H
 
+#include
+
 #ifdef CONFIG_TIERED_MEMORY
 
 #define MEMORY_TIER_HBM_GPU	0
@@ -15,6 +17,10 @@
 #define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
 #define MAX_MEMORY_TIERS  3
 
+extern bool numa_demotion_enabled;
+#else
+#define numa_demotion_enabled	false
+
 #endif	/* CONFIG_TIERED_MEMORY */
 
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 069a89e847f3..43e737215f33 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -78,7 +78,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
 extern void set_migration_target_nodes(void);
 extern void migrate_on_reclaim_init(void);
-extern bool numa_demotion_enabled;
 extern int next_demotion_node(int node);
 #else
 static inline void set_migration_target_nodes(void) {}
@@ -87,7 +86,6 @@ static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
-#define numa_demotion_enabled	false
 #endif
 
 #ifdef CONFIG_COMPACTION
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index d9fa955f208e..9c6b40d7e0bf 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include
+#include
 #include
 #include
 #include
@@ -87,3 +88,63 @@ static int __init memory_tier_init(void)
 	return 0;
 }
 subsys_initcall(memory_tier_init);
+
+bool numa_demotion_enabled = false;
+
+#ifdef CONFIG_SYSFS
+static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
+					  struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%s\n",
+			  numa_demotion_enabled ? "true" : "false");
+}
+
+static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = kstrtobool(buf, &numa_demotion_enabled);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute numa_demotion_enabled_attr =
+	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
+	       numa_demotion_enabled_store);
+
+static struct attribute *numa_attrs[] = {
+	&numa_demotion_enabled_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group numa_attr_group = {
+	.attrs = numa_attrs,
+};
+
+static int __init numa_init_sysfs(void)
+{
+	int err;
+	struct kobject *numa_kobj;
+
+	numa_kobj = kobject_create_and_add("numa", mm_kobj);
+	if (!numa_kobj) {
+		pr_err("failed to create numa kobject\n");
+		return -ENOMEM;
+	}
+	err = sysfs_create_group(numa_kobj, &numa_attr_group);
+	if (err) {
+		pr_err("failed to register numa group\n");
+		goto delete_obj;
+	}
+	return 0;
+
+delete_obj:
+	kobject_put(numa_kobj);
+	return err;
+}
+subsys_initcall(numa_init_sysfs);
+#endif /* CONFIG_SYSFS */
diff --git a/mm/migrate.c b/mm/migrate.c
index e51588e95f57..29cacc217e38 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2508,64 +2508,6 @@ void __init migrate_on_reclaim_init(void)
 	set_migration_target_nodes();
 	cpus_read_unlock();
 }
+#endif /* CONFIG_NUMA */
 
-bool numa_demotion_enabled = false;
-
-#ifdef CONFIG_SYSFS
-static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
-					  struct kobj_attribute *attr, char *buf)
-{
-	return sysfs_emit(buf, "%s\n",
-			  numa_demotion_enabled ? "true" : "false");
-}
-
-static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
-					   struct kobj_attribute *attr,
-					   const char *buf, size_t count)
-{
-	ssize_t ret;
-
-	ret = kstrtobool(buf, &numa_demotion_enabled);
-	if (ret)
-		return ret;
-
-	return count;
-}
-
-static struct kobj_attribute numa_demotion_enabled_attr =
-	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
-	       numa_demotion_enabled_store);
-
-static struct attribute *numa_attrs[] = {
-	&numa_demotion_enabled_attr.attr,
-	NULL,
-};
-
-static const struct attribute_group numa_attr_group = {
-	.attrs = numa_attrs,
-};
-
-static int __init numa_init_sysfs(void)
-{
-	int err;
-	struct kobject *numa_kobj;
 
-	numa_kobj = kobject_create_and_add("numa", mm_kobj);
-	if (!numa_kobj) {
-		pr_err("failed to create numa kobject\n");
-		return -ENOMEM;
-	}
-	err = sysfs_create_group(numa_kobj, &numa_attr_group);
-	if (err) {
-		pr_err("failed to register numa group\n");
-		goto delete_obj;
-	}
-	return 0;
-
-delete_obj:
-	kobject_put(numa_kobj);
-	return err;
-}
-subsys_initcall(numa_init_sysfs);
-#endif /* CONFIG_SYSFS */
-#endif /* CONFIG_NUMA */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f7d9a683e3a7..3a8f78277f99 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -50,6 +50,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include

From patchwork Fri Jun 10 13:52:19 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12877623
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
 Tim C Chen, Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
 Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
 Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes,
 "Aneesh Kumar K.V"
Subject: [PATCH v6 03/13] mm/demotion: Return error on write to numa_demotion sysfs
Date: Fri, 10 Jun 2022 19:22:19 +0530
Message-Id: <20220610135229.182859-4-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.36.1
In-Reply-To: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com>
References: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com>

With CONFIG_MIGRATION disabled, return EINVAL on write.

Signed-off-by: Aneesh Kumar K.V
---
 mm/memory-tiers.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 9c6b40d7e0bf..c3123a457d90 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -105,6 +105,9 @@ static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
 {
 	ssize_t ret;
 
+	if (!IS_ENABLED(CONFIG_MIGRATION))
+		return -EINVAL;
+
 	ret = kstrtobool(buf, &numa_demotion_enabled);
 	if (ret)
 		return ret;

From patchwork Fri Jun 10 13:52:20 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12877624
Received:
from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 475A68D009C for ; Fri, 10 Jun 2022 09:53:48 -0400 (EDT) Received: from smtpin20.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay12.hostedemail.com (Postfix) with ESMTP id 22D9F120354 for ; Fri, 10 Jun 2022 13:53:48 +0000 (UTC) X-FDA: 79562469336.20.0283A62 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf24.hostedemail.com (Postfix) with ESMTP id 9EFD118006B for ; Fri, 10 Jun 2022 13:53:47 +0000 (UTC) Received: from pps.filterd (m0098396.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 25AD7h80030097; Fri, 10 Jun 2022 13:53:28 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=cr7oU8eRYwQT2y9i+Jcicbl+Lj5x+BxH7cJX0SuAD4g=; b=Ex0ifdz0s8jre+ogcBMoRlKpJAR3C8TRsLQ09FpbvhvfLZkyFbcosU8BLxbedfh6nBwj ukU4vMpWPKyVKh0VFAnAnTiLsLDG2eteTrLVcCxA2zsT1SnIomKWx8z5BLDLtvOmhIRS +HYVKvVe+8Zh5ZNZLFBLprAH7A6ufvRgjJj//1R6B9ECb2qAJcfKiiZWzP3HqQ+mAIje FWTGmM16KvD+gn4aauk0QPjP/ZmRvkUANylMJXS4VdYL41KtGfU2cazLuEYuoe0+orwL fN3/EuN4xrD2tWRJb8pCNyenIJdLKnEjCiEZI4m7wuSjB4lBiHnChu4LxOhN4NUlTOwQ DA== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm4vaaxpx-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:53:28 +0000 Received: from m0098396.ppops.net (m0098396.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 25ADNxk4029470; Fri, 10 Jun 2022 13:53:27 GMT Received: from ppma02dal.us.ibm.com (a.bd.3ea9.ip4.static.sl-reverse.com [169.62.189.10]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm4vaaxpm-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:53:27 +0000 Received: 
from pps.filterd (ppma02dal.us.ibm.com [127.0.0.1]) by ppma02dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 25ADZTm8014175; Fri, 10 Jun 2022 13:53:26 GMT Received: from b03cxnp08026.gho.boulder.ibm.com (b03cxnp08026.gho.boulder.ibm.com [9.17.130.18]) by ppma02dal.us.ibm.com with ESMTP id 3gfy1bb4he-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:53:26 +0000 Received: from b03ledav003.gho.boulder.ibm.com (b03ledav003.gho.boulder.ibm.com [9.17.130.234]) by b03cxnp08026.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 25ADrPU328246490 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 10 Jun 2022 13:53:25 GMT Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 11A656A047; Fri, 10 Jun 2022 13:53:25 +0000 (GMT) Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id A86206A04D; Fri, 10 Jun 2022 13:53:16 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.90.151]) by b03ledav003.gho.boulder.ibm.com (Postfix) with ESMTP; Fri, 10 Jun 2022 13:53:16 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K.V" Subject: [PATCH v6 04/13] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM Date: Fri, 10 Jun 2022 19:22:20 +0530 Message-Id: <20220610135229.182859-5-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> References: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: 1I7VJeRFTIOzEt8THFTv0rHuFQF9BBsv 
X-Proofpoint-GUID: 6rZIckOcuaQhy2dYIT0FVzzTXQjG5Xnm X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-06-10_06,2022-06-09_02,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 mlxscore=0 phishscore=0 priorityscore=1501 bulkscore=0 spamscore=0 malwarescore=0 adultscore=0 impostorscore=0 suspectscore=0 mlxlogscore=999 clxscore=1015 lowpriorityscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2206100056 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1654869227; a=rsa-sha256; cv=none; b=yhKiv7xKXGJtq3MthIz93mgam/ORJVgOcyfiYbdyAgxqy517mxrUSbAjN6SfPL2pRrwY7v CwqHlBZVuGcm5dfTT/+xgMGc11wuCumxVQbEqARUrE9NPqMcUMCVo99n8NPopbsOQOWQ9h zfheHHy9OvJ4e4ArBL3CgD059tlZLMU= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1654869227; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=cr7oU8eRYwQT2y9i+Jcicbl+Lj5x+BxH7cJX0SuAD4g=; b=5nYA2WGZ9Tkwo85/YgKfh0+iDNys+CeS2PBV2r6N3Aq409tID4Ijl3YZY1291n0Ke1xfS3 X9f7TpnR7Zu9tnjzOIO4AC/GES3uD5Hd4fkB0CPoxpLZgFQMfT7CiRfFm89tW0xNXhxFGo f5DyoDVE4f48gug+HgKCfWL12nyQnPA= ARC-Authentication-Results: i=1; imf24.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=Ex0ifdz0; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf24.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com Authentication-Results: imf24.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=Ex0ifdz0; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf24.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted 
sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Stat-Signature: 3557cmfoyqsqfjypauc6xjijy7a75iop X-Rspamd-Queue-Id: 9EFD118006B X-Rspamd-Server: rspam12 X-Rspam-User: X-HE-Tag: 1654869227-917599 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: By default, all nodes are assigned to DEFAULT_MEMORY_TIER, which is the memory tier designated for nodes with DRAM. Set the dax kmem device node's tier to MEMORY_TIER_PMEM. MEMORY_TIER_PMEM is assigned a default rank value of 100 and appears below DEFAULT_MEMORY_TIER in demotion order. Signed-off-by: Jagdish Gediya Signed-off-by: Aneesh Kumar K.V --- drivers/dax/kmem.c | 4 ++ include/linux/memory-tiers.h | 1 + mm/memory-tiers.c | 78 ++++++++++++++++++++++++++++++++++++ 3 files changed, 83 insertions(+) diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index a37622060fff..0cb3de3d138f 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -11,6 +11,7 @@ #include #include #include +#include #include "dax-private.h" #include "bus.h" @@ -147,6 +148,9 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) dev_set_drvdata(dev, data); +#ifdef CONFIG_TIERED_MEMORY + node_create_and_set_memory_tier(numa_node, MEMORY_TIER_PMEM); +#endif return 0; err_request_mem: diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 44c3c3b16a36..e102ec73ab80 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -18,6 +18,7 @@ #define MAX_MEMORY_TIERS 3 extern bool numa_demotion_enabled; +int node_create_and_set_memory_tier(int node, int tier); #else #define numa_demotion_enabled false diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index c3123a457d90..00d393a5a628 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -67,6 +67,84 @@ static struct memory_tier *register_memory_tier(unsigned int tier, return memtier; } +static struct memory_tier *__node_get_memory_tier(int node) 
+{ + struct memory_tier *memtier; + + list_for_each_entry(memtier, &memory_tiers, list) { + if (node_isset(node, memtier->nodelist)) + return memtier; + } + return NULL; +} + +static struct memory_tier *__get_memory_tier_from_id(int id) +{ + struct memory_tier *memtier; + + list_for_each_entry(memtier, &memory_tiers, list) { + if (memtier->id == id) + return memtier; + } + return NULL; +} + +static int __node_create_and_set_memory_tier(int node, int tier) +{ + int ret = 0; + struct memory_tier *memtier; + + memtier = __get_memory_tier_from_id(tier); + if (!memtier) { + int rank; + + rank = get_rank_from_tier(tier); + if (rank == -1) { + ret = -EINVAL; + goto out; + } + memtier = register_memory_tier(tier, rank); + if (!memtier) { + ret = -EINVAL; + goto out; + } + } + node_set(node, memtier->nodelist); +out: + return ret; +} + +int node_create_and_set_memory_tier(int node, int tier) +{ + struct memory_tier *current_tier; + int ret = 0; + + mutex_lock(&memory_tier_lock); + + current_tier = __node_get_memory_tier(node); + if (!current_tier) { + ret = __node_create_and_set_memory_tier(node, tier); + goto out; + } + + if (current_tier->id == tier) + goto out; + + node_clear(node, current_tier->nodelist); + + ret = __node_create_and_set_memory_tier(node, tier); + if (ret) { + /* reset it back to older tier */ + node_set(node, current_tier->nodelist); + goto out; + } +out: + mutex_unlock(&memory_tier_lock); + + return ret; +} +EXPORT_SYMBOL_GPL(node_create_and_set_memory_tier); + static int __init memory_tier_init(void) { struct memory_tier *memtier; From patchwork Fri Jun 10 13:52:21 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12877625 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 
80D99CCA47B for ; Fri, 10 Jun 2022 13:53:54 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 1C98A8D00A8; Fri, 10 Jun 2022 09:53:54 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 1791E8D009C; Fri, 10 Jun 2022 09:53:54 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id EE6C88D00A8; Fri, 10 Jun 2022 09:53:53 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id D91338D009C for ; Fri, 10 Jun 2022 09:53:53 -0400 (EDT) Received: from smtpin10.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id BE4B434AB0 for ; Fri, 10 Jun 2022 13:53:53 +0000 (UTC) X-FDA: 79562469546.10.7CD59F4 Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by imf13.hostedemail.com (Postfix) with ESMTP id 1DFBE20066 for ; Fri, 10 Jun 2022 13:53:52 +0000 (UTC) Received: from pps.filterd (m0098420.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 25ADL5El010719; Fri, 10 Jun 2022 13:53:38 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=OfOtYtxNYGgQJvNxkt1FZF9nJGnGnNOp0iR/2yZPrFg=; b=pwKZBOLSNGPHsi3Bt5mXZE90k1sz5KiKhoneA+KEIY/PxthBC2BfoMRvLmjQIsm1HB6U EAAMMteoNLBlLydaWwT7US+QddIdDG5bWvbSseb6k0f37jovfcR5Amo+3bO21xF9a8Dm JlLZ8MxmWTzKThHJBaRJ+i/LqrEFHbDcOQ+8gkIfwm4yqnYtxGkDPpD7ydz9wi3xh2hl /qcbqP07FwEgubY29jrlxGZlUpusd5hOd34Tu/VbuUkFXAGZ8dxLMRgMIZQ7keMtduWs kJ4K/cfCSgJSXFk0abpwrTrsBv7UBA5vZrhPJ/asQYShF+kFr6yRCuOV6I7jhdcu4UCf zw== Received: from pps.reinject (localhost [127.0.0.1]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm6rngkfu-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 
Jun 2022 13:53:38 +0000 Received: from m0098420.ppops.net (m0098420.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 25ADRZWL004471; Fri, 10 Jun 2022 13:53:37 GMT Received: from ppma03dal.us.ibm.com (b.bd.3ea9.ip4.static.sl-reverse.com [169.62.189.11]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm6rngkfc-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:53:37 +0000 Received: from pps.filterd (ppma03dal.us.ibm.com [127.0.0.1]) by ppma03dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 25ADZ831008986; Fri, 10 Jun 2022 13:53:36 GMT Received: from b03cxnp07027.gho.boulder.ibm.com (b03cxnp07027.gho.boulder.ibm.com [9.17.130.14]) by ppma03dal.us.ibm.com with ESMTP id 3gfy1bk6fj-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:53:36 +0000 Received: from b03ledav003.gho.boulder.ibm.com (b03ledav003.gho.boulder.ibm.com [9.17.130.234]) by b03cxnp07027.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 25ADrZVY28115384 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 10 Jun 2022 13:53:35 GMT Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 020DC6A054; Fri, 10 Jun 2022 13:53:35 +0000 (GMT) Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id D94166A047; Fri, 10 Jun 2022 13:53:25 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.90.151]) by b03ledav003.gho.boulder.ibm.com (Postfix) with ESMTP; Fri, 10 Jun 2022 13:53:25 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K.V" 
Subject: [PATCH v6 05/13] mm/demotion: Build demotion targets based on explicit memory tiers Date: Fri, 10 Jun 2022 19:22:21 +0530 Message-Id: <20220610135229.182859-6-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> References: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: b0ErQ_RSBOInVU2EOtrbD7crdc0FSa7F X-Proofpoint-ORIG-GUID: pP-xg5SqKFpu9VbLO-6tuPrJkbIiQzLV X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-06-10_06,2022-06-09_02,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 bulkscore=0 malwarescore=0 clxscore=1015 lowpriorityscore=0 phishscore=0 adultscore=0 mlxlogscore=999 impostorscore=0 mlxscore=0 spamscore=0 suspectscore=0 priorityscore=1501 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2206100056 ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1654869233; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=OfOtYtxNYGgQJvNxkt1FZF9nJGnGnNOp0iR/2yZPrFg=; b=DLMS9eVxbMA6XCZWhaxihOFNAngv8lyqfYXbu9Lub7afTPfIZssCbPAu8XIkQggwQVN/Da ju2x60a83+BVyuFE86QWkTEG450co/Klji7oKFsUvT+3qG8g60RN1igDnFrF1kFcaMxBWo vqa/7n3ANaInTNsrq+lAsE16HdP7nzc= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1654869233; a=rsa-sha256; cv=none; b=f3CmY5CL1u4wq9Pb6T7SoDMt9HIS78IWivQzijFzFtFwS6pU4tB6VFR17uu4SqFTwls/S0 mUHb4RTSeZt7RD5TmZg+2uRJXECD6xyOTdzGlwiZJnlSR0thXY1+U+sQpgv15TZF4AKXMw xYnPbGhKOCErggQB55uIX72Fo3oEamQ= ARC-Authentication-Results: i=1; imf13.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=pwKZBOLS; dmarc=pass 
(policy=none) header.from=ibm.com; spf=pass (imf13.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com Authentication-Results: imf13.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=pwKZBOLS; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf13.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspamd-Server: rspam08 X-Rspam-User: X-Stat-Signature: cyfxjsuczkomutadhr11ck6b1ym1r48o X-Rspamd-Queue-Id: 1DFBE20066 X-HE-Tag: 1654869232-691327 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This patch switches the demotion target building logic to use memory tiers instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the default tier 1, and additional memory tiers will be added by drivers like dax kmem. This patch builds the demotion target for a NUMA node by looking at all memory tiers below the tier to which the NUMA node belongs. The closest node in the immediately following memory tier is used as a demotion target. Since we are now only building demotion targets for N_MEMORY NUMA nodes, the CPU hotplug calls are removed in this patch. The rank approach allows us to keep memory tier device IDs stable even if there is a need to change the tier ordering among different memory tiers. For example, DRAM nodes with CPUs will always be on memtier1, no matter how many tiers are higher or lower than these nodes. A new memory tier can be inserted into the tier hierarchy for a new set of nodes without affecting the node assignment of any existing memtier, provided that there is enough gap in the rank values for the new memtier. The absolute value of "rank" of a memtier doesn't necessarily carry any meaning. 
Its value relative to other memtiers decides the level of this memtier in the tier hierarchy. For now, this patch supports hardcoded rank values of 300, 200, and 100 for memory tiers 0, 1, and 2 respectively. Suggested-by: Wei Xu Signed-off-by: Aneesh Kumar K.V Below is the sysfs interface to read the rank value of a memory tier: /sys/devices/system/memtier/memtierN/rank This interface is read-only for now. Write support can be added when there is a need for more memory tiers (> 3) with flexible ordering requirements among them. --- include/linux/memory-tiers.h | 5 + include/linux/migrate.h | 13 -- mm/memory-tiers.c | 291 ++++++++++++++++++++++++++ mm/migrate.c | 394 ----------------------------------- mm/vmstat.c | 4 - 5 files changed, 296 insertions(+), 411 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index e102ec73ab80..18dd1ab7b96e 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -19,8 +19,13 @@ extern bool numa_demotion_enabled; int node_create_and_set_memory_tier(int node, int tier); +int next_demotion_node(int node); #else #define numa_demotion_enabled false +static inline int next_demotion_node(int node) +{ + return NUMA_NO_NODE; +} #endif /* CONFIG_TIERED_MEMORY */ diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 43e737215f33..93fab62e6548 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -75,19 +75,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #endif /* CONFIG_MIGRATION */ -#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA) -extern void set_migration_target_nodes(void); -extern void migrate_on_reclaim_init(void); -extern int next_demotion_node(int node); -#else -static inline void set_migration_target_nodes(void) {} -static inline void migrate_on_reclaim_init(void) {} -static inline int next_demotion_node(int node) -{ - return NUMA_NO_NODE; -} -#endif - #ifdef CONFIG_COMPACTION 
extern int PageMovable(struct page *page); extern void __SetPageMovable(struct page *page, struct address_space *mapping); diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 00d393a5a628..2f116912de43 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -4,6 +4,10 @@ #include #include #include +#include +#include + +#include "internal.h" struct memory_tier { struct list_head list; @@ -12,9 +16,76 @@ struct memory_tier { int rank; }; +struct demotion_nodes { + nodemask_t preferred; +}; + +static void establish_migration_targets(void); static DEFINE_MUTEX(memory_tier_lock); static LIST_HEAD(memory_tiers); +/* + * node_demotion[] examples: + * + * Example 1: + * + * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes. + * + * node distances: + * node 0 1 2 3 + * 0 10 20 30 40 + * 1 20 10 40 30 + * 2 30 40 10 40 + * 3 40 30 40 10 + * + * memory_tiers[0] = + * memory_tiers[1] = 0-1 + * memory_tiers[2] = 2-3 + * + * node_demotion[0].preferred = 2 + * node_demotion[1].preferred = 3 + * node_demotion[2].preferred = + * node_demotion[3].preferred = + * + * Example 2: + * + * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node. + * + * node distances: + * node 0 1 2 + * 0 10 20 30 + * 1 20 10 30 + * 2 30 30 10 + * + * memory_tiers[0] = + * memory_tiers[1] = 0-2 + * memory_tiers[2] = + * + * node_demotion[0].preferred = + * node_demotion[1].preferred = + * node_demotion[2].preferred = + * + * Example 3: + * + * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node. + * + * node distances: + * node 0 1 2 + * 0 10 20 30 + * 1 20 10 40 + * 2 30 40 10 + * + * memory_tiers[0] = 1 + * memory_tiers[1] = 0 + * memory_tiers[2] = 2 + * + * node_demotion[0].preferred = 2 + * node_demotion[1].preferred = 0 + * node_demotion[2].preferred = + * + */ +static struct demotion_nodes *node_demotion __read_mostly; + /* * Keep it simple by having direct mapping between * tier index and rank value. 
@@ -124,6 +195,7 @@ int node_create_and_set_memory_tier(int node, int tier) current_tier = __node_get_memory_tier(node); if (!current_tier) { ret = __node_create_and_set_memory_tier(node, tier); + establish_migration_targets(); goto out; } @@ -138,6 +210,7 @@ int node_create_and_set_memory_tier(int node, int tier) node_set(node, current_tier->nodelist); goto out; } + establish_migration_targets(); out: mutex_unlock(&memory_tier_lock); @@ -145,6 +218,223 @@ int node_create_and_set_memory_tier(int node, int tier) } EXPORT_SYMBOL_GPL(node_create_and_set_memory_tier); +static int __node_set_memory_tier(int node, int tier) +{ + int ret = 0; + struct memory_tier *memtier; + + memtier = __get_memory_tier_from_id(tier); + if (!memtier) { + ret = -EINVAL; + goto out; + } + node_set(node, memtier->nodelist); +out: + return ret; +} + +int node_set_memory_tier(int node, int tier) +{ + struct memory_tier *memtier; + int ret = 0; + + mutex_lock(&memory_tier_lock); + memtier = __node_get_memory_tier(node); + /* + * if node is already part of the tier proceed with the + * current tier value, because we might want to establish + * new migration paths now. The node might be added to a tier + * before it was made part of N_MEMORY, hence establish_migration_targets + * will have skipped this node. + */ + if (!memtier) + ret = __node_set_memory_tier(node, tier); + establish_migration_targets(); + + mutex_unlock(&memory_tier_lock); + + return ret; +} + +/** + * next_demotion_node() - Get the next node in the demotion path + * @node: The starting node to lookup the next node + * + * Return: node id for next memory node in the demotion path hierarchy + * from @node; NUMA_NO_NODE if @node is terminal. This does not keep + * @node online or guarantee that it *continues* to be the next demotion + * target. 
+ */ +int next_demotion_node(int node) +{ + struct demotion_nodes *nd; + int target; + + if (!node_demotion) + return NUMA_NO_NODE; + + nd = &node_demotion[node]; + + /* + * node_demotion[] is updated without excluding this + * function from running. + * + * Make sure to use RCU over entire code blocks if + * node_demotion[] reads need to be consistent. + */ + rcu_read_lock(); + /* + * If there are multiple target nodes, just select one + * target node randomly. + * + * In addition, we can also use round-robin to select + * target node, but we should introduce another variable + * for node_demotion[] to record last selected target node, + * that may cause cache ping-pong due to the changing of + * last target node. Or introducing per-cpu data to avoid + * caching issue, which seems more complicated. So selecting + * target node randomly seems better until now. + */ + target = node_random(&nd->preferred); + rcu_read_unlock(); + + return target; +} + +/* Disable reclaim-based migration. */ +static void __disable_all_migrate_targets(void) +{ + int node; + + for_each_node_state(node, N_MEMORY) + node_demotion[node].preferred = NODE_MASK_NONE; +} + +static void disable_all_migrate_targets(void) +{ + __disable_all_migrate_targets(); + + /* + * Ensure that the "disable" is visible across the system. + * Readers will see either a combination of before+disable + * state or disable+after. They will never see before and + * after state together. + */ + synchronize_rcu(); +} + +/* + * Find an automatic demotion target for all memory + * nodes. Failing here is OK. It might just indicate + * being at the end of a chain. 
+ */ +static void establish_migration_targets(void) +{ + struct memory_tier *memtier; + struct demotion_nodes *nd; + int target = NUMA_NO_NODE, node; + int distance, best_distance; + nodemask_t used; + + if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION)) + return; + + disable_all_migrate_targets(); + + for_each_node_state(node, N_MEMORY) { + best_distance = -1; + nd = &node_demotion[node]; + + memtier = __node_get_memory_tier(node); + if (!memtier || list_is_last(&memtier->list, &memory_tiers)) + continue; + /* + * Get the next memtier to find the demotion node list. + */ + memtier = list_next_entry(memtier, list); + + /* + * find_next_best_node, use 'used' nodemask as a skip list. + * Add all memory nodes except the selected memory tier + * nodelist to skip list so that we find the best node from the + * memtier nodelist. + */ + nodes_andnot(used, node_states[N_MEMORY], memtier->nodelist); + + /* + * Find all the nodes in the memory tier node list of same best distance. + * add them to the preferred mask. We randomly select between nodes + * in the preferred mask when allocating pages during demotion. + */ + do { + target = find_next_best_node(node, &used); + if (target == NUMA_NO_NODE) + break; + + distance = node_distance(node, target); + if (distance == best_distance || best_distance == -1) { + best_distance = distance; + node_set(target, nd->preferred); + } else { + break; + } + } while (1); + } +} + +/* + * This runs whether reclaim-based migration is enabled or not, + * which ensures that the user can turn reclaim-based migration + * at any time without needing to recalculate migration targets. + */ +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, + unsigned long action, void *_arg) +{ + struct memory_notify *arg = _arg; + + /* + * Only update the node migration order when a node is + * changing status, like online->offline. 
+ */ + if (arg->status_change_nid < 0) + return notifier_from_errno(0); + + switch (action) { + case MEM_OFFLINE: + /* + * In case we are moving out of N_MEMORY. Keep the node + * in the memory tier so that when we bring memory online, + * they appear in the right memory tier. We still need + * to rebuild the demotion order. + */ + mutex_lock(&memory_tier_lock); + establish_migration_targets(); + mutex_unlock(&memory_tier_lock); + break; + case MEM_ONLINE: + /* + * We ignore the error here, if the node already have the tier + * registered, we will continue to use that for the new memory + * we are adding here. + */ + node_set_memory_tier(arg->status_change_nid, DEFAULT_MEMORY_TIER); + break; + } + + return notifier_from_errno(0); +} + +static void __init migrate_on_reclaim_init(void) +{ + + if (IS_ENABLED(CONFIG_MIGRATION)) { + node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes), + GFP_KERNEL); + WARN_ON(!node_demotion); + } + hotplug_memory_notifier(migrate_on_reclaim_callback, 100); +} + static int __init memory_tier_init(void) { struct memory_tier *memtier; @@ -162,6 +452,7 @@ static int __init memory_tier_init(void) /* CPU only nodes are not part of memory tiers. */ memtier->nodelist = node_states[N_MEMORY]; + migrate_on_reclaim_init(); return 0; } diff --git a/mm/migrate.c b/mm/migrate.c index 29cacc217e38..0b554625a219 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2116,398 +2116,4 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, return 0; } #endif /* CONFIG_NUMA_BALANCING */ - -/* - * node_demotion[] example: - * - * Consider a system with two sockets. Each socket has - * three classes of memory attached: fast, medium and slow. - * Each memory class is placed in its own NUMA node. The - * CPUs are placed in the node with the "fast" memory. 
The - * 6 NUMA nodes (0-5) might be split among the sockets like - * this: - * - * Socket A: 0, 1, 2 - * Socket B: 3, 4, 5 - * - * When Node 0 fills up, its memory should be migrated to - * Node 1. When Node 1 fills up, it should be migrated to - * Node 2. The migration path start on the nodes with the - * processors (since allocations default to this node) and - * fast memory, progress through medium and end with the - * slow memory: - * - * 0 -> 1 -> 2 -> stop - * 3 -> 4 -> 5 -> stop - * - * This is represented in the node_demotion[] like this: - * - * { nr=1, nodes[0]=1 }, // Node 0 migrates to 1 - * { nr=1, nodes[0]=2 }, // Node 1 migrates to 2 - * { nr=0, nodes[0]=-1 }, // Node 2 does not migrate - * { nr=1, nodes[0]=4 }, // Node 3 migrates to 4 - * { nr=1, nodes[0]=5 }, // Node 4 migrates to 5 - * { nr=0, nodes[0]=-1 }, // Node 5 does not migrate - * - * Moreover some systems may have multiple slow memory nodes. - * Suppose a system has one socket with 3 memory nodes, node 0 - * is fast memory type, and node 1/2 both are slow memory - * type, and the distance between fast memory node and slow - * memory node is same. So the migration path should be: - * - * 0 -> 1/2 -> stop - * - * This is represented in the node_demotion[] like this: - * { nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2 - * { nr=0, nodes[0]=-1, }, // Node 1 dose not migrate - * { nr=0, nodes[0]=-1, }, // Node 2 does not migrate - */ - -/* - * Writes to this array occur without locking. Cycles are - * not allowed: Node X demotes to Y which demotes to X... - * - * If multiple reads are performed, a single rcu_read_lock() - * must be held over all reads to ensure that no cycles are - * observed. 
- */ -#define DEFAULT_DEMOTION_TARGET_NODES 15 - -#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES -#define DEMOTION_TARGET_NODES (MAX_NUMNODES - 1) -#else -#define DEMOTION_TARGET_NODES DEFAULT_DEMOTION_TARGET_NODES -#endif - -struct demotion_nodes { - unsigned short nr; - short nodes[DEMOTION_TARGET_NODES]; -}; - -static struct demotion_nodes *node_demotion __read_mostly; - -/** - * next_demotion_node() - Get the next node in the demotion path - * @node: The starting node to lookup the next node - * - * Return: node id for next memory node in the demotion path hierarchy - * from @node; NUMA_NO_NODE if @node is terminal. This does not keep - * @node online or guarantee that it *continues* to be the next demotion - * target. - */ -int next_demotion_node(int node) -{ - struct demotion_nodes *nd; - unsigned short target_nr, index; - int target; - - if (!node_demotion) - return NUMA_NO_NODE; - - nd = &node_demotion[node]; - - /* - * node_demotion[] is updated without excluding this - * function from running. RCU doesn't provide any - * compiler barriers, so the READ_ONCE() is required - * to avoid compiler reordering or read merging. - * - * Make sure to use RCU over entire code blocks if - * node_demotion[] reads need to be consistent. - */ - rcu_read_lock(); - target_nr = READ_ONCE(nd->nr); - - switch (target_nr) { - case 0: - target = NUMA_NO_NODE; - goto out; - case 1: - index = 0; - break; - default: - /* - * If there are multiple target nodes, just select one - * target node randomly. - * - * In addition, we can also use round-robin to select - * target node, but we should introduce another variable - * for node_demotion[] to record last selected target node, - * that may cause cache ping-pong due to the changing of - * last target node. Or introducing per-cpu data to avoid - * caching issue, which seems more complicated. So selecting - * target node randomly seems better until now. 
- */ - index = get_random_int() % target_nr; - break; - } - - target = READ_ONCE(nd->nodes[index]); - -out: - rcu_read_unlock(); - return target; -} - -/* Disable reclaim-based migration. */ -static void __disable_all_migrate_targets(void) -{ - int node, i; - - if (!node_demotion) - return; - - for_each_online_node(node) { - node_demotion[node].nr = 0; - for (i = 0; i < DEMOTION_TARGET_NODES; i++) - node_demotion[node].nodes[i] = NUMA_NO_NODE; - } -} - -static void disable_all_migrate_targets(void) -{ - __disable_all_migrate_targets(); - - /* - * Ensure that the "disable" is visible across the system. - * Readers will see either a combination of before+disable - * state or disable+after. They will never see before and - * after state together. - * - * The before+after state together might have cycles and - * could cause readers to do things like loop until this - * function finishes. This ensures they can only see a - * single "bad" read and would, for instance, only loop - * once. - */ - synchronize_rcu(); -} - -/* - * Find an automatic demotion target for 'node'. - * Failing here is OK. It might just indicate - * being at the end of a chain. - */ -static int establish_migrate_target(int node, nodemask_t *used, - int best_distance) -{ - int migration_target, index, val; - struct demotion_nodes *nd; - - if (!node_demotion) - return NUMA_NO_NODE; - - nd = &node_demotion[node]; - - migration_target = find_next_best_node(node, used); - if (migration_target == NUMA_NO_NODE) - return NUMA_NO_NODE; - - /* - * If the node has been set a migration target node before, - * which means it's the best distance between them. Still - * check if this node can be demoted to other target nodes - * if they have a same best distance. 
- */ - if (best_distance != -1) { - val = node_distance(node, migration_target); - if (val > best_distance) - goto out_clear; - } - - index = nd->nr; - if (WARN_ONCE(index >= DEMOTION_TARGET_NODES, - "Exceeds maximum demotion target nodes\n")) - goto out_clear; - - nd->nodes[index] = migration_target; - nd->nr++; - - return migration_target; -out_clear: - node_clear(migration_target, *used); - return NUMA_NO_NODE; -} - -/* - * When memory fills up on a node, memory contents can be - * automatically migrated to another node instead of - * discarded at reclaim. - * - * Establish a "migration path" which will start at nodes - * with CPUs and will follow the priorities used to build the - * page allocator zonelists. - * - * The difference here is that cycles must be avoided. If - * node0 migrates to node1, then neither node1, nor anything - * node1 migrates to can migrate to node0. Also one node can - * be migrated to multiple nodes if the target nodes all have - * a same best-distance against the source node. - * - * This function can run simultaneously with readers of - * node_demotion[]. However, it can not run simultaneously - * with itself. Exclusion is provided by memory hotplug events - * being single-threaded. - */ -static void __set_migration_target_nodes(void) -{ - nodemask_t next_pass; - nodemask_t this_pass; - nodemask_t used_targets = NODE_MASK_NONE; - int node, best_distance; - - /* - * Avoid any oddities like cycles that could occur - * from changes in the topology. This will leave - * a momentary gap when migration is disabled. - */ - disable_all_migrate_targets(); - - /* - * Allocations go close to CPUs, first. Assume that - * the migration path starts at the nodes with CPUs. - */ - next_pass = node_states[N_CPU]; -again: - this_pass = next_pass; - next_pass = NODE_MASK_NONE; - /* - * To avoid cycles in the migration "graph", ensure - * that migration sources are not future targets by - * setting them in 'used_targets'. 
Do this only - * once per pass so that multiple source nodes can - * share a target node. - * - * 'used_targets' will become unavailable in future - * passes. This limits some opportunities for - * multiple source nodes to share a destination. - */ - nodes_or(used_targets, used_targets, this_pass); - - for_each_node_mask(node, this_pass) { - best_distance = -1; - - /* - * Try to set up the migration path for the node, and the target - * migration nodes can be multiple, so doing a loop to find all - * the target nodes if they all have a best node distance. - */ - do { - int target_node = - establish_migrate_target(node, &used_targets, - best_distance); - - if (target_node == NUMA_NO_NODE) - break; - - if (best_distance == -1) - best_distance = node_distance(node, target_node); - - /* - * Visit targets from this pass in the next pass. - * Eventually, every node will have been part of - * a pass, and will become set in 'used_targets'. - */ - node_set(target_node, next_pass); - } while (1); - } - /* - * 'next_pass' contains nodes which became migration - * targets in this pass. Make additional passes until - * no more migrations targets are available. - */ - if (!nodes_empty(next_pass)) - goto again; -} - -/* - * For callers that do not hold get_online_mems() already. - */ -void set_migration_target_nodes(void) -{ - get_online_mems(); - __set_migration_target_nodes(); - put_online_mems(); -} - -/* - * This leaves migrate-on-reclaim transiently disabled between - * the MEM_GOING_OFFLINE and MEM_OFFLINE events. This runs - * whether reclaim-based migration is enabled or not, which - * ensures that the user can turn reclaim-based migration at - * any time without needing to recalculate migration targets. - * - * These callbacks already hold get_online_mems(). That is why - * __set_migration_target_nodes() can be used as opposed to - * set_migration_target_nodes(). 
- */ -#ifdef CONFIG_MEMORY_HOTPLUG -static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, - unsigned long action, void *_arg) -{ - struct memory_notify *arg = _arg; - - /* - * Only update the node migration order when a node is - * changing status, like online->offline. This avoids - * the overhead of synchronize_rcu() in most cases. - */ - if (arg->status_change_nid < 0) - return notifier_from_errno(0); - - switch (action) { - case MEM_GOING_OFFLINE: - /* - * Make sure there are not transient states where - * an offline node is a migration target. This - * will leave migration disabled until the offline - * completes and the MEM_OFFLINE case below runs. - */ - disable_all_migrate_targets(); - break; - case MEM_OFFLINE: - case MEM_ONLINE: - /* - * Recalculate the target nodes once the node - * reaches its final state (online or offline). - */ - __set_migration_target_nodes(); - break; - case MEM_CANCEL_OFFLINE: - /* - * MEM_GOING_OFFLINE disabled all the migration - * targets. Reenable them. - */ - __set_migration_target_nodes(); - break; - case MEM_GOING_ONLINE: - case MEM_CANCEL_ONLINE: - break; - } - - return notifier_from_errno(0); -} -#endif - -void __init migrate_on_reclaim_init(void) -{ - node_demotion = kcalloc(nr_node_ids, - sizeof(struct demotion_nodes), - GFP_KERNEL); - WARN_ON(!node_demotion); -#ifdef CONFIG_MEMORY_HOTPLUG - hotplug_memory_notifier(migrate_on_reclaim_callback, 100); -#endif - /* - * At this point, all numa nodes with memory/CPus have their state - * properly set, so we can build the demotion order now. - * Let us hold the cpu_hotplug lock just, as we could possibily have - * CPU hotplug events during boot. 
- */ - cpus_read_lock(); - set_migration_target_nodes(); - cpus_read_unlock(); -} #endif /* CONFIG_NUMA */ - - diff --git a/mm/vmstat.c b/mm/vmstat.c index 373d2730fcf2..35c6ff97cf29 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -28,7 +28,6 @@ #include #include #include -#include #include "internal.h" @@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu) if (!node_state(cpu_to_node(cpu), N_CPU)) { node_set_state(cpu_to_node(cpu), N_CPU); - set_migration_target_nodes(); } return 0; @@ -2085,7 +2083,6 @@ static int vmstat_cpu_dead(unsigned int cpu) return 0; node_clear_state(node, N_CPU); - set_migration_target_nodes(); return 0; } @@ -2118,7 +2115,6 @@ void __init init_mm_internals(void) start_shepherd_timer(); #endif - migrate_on_reclaim_init(); #ifdef CONFIG_PROC_FS proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op); proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op); From patchwork Fri Jun 10 13:52:22 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12877626 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 52009C43334 for ; Fri, 10 Jun 2022 13:54:08 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E9B0D8D00A9; Fri, 10 Jun 2022 09:54:07 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id E494E8D009C; Fri, 10 Jun 2022 09:54:07 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id CC4528D00A9; Fri, 10 Jun 2022 09:54:07 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id BB70D8D009C for ; Fri, 10 Jun 2022 09:54:07 -0400 (EDT) Received: from 
smtpin07.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay13.hostedemail.com (Postfix) with ESMTP id 918B3602B0 for ; Fri, 10 Jun 2022 13:54:07 +0000 (UTC) X-FDA: 79562470134.07.6E4BFD1 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf14.hostedemail.com (Postfix) with ESMTP id 1353010005F for ; Fri, 10 Jun 2022 13:54:06 +0000 (UTC) Received: from pps.filterd (m0098409.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 25ADSwk9014931; Fri, 10 Jun 2022 13:53:47 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=dDKqIfopwnXr6PuIPbN3tqtguXFEUdVD0LR6XZqxlbc=; b=HRuYDjZWC2cApnGXqBLPUfqa3acVvk4qqme9jQ8OSf1H5uG9RzC4TBnq+uztPvPn9sUR wF8J3/RL29LKfj0lLlR6f6d3H7WuofL4lRgA37XylIzAugGhDDEqwygZspFQV5JSsDIh /xobRaH+brGwSKVsaY89vbnmb+/wkGbjAeyr/1lsfg3ogakAn04DFGBfqy0ArRHJ6PmL qKJouoENkp0Jpg+qCPA1ucdNNrUPne8YQUo3YxGzRjActvj6HAlEFLLzSma9ONxhAC9k SksTSzRDcyZ7yl+ONeTfKlDNtCAcuO11WyxYZvxwP4LVRBbnhwALM89SSwjFgzitR1AM 5Q== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm6vn0hd0-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:53:47 +0000 Received: from m0098409.ppops.net (m0098409.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 25ADUDcW019136; Fri, 10 Jun 2022 13:53:46 GMT Received: from ppma04dal.us.ibm.com (7a.29.35a9.ip4.static.sl-reverse.com [169.53.41.122]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm6vn0hcp-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:53:46 +0000 Received: from pps.filterd (ppma04dal.us.ibm.com [127.0.0.1]) by ppma04dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 25ADZRaQ010073; Fri, 10 Jun 2022 13:53:45 GMT Received: from 
b03cxnp07029.gho.boulder.ibm.com (b03cxnp07029.gho.boulder.ibm.com [9.17.130.16]) by ppma04dal.us.ibm.com with ESMTP id 3gfy1au77d-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:53:45 +0000 Received: from b03ledav003.gho.boulder.ibm.com (b03ledav003.gho.boulder.ibm.com [9.17.130.234]) by b03cxnp07029.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 25ADriwi32833824 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 10 Jun 2022 13:53:44 GMT Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 372C86A051; Fri, 10 Jun 2022 13:53:44 +0000 (GMT) Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id C88546A047; Fri, 10 Jun 2022 13:53:35 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.90.151]) by b03ledav003.gho.boulder.ibm.com (Postfix) with ESMTP; Fri, 10 Jun 2022 13:53:35 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K.V" Subject: [PATCH v6 06/13] mm/demotion: Expose memory tier details via sysfs Date: Fri, 10 Jun 2022 19:22:22 +0530 Message-Id: <20220610135229.182859-7-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> References: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: -mmRbWxBQy5VxIBwmRO7wr0XFrpI03gF X-Proofpoint-ORIG-GUID: MO1LcfSE9H5le05nc_sA0wsH95q8-xF3 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 
definitions=2022-06-10_06,2022-06-09_02,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1015 mlxlogscore=999 mlxscore=0 adultscore=0 spamscore=0 priorityscore=1501 phishscore=0 bulkscore=0 suspectscore=0 impostorscore=0 lowpriorityscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2206100056 ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1654869247; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=dDKqIfopwnXr6PuIPbN3tqtguXFEUdVD0LR6XZqxlbc=; b=y/bMcwslwf9TNn1etQwnfpkShn6QDNekiR6bOi6hsC8Bpr+9QtGuGtfOwdQrrXBNXoF2Tw 7WLKVCXRPUWsfRqibS8TqvaAk1ZNuDrSOr2CyFt0ccVygr1ylbpN+OhplRtOoAqoMHc15+ CULGae4eWSgtr7+XPajEInjgl+Y3XI0= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1654869247; a=rsa-sha256; cv=none; b=Ogja51PUr5JO2vDSZIyCJOx3/U0858zMHOWuPjjauIdTtH2EEDIit/7rAA09e8ibH+1itu pEm0lV6Y6qOnVbQEimuRYxtWm/a1N/RgI/HT9PJVePJ551d+yel0US9SjrMSBtymfWNPeJ 3IF+7R/KeS9jauqAtsAO6u8T3xl9tJc= ARC-Authentication-Results: i=1; imf14.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=HRuYDjZW; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf14.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspam-User: Authentication-Results: imf14.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=HRuYDjZW; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf14.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspamd-Server: rspam03 X-Stat-Signature: ixqcte9txdc8ach3pruqb5fqwhf9mij5 X-Rspamd-Queue-Id: 1353010005F 
X-HE-Tag: 1654869246-600092 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This patch adds /sys/devices/system/memtier/ where all memory tier related details can be found. All created memory tiers will be listed there as /sys/devices/system/memtier/memtierN/ The nodes which are part of a specific memory tier can be listed via /sys/devices/system/memtier/memtierN/nodelist The rank value of a memory tier can be listed via /sys/devices/system/memtier/memtierN/rank /sys/devices/system/memtier/max_tier shows the maximum number of memory tiers that can be created. /sys/devices/system/memtier/default_tier shows the memory tier to which NUMA nodes get added by default if not assigned a specific memory tier. Signed-off-by: Aneesh Kumar K.V --- mm/memory-tiers.c | 99 +++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 95 insertions(+), 4 deletions(-) diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 2f116912de43..51210f5efc1f 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -11,8 +11,8 @@ struct memory_tier { struct list_head list; + struct device dev; nodemask_t nodelist; - int id; int rank; }; @@ -20,6 +20,7 @@ struct demotion_nodes { nodemask_t preferred; }; +#define to_memory_tier(device) container_of(device, struct memory_tier, dev) static void establish_migration_targets(void); static DEFINE_MUTEX(memory_tier_lock); static LIST_HEAD(memory_tiers); @@ -86,6 +87,52 @@ static LIST_HEAD(memory_tiers); */ static struct demotion_nodes *node_demotion __read_mostly; +static struct bus_type memory_tier_subsys = { + .name = "memtier", + .dev_name = "memtier", +}; + +static ssize_t nodelist_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct memory_tier *memtier = to_memory_tier(dev); + + return sysfs_emit(buf, "%*pbl\n", + nodemask_pr_args(&memtier->nodelist)); +} +static DEVICE_ATTR_RO(nodelist); +
+static ssize_t rank_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct memory_tier *memtier = to_memory_tier(dev); + + return sysfs_emit(buf, "%d\n", memtier->rank); +} +static DEVICE_ATTR_RO(rank); + +static struct attribute *memory_tier_dev_attrs[] = { + &dev_attr_nodelist.attr, + &dev_attr_rank.attr, + NULL +}; + +static const struct attribute_group memory_tier_dev_group = { + .attrs = memory_tier_dev_attrs, +}; + +static const struct attribute_group *memory_tier_dev_groups[] = { + &memory_tier_dev_group, + NULL +}; + +static void memory_tier_device_release(struct device *dev) +{ + struct memory_tier *tier = to_memory_tier(dev); + + kfree(tier); +} + /* * Keep it simple by having direct mapping between * tier index and rank value. @@ -121,6 +168,7 @@ static void insert_memory_tier(struct memory_tier *memtier) static struct memory_tier *register_memory_tier(unsigned int tier, unsigned int rank) { + int error; struct memory_tier *memtier; if (tier >= MAX_MEMORY_TIERS) @@ -130,11 +178,20 @@ static struct memory_tier *register_memory_tier(unsigned int tier, if (!memtier) return ERR_PTR(-ENOMEM); - memtier->id = tier; + memtier->dev.id = tier; memtier->rank = rank; + memtier->dev.bus = &memory_tier_subsys; + memtier->dev.release = memory_tier_device_release; + memtier->dev.groups = memory_tier_dev_groups; insert_memory_tier(memtier); + error = device_register(&memtier->dev); + if (error) { + list_del(&memtier->list); + put_device(&memtier->dev); + return ERR_PTR(error); + } return memtier; } @@ -154,7 +211,7 @@ static struct memory_tier *__get_memory_tier_from_id(int id) struct memory_tier *memtier; list_for_each_entry(memtier, &memory_tiers, list) { - if (memtier->id == id) + if (memtier->dev.id == id) return memtier; } return NULL; @@ -199,7 +256,7 @@ int node_create_and_set_memory_tier(int node, int tier) goto out; } - if (current_tier->id == tier) + if (current_tier->dev.id == tier) goto out; node_clear(node, 
current_tier->nodelist); @@ -435,10 +492,44 @@ static void __init migrate_on_reclaim_init(void) hotplug_memory_notifier(migrate_on_reclaim_callback, 100); } +static ssize_t +max_tier_show(struct device *dev, struct device_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS); +} +static DEVICE_ATTR_RO(max_tier); + +static ssize_t +default_tier_show(struct device *dev, struct device_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "memtier%d\n", DEFAULT_MEMORY_TIER); +} +static DEVICE_ATTR_RO(default_tier); + +static struct attribute *memory_tier_attrs[] = { + &dev_attr_max_tier.attr, + &dev_attr_default_tier.attr, + NULL +}; + +static const struct attribute_group memory_tier_attr_group = { + .attrs = memory_tier_attrs, +}; + +static const struct attribute_group *memory_tier_attr_groups[] = { + &memory_tier_attr_group, + NULL, +}; + static int __init memory_tier_init(void) { + int ret; struct memory_tier *memtier; + ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups); + if (ret) + pr_err("%s() failed to register subsystem: %d\n", __func__, ret); + /* * Register only default memory tier to hide all empty * memory tier from sysfs. 
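For anyone consuming the new sysfs files from userspace: nodelist_show() in the patch above emits the tier's nodemask with "%*pbl", which prints a comma-separated range list such as "0-3,8". The following is a minimal userspace sketch, not part of the patch, of how that string could be decoded after reading /sys/devices/system/memtier/memtierN/nodelist. The helper name parse_nodelist() and the byte-per-node output array are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): parse a range list such as
 * "0-3,8\n" into present[], setting present[n] = 1 for every listed
 * node id n. Parsing stops at the first newline or NUL.
 *
 * Returns the number of distinct node ids found, 0 for an empty list,
 * or -1 on malformed input or an id >= max_nodes.
 */
int parse_nodelist(const char *s, unsigned char *present, size_t max_nodes)
{
	int count = 0;

	assert(s != NULL && present != NULL);
	memset(present, 0, max_nodes);

	while (*s != '\0' && *s != '\n') {
		char *end;
		unsigned long lo, hi, n;

		/* Every item must start with a digit. */
		if (*s < '0' || *s > '9')
			return -1;
		lo = strtoul(s, &end, 10);
		hi = lo;
		s = end;

		/* Optional "-hi" part turns the item into a range. */
		if (*s == '-') {
			s++;
			if (*s < '0' || *s > '9')
				return -1;
			hi = strtoul(s, &end, 10);
			s = end;
		}

		if (hi < lo || hi >= max_nodes)
			return -1;

		for (n = lo; n <= hi; n++) {
			if (!present[n]) {
				present[n] = 1;
				count++;
			}
		}

		if (*s == ',') {
			s++;
			/* Reject a trailing comma. */
			if (*s == '\0' || *s == '\n')
				return -1;
		} else if (*s != '\0' && *s != '\n') {
			return -1;
		}
	}

	return count;
}
```

A caller would typically read the attribute into a small buffer with fopen()/fgets() and pass the buffer here with max_nodes set to the size of its own array. The parser deliberately rejects anything it does not recognise rather than guessing, since the attribute format is the only contract it relies on.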
From patchwork Fri Jun 10 13:52:23 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12877627 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 75A3FC433EF for ; Fri, 10 Jun 2022 13:54:19 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 107898D00AA; Fri, 10 Jun 2022 09:54:19 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 0B5E08D009C; Fri, 10 Jun 2022 09:54:19 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E71EC8D00AA; Fri, 10 Jun 2022 09:54:18 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id D7F7A8D009C for ; Fri, 10 Jun 2022 09:54:18 -0400 (EDT) Received: from smtpin23.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay11.hostedemail.com (Postfix) with ESMTP id AAA92811E4 for ; Fri, 10 Jun 2022 13:54:18 +0000 (UTC) X-FDA: 79562470596.23.8572D0A Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf16.hostedemail.com (Postfix) with ESMTP id 114FC180057 for ; Fri, 10 Jun 2022 13:54:17 +0000 (UTC) Received: from pps.filterd (m0098396.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 25AD7kOW030120; Fri, 10 Jun 2022 13:53:56 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=mCkQY9eq8ljfIN+4+Uneh6vfxHdvwScC98u2nmNKZr0=; b=UwlQHqVnyqOTTD5Oj94N7FHv21odR8jCf02WhNc+cXWr8fvxQn2WbYtXYCqHuyc/2l1J 
/fSiVSDq0D+pkH9UDVZPEaoAQ0R/JK4j7coa6uqcN39JQLODQErTIG4abhEA7f5tPz7l /2z70vMPf1htUbs8PNDR57NfgkZZBLHPGZuj+FxX9Aw9gZHanyFd0ZCeYmjzfOSWnYFs fPlgGy6lpbarokUNEfhQ7RNVYGNGqLo3azc6fi8LN/k3uyRfI8Dn/RgHcC+EpcaQ9HyO IFpXwUxkSaOFwYmqxn18GYEtFP/BSukFVfg68FB5wrrL6uffOqdnJr6KGVR/dz1IIgVk gg== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm4vaaxxv-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:53:56 +0000 Received: from m0098396.ppops.net (m0098396.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 25ADPKNW032120; Fri, 10 Jun 2022 13:53:55 GMT Received: from ppma02dal.us.ibm.com (a.bd.3ea9.ip4.static.sl-reverse.com [169.62.189.10]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm4vaaxxj-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:53:55 +0000 Received: from pps.filterd (ppma02dal.us.ibm.com [127.0.0.1]) by ppma02dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 25ADZS9o014172; Fri, 10 Jun 2022 13:53:54 GMT Received: from b03cxnp08026.gho.boulder.ibm.com (b03cxnp08026.gho.boulder.ibm.com [9.17.130.18]) by ppma02dal.us.ibm.com with ESMTP id 3gfy1bb4kq-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:53:54 +0000 Received: from b03ledav003.gho.boulder.ibm.com (b03ledav003.gho.boulder.ibm.com [9.17.130.234]) by b03cxnp08026.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 25ADrrun32702788 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 10 Jun 2022 13:53:53 GMT Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 1643B6A057; Fri, 10 Jun 2022 13:53:53 +0000 (GMT) Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 09A7C6A04D; Fri, 10 Jun 2022 13:53:45 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown 
[9.43.90.151]) by b03ledav003.gho.boulder.ibm.com (Postfix) with ESMTP; Fri, 10 Jun 2022 13:53:44 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K.V" Subject: [PATCH v6 07/13] mm/demotion: Add per node memory tier attribute to sysfs Date: Fri, 10 Jun 2022 19:22:23 +0530 Message-Id: <20220610135229.182859-8-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> References: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: lpIccryA9ZlRNWpkgFKmSLojMXjMgkGq X-Proofpoint-GUID: 4ZtrkFZg1m4-wH4m24WbekH28Sqg_6ZZ X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-06-10_06,2022-06-09_02,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 mlxscore=0 phishscore=0 priorityscore=1501 bulkscore=0 spamscore=0 malwarescore=0 adultscore=0 impostorscore=0 suspectscore=0 mlxlogscore=999 clxscore=1015 lowpriorityscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2206100056 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1654869258; a=rsa-sha256; cv=none; b=G6Fzh3a6ngSwffD9JNUxC/Qt3V23F/crOUJf8YCxrtp86ByAT2zDtFfX8hCY96kzFz3pfH g86RTakhu1HrGJ9pvb9M1sH91ZynuZGqMYUNLPYAkYQTEugCSRiK9YgR6xq7EtHpiuOC1S 9wdqgYVZOsaudxGf9Hii09QIdxoPB8U= ARC-Authentication-Results: i=1; imf16.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=UwlQHqVn; spf=pass (imf16.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) 
smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1654869258; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=mCkQY9eq8ljfIN+4+Uneh6vfxHdvwScC98u2nmNKZr0=; b=Do1yDjSqmi7LJHNyCMrfqMaNLQsSFe2EDGEvtrfSDOFx1YWlDAi2kncC2NaAFwEYkX7bZP CbDe4WiBYYoKDJt44kEo8ZALOPEg2CJadvrGlyDaFYurwUq6GldX1MIfdx/UUJJPXbWfCP F3OgROjnoklJN2T1jn62LC09hAzihd8= X-Rspamd-Queue-Id: 114FC180057 Authentication-Results: imf16.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=UwlQHqVn; spf=pass (imf16.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com X-Rspam-User: X-Rspamd-Server: rspam06 X-Stat-Signature: 7yh3xrs47pjwim8ubgchifwa9oy5rgyn X-HE-Tag: 1654869257-23497 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Add support to modify the memory tier for a NUMA node. /sys/devices/system/node/nodeN/memtier where N = node id When read, it lists the memory tier that the node belongs to. When written, the kernel moves the node into the specified memory tier; the tier assignment of all other nodes is not affected. If the memory tier does not exist, an error is returned.
Suggested-by: Wei Xu Signed-off-by: Jagdish Gediya Signed-off-by: Aneesh Kumar K.V --- drivers/base/node.c | 39 ++++++++++++++++++++++++++++++++ include/linux/memory-tiers.h | 3 +++ mm/memory-tiers.c | 44 ++++++++++++++++++++++++++++++++++++ 3 files changed, 86 insertions(+) diff --git a/drivers/base/node.c b/drivers/base/node.c index 0ac6376ef7a1..599ed64d910f 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -20,6 +20,7 @@ #include #include #include +#include static struct bus_type node_subsys = { .name = "node", @@ -560,11 +561,49 @@ static ssize_t node_read_distance(struct device *dev, } static DEVICE_ATTR(distance, 0444, node_read_distance, NULL); +#ifdef CONFIG_TIERED_MEMORY +static ssize_t memtier_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + int node = dev->id; + int tier_index = node_get_memory_tier_id(node); + + if (tier_index != -1) + return sysfs_emit(buf, "%d\n", tier_index); + return 0; +} + +static ssize_t memtier_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + unsigned long tier; + int node = dev->id; + int ret; + + ret = kstrtoul(buf, 10, &tier); + if (ret) + return ret; + + ret = node_reset_memory_tier(node, tier); + if (ret) + return ret; + + return count; +} + +static DEVICE_ATTR_RW(memtier); +#endif + static struct attribute *node_dev_attrs[] = { &dev_attr_meminfo.attr, &dev_attr_numastat.attr, &dev_attr_distance.attr, &dev_attr_vmstat.attr, +#ifdef CONFIG_TIERED_MEMORY + &dev_attr_memtier.attr, +#endif NULL }; diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 18dd1ab7b96e..e70f0040d845 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -20,6 +20,9 @@ extern bool numa_demotion_enabled; int node_create_and_set_memory_tier(int node, int tier); int next_demotion_node(int node); +int node_set_memory_tier(int node, int tier); +int node_get_memory_tier_id(int node); +int node_reset_memory_tier(int node, 
int tier); #else #define numa_demotion_enabled false static inline int next_demotion_node(int node) diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 51210f5efc1f..7bfdfac4d43e 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -313,6 +313,50 @@ int node_set_memory_tier(int node, int tier) return ret; } +int node_get_memory_tier_id(int node) +{ + int tier = -1; + struct memory_tier *memtier; + /* + * Make sure memory tier is not unregistered + * while it is being read. + */ + mutex_lock(&memory_tier_lock); + memtier = __node_get_memory_tier(node); + if (memtier) + tier = memtier->dev.id; + mutex_unlock(&memory_tier_lock); + + return tier; +} + +int node_reset_memory_tier(int node, int tier) +{ + struct memory_tier *current_tier; + int ret = 0; + + mutex_lock(&memory_tier_lock); + + current_tier = __node_get_memory_tier(node); + if (!current_tier || current_tier->dev.id == tier) + goto out; + + node_clear(node, current_tier->nodelist); + + ret = __node_set_memory_tier(node, tier); + if (ret) { + /* reset it back to older tier */ + node_set(node, current_tier->nodelist); + goto out; + } + + establish_migration_targets(); +out: + mutex_unlock(&memory_tier_lock); + + return ret; +} + /** * next_demotion_node() - Get the next node in the demotion path * @node: The starting node to lookup the next node From patchwork Fri Jun 10 13:52:24 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12877630 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8F742C433EF for ; Fri, 10 Jun 2022 13:54:55 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 2B5618D00AD; Fri, 10 Jun 2022 09:54:55 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 263148D009C; Fri, 10 Jun 2022 
09:54:55 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 104528D00AD; Fri, 10 Jun 2022 09:54:55 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id 00FB48D009C for ; Fri, 10 Jun 2022 09:54:54 -0400 (EDT) Received: from smtpin15.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id DAE7D203F0 for ; Fri, 10 Jun 2022 13:54:54 +0000 (UTC) X-FDA: 79562472108.15.B672526 Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by imf16.hostedemail.com (Postfix) with ESMTP id 7503B18004C for ; Fri, 10 Jun 2022 13:54:54 +0000 (UTC) Received: from pps.filterd (m0098414.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 25ADDjen032541; Fri, 10 Jun 2022 13:54:05 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=E8x/yh96TqC956iIz99rLchksQSqqOP2r9I9Ns2lU6U=; b=taKkK7F75PuCp32ofRyxChUYraLZW1q4rwICekYwVPCe/LIf3qhw94TNygI+SMbSqLyn msOiPC/wyHACvZFYBTV3GnLE2YWOnihQqvXMquJM0ZZ5x6SmovjyeAqO5nPn45/uCPui VX9zOAPvOEEi2KYRGQiWose6qmT6HGEYlw6VsYbaU2+NDU3MhSkTQx+I4C14X3fsINc0 gCGuLstWKMPRjeaK44KoRYiZYvjHamuEHib5S/E19x46QNlEgJ4iPgFtVxhbzmF/E1oB 4Oy+zYt3MdFiFBDGVPV+8JCibHCdJHbSHBuyu1bPrcFI5byeCU+RCF8CtzVoBo+x6nXM CA== Received: from pps.reinject (localhost [127.0.0.1]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm6na0s6m-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:54:04 +0000 Received: from m0098414.ppops.net (m0098414.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 25ADEZZr002211; Fri, 10 Jun 2022 13:54:04 GMT Received: from ppma04wdc.us.ibm.com 
(1a.90.2fa9.ip4.static.sl-reverse.com [169.47.144.26]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm6na0s6a-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:54:04 +0000 Received: from pps.filterd (ppma04wdc.us.ibm.com [127.0.0.1]) by ppma04wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 25ADZRnF004832; Fri, 10 Jun 2022 13:54:03 GMT Received: from b03cxnp08028.gho.boulder.ibm.com (b03cxnp08028.gho.boulder.ibm.com [9.17.130.20]) by ppma04wdc.us.ibm.com with ESMTP id 3gfy1a91y2-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:54:03 +0000 Received: from b03ledav003.gho.boulder.ibm.com (b03ledav003.gho.boulder.ibm.com [9.17.130.234]) by b03cxnp08028.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 25ADs2SQ39059824 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 10 Jun 2022 13:54:02 GMT Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 7C9BE6A047; Fri, 10 Jun 2022 13:54:02 +0000 (GMT) Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id E6DB66A054; Fri, 10 Jun 2022 13:53:53 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.90.151]) by b03ledav003.gho.boulder.ibm.com (Postfix) with ESMTP; Fri, 10 Jun 2022 13:53:53 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K.V" Subject: [PATCH v6 08/13] mm/demotion: Add support for memory tier creation from userspace Date: Fri, 10 Jun 2022 19:22:24 +0530 Message-Id: <20220610135229.182859-9-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 
In-Reply-To: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> References: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: 9V-4J5ghCf3E5erXmEax8vOzO2KdxVXG X-Proofpoint-ORIG-GUID: lmUY9NtnGV2mAsJ6OWZHXEZ55nrrCO5M X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-06-10_06,2022-06-09_02,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 malwarescore=0 priorityscore=1501 impostorscore=0 mlxlogscore=999 clxscore=1015 adultscore=0 suspectscore=0 lowpriorityscore=0 spamscore=0 mlxscore=0 phishscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2206100056 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1654869294; a=rsa-sha256; cv=none; b=M2AHg9kB2zzrRwOclSyBSD0wX2IeHuI5nbMezsau2NQiFX6C6woZHNXFiY4iLND8vgNltS AYqijM6Vur51SbFMEU+7tTX0ZmYCRrQgQzgZakNydRauBf2W1rwqcRkwnjcrJQmDZRbIwK Zj93bcRt6Y3YHeHH1gUsBaD6P8KF1dY= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1654869294; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=E8x/yh96TqC956iIz99rLchksQSqqOP2r9I9Ns2lU6U=; b=hSmBZ/lWa8rzrOBOZgqOj1WBS5Ed+TizpaF23Ok1OnrLbmD/NlZvCsp0dAeDy9wo8S9M9w Zn0Zv9iaHRaRGdUya/k7L5ucnZIWBPtmT+mWQTo0QcbnnqVEiKUioAF2ILIBRB5+ue7tIA ZUwmN5p5SAO4Bjwm+DdZ0JekrvcmcEc= ARC-Authentication-Results: i=1; imf16.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=taKkK7F7; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf16.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com Authentication-Results: 
imf16.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=taKkK7F7; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf16.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Stat-Signature: of66wgiises5typd9ucfjoh985my4b1p X-Rspamd-Queue-Id: 7503B18004C X-Rspamd-Server: rspam12 X-Rspam-User: X-HE-Tag: 1654869294-882479 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This patch adds support for creating memory tiers with a specific rank value from userspace. To avoid races while creating memory tiers, the /sys/devices/system/memtier/create_tier_from_rank file is provided. Writing a rank value to this file creates a new memory tier with that rank. Memory tiers created from userspace are destroyed when their nodelist becomes empty. Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 3 +- mm/memory-tiers.c | 74 ++++++++++++++++++++++++++++++++++++ 2 files changed, 76 insertions(+), 1 deletion(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index e70f0040d845..52896f5970b7 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -15,7 +15,8 @@ #define MEMORY_RANK_PMEM 100 #define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM -#define MAX_MEMORY_TIERS 3 +#define MAX_STATIC_MEMORY_TIERS 3 +#define MAX_MEMORY_TIERS (MAX_STATIC_MEMORY_TIERS + 2) extern bool numa_demotion_enabled; int node_create_and_set_memory_tier(int node, int tier); diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 7bfdfac4d43e..de3b7403ae6f 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -6,6 +6,7 @@ #include #include #include +#include #include "internal.h" @@ -126,9 +127,12 @@ static const struct attribute_group *memory_tier_dev_groups[] = { NULL }; +static
DEFINE_IDA(memtier_dev_id); static void memory_tier_device_release(struct device *dev) { struct memory_tier *tier = to_memory_tier(dev); + if (tier->dev.id >= MAX_STATIC_MEMORY_TIERS) + ida_free(&memtier_dev_id, tier->dev.id); kfree(tier); } @@ -195,6 +199,17 @@ static struct memory_tier *register_memory_tier(unsigned int tier, return memtier; } +static void unregister_memory_tier(struct memory_tier *memtier) +{ + /* + * Don't destroy static memory tiers. + */ + if (memtier->dev.id < MAX_STATIC_MEMORY_TIERS) + return; + list_del(&memtier->list); + device_unregister(&memtier->dev); +} + static struct memory_tier *__node_get_memory_tier(int node) { struct memory_tier *memtier; @@ -267,6 +282,10 @@ int node_create_and_set_memory_tier(int node, int tier) node_set(node, current_tier->nodelist); goto out; } + + if (nodes_empty(current_tier->nodelist)) + unregister_memory_tier(current_tier); + establish_migration_targets(); out: mutex_unlock(&memory_tier_lock); @@ -350,6 +369,9 @@ int node_reset_memory_tier(int node, int tier) goto out; } + if (nodes_empty(current_tier->nodelist)) + unregister_memory_tier(current_tier); + establish_migration_targets(); out: mutex_unlock(&memory_tier_lock); @@ -550,9 +572,61 @@ default_tier_show(struct device *dev, struct device_attribute *attr, char *buf) } static DEVICE_ATTR_RO(default_tier); +static inline int memtier_alloc_id(void) +{ + return ida_alloc_range(&memtier_dev_id, + MAX_STATIC_MEMORY_TIERS, + MAX_MEMORY_TIERS - 1, GFP_KERNEL); +} + +static ssize_t create_tier_from_rank_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + int ret, tier, rank; + struct memory_tier *memtier; + + ret = kstrtouint(buf, 10, &rank); + if (ret) + return ret; + + if (rank == MEMORY_RANK_HBM_GPU || + rank == MEMORY_RANK_DRAM || + rank == MEMORY_RANK_PMEM) + return -EINVAL; + + mutex_lock(&memory_tier_lock); + /* + * We don't support multiple tiers with same rank value + */ + list_for_each_entry(memtier, 
&memory_tiers, list) { + if (memtier->rank == rank) { + ret = -EINVAL; + goto out; + } + } + tier = memtier_alloc_id(); + if (tier < 0) { + ret = -ENOSPC; + goto out; + } + memtier = register_memory_tier(tier, rank); + if (IS_ERR(memtier)) { + ret = PTR_ERR(memtier); + goto out; + } + + ret = count; +out: + mutex_unlock(&memory_tier_lock); + return ret; +} +static DEVICE_ATTR_WO(create_tier_from_rank); + static struct attribute *memory_tier_attrs[] = { &dev_attr_max_tier.attr, &dev_attr_default_tier.attr, + &dev_attr_create_tier_from_rank.attr, NULL }; From patchwork Fri Jun 10 13:52:25 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12877628 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id E429FC43334 for ; Fri, 10 Jun 2022 13:54:22 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 7BD1D8D00AB; Fri, 10 Jun 2022 09:54:22 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 76C118D009C; Fri, 10 Jun 2022 09:54:22 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 5BF7C8D00AB; Fri, 10 Jun 2022 09:54:22 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id 49E078D009C for ; Fri, 10 Jun 2022 09:54:22 -0400 (EDT) Received: from smtpin25.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay13.hostedemail.com (Postfix) with ESMTP id 2484B6028D for ; Fri, 10 Jun 2022 13:54:22 +0000 (UTC) X-FDA: 79562470764.25.2C334B8 Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by imf08.hostedemail.com (Postfix) with ESMTP id BAB23160077 for ; Fri, 10 
Jun 2022 13:54:21 +0000 (UTC) Received: from pps.filterd (m0098413.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 25ADoGpJ016533; Fri, 10 Jun 2022 13:54:14 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=Kd0wMNzxttpCi5E/zfEkRFMTVjdfKDndNDT6ciUBrfI=; b=S5s/+eH864Oejunle40mFuQNcJCf/kdZp2UTc7RnuzdxKD2l76iqQvMLk9nsPX4ie8ab Fj5DE3hvRh+iw6IRqg2Sb5QVJrIvq+O9eRpC0WpGsbn3menuxy/PtQdfESJXIFbuIaet Ko1GfMBQChBmY/hEJJB7haNCgDNRmLT+oxqL+augQlRxvkarsaGLuPQdBA+wpmITqs2A hYwocUCx8njCHvcmq3MeAjn70WQDPjdamEWXkS5nLBM4WQ0guezBFnzckUj3VydMm7Pl t1KJT3ugnZARgl9QV9AZ0NWglWaz2uHKN970Q+jWjlKEN1iK4vlwB37wsoZKi1jW24qS bg== Received: from pps.reinject (localhost [127.0.0.1]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm6q9gjd1-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:54:14 +0000 Received: from m0098413.ppops.net (m0098413.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 25ADoiOI021768; Fri, 10 Jun 2022 13:54:13 GMT Received: from ppma02wdc.us.ibm.com (aa.5b.37a9.ip4.static.sl-reverse.com [169.55.91.170]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm6q9gjcs-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:54:13 +0000 Received: from pps.filterd (ppma02wdc.us.ibm.com [127.0.0.1]) by ppma02wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 25ADaauJ016202; Fri, 10 Jun 2022 13:54:13 GMT Received: from b03cxnp07028.gho.boulder.ibm.com (b03cxnp07028.gho.boulder.ibm.com [9.17.130.15]) by ppma02wdc.us.ibm.com with ESMTP id 3gfy1a900u-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:54:12 +0000 Received: from b03ledav003.gho.boulder.ibm.com (b03ledav003.gho.boulder.ibm.com [9.17.130.234]) by b03cxnp07028.gho.boulder.ibm.com 
(8.14.9/8.14.9/NCO v10.0) with ESMTP id 25ADsBHE28508624 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 10 Jun 2022 13:54:12 GMT Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id DDBA86A054; Fri, 10 Jun 2022 13:54:11 +0000 (GMT) Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 5A7646A047; Fri, 10 Jun 2022 13:54:03 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.90.151]) by b03ledav003.gho.boulder.ibm.com (Postfix) with ESMTP; Fri, 10 Jun 2022 13:54:02 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K.V" Subject: [PATCH v6 09/13] mm/demotion: Add pg_data_t member to track node memory tier details Date: Fri, 10 Jun 2022 19:22:25 +0530 Message-Id: <20220610135229.182859-10-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> References: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: pp6MvJVh_iMGCNUYlGb2QF7EzkPXNxmh X-Proofpoint-GUID: eXBoSX6qARFNOI30Yq99TzeI299l3-Vm X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-06-10_06,2022-06-09_02,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 suspectscore=0 impostorscore=0 phishscore=0 spamscore=0 mlxlogscore=999 adultscore=0 bulkscore=0 priorityscore=1501 clxscore=1015 lowpriorityscore=0 mlxscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 
definitions=main-2206100056 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1654869261; a=rsa-sha256; cv=none; b=qoWg+EyBvzlAfeuxDzhyqfHkpB6twECn2p7TuDm+GQFqai45eT4bXz2xJ78IX69SvGWCqH rEvd1awCJCyyuC4rvF/xTyzof3K1RzcaBoWsl34zwrnzHXNF1Kz2wKq1cuG0v9R2yFnO0Y hT7UTgFEZAwDY0XnTubl3J9Mhnzckmo= ARC-Authentication-Results: i=1; imf08.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b="S5s/+eH8"; spf=pass (imf08.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1654869261; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=Kd0wMNzxttpCi5E/zfEkRFMTVjdfKDndNDT6ciUBrfI=; b=1z5vHCdtJsIR7+WWI+/YnqA9SuClG+5j/Z6NtMz4qDokNnOrpkJFqIwDooMr/F/JkYjvxK iC02LK8+QIMQYohzA//XXZe6huzhvV7H4Gq48Tu2sHxZ7OJzuZ+X3veTJ0sIfpYPyavWRC o+I300JjGoSlpoEwvWsDvRAujb5/MNQ= X-Rspamd-Server: rspam04 X-Rspamd-Queue-Id: BAB23160077 X-Rspam-User: Authentication-Results: imf08.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b="S5s/+eH8"; spf=pass (imf08.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com X-Stat-Signature: gtpxsdypjnxukdobe9bktsezju3ecxgb X-HE-Tag: 1654869261-226095 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Also update different helpers to use NODE_DATA()->memtier.
Since the node-specific memtier can change when a NUMA node is reassigned to a different memory tier, accessing NODE_DATA()->memtier needs to happen under an RCU read lock or memory_tier_lock. Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 14 +++++ include/linux/mmzone.h | 3 ++ mm/memory-tiers.c | 102 ++++++++++++++++++++++++++--------- 3 files changed, 94 insertions(+), 25 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 52896f5970b7..53f3e4c7cba8 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -6,6 +6,9 @@ #ifdef CONFIG_TIERED_MEMORY +#include +#include + #define MEMORY_TIER_HBM_GPU 0 #define MEMORY_TIER_DRAM 1 #define MEMORY_TIER_PMEM 2 @@ -18,13 +21,24 @@ #define MAX_STATIC_MEMORY_TIERS 3 #define MAX_MEMORY_TIERS (MAX_STATIC_MEMORY_TIERS + 2) +struct memory_tier { + struct list_head list; + struct device dev; + nodemask_t nodelist; + int rank; +}; + extern bool numa_demotion_enabled; int node_create_and_set_memory_tier(int node, int tier); int next_demotion_node(int node); int node_set_memory_tier(int node, int tier); int node_get_memory_tier_id(int node); int node_reset_memory_tier(int node, int tier); +struct memory_tier *node_get_memory_tier(int node); +void node_put_memory_tier(struct memory_tier *memtier); + #else + #define numa_demotion_enabled false static inline int next_demotion_node(int node) { diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index aab70355d64f..c4fcfd2b9980 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -928,6 +928,9 @@ typedef struct pglist_data { /* Per-node vmstats */ struct per_cpu_nodestat __percpu *per_cpu_nodestats; atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS]; +#ifdef CONFIG_TIERED_MEMORY + struct memory_tier *memtier; +#endif } pg_data_t; #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages) diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index de3b7403ae6f..429aa864edb0 100644 ---
a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -1,22 +1,14 @@ // SPDX-License-Identifier: GPL-2.0 #include -#include -#include #include #include #include #include #include +#include #include "internal.h" -struct memory_tier { - struct list_head list; - struct device dev; - nodemask_t nodelist; - int rank; -}; - struct demotion_nodes { nodemask_t preferred; }; @@ -212,13 +204,17 @@ static void unregister_memory_tier(struct memory_tier *memtier) static struct memory_tier *__node_get_memory_tier(int node) { - struct memory_tier *memtier; + pg_data_t *pgdat; - list_for_each_entry(memtier, &memory_tiers, list) { - if (node_isset(node, memtier->nodelist)) - return memtier; - } - return NULL; + pgdat = NODE_DATA(node); + if (!pgdat) + return NULL; + /* + * Since we hold memory_tier_lock, we can avoid + * RCU read locks when accessing the details. No + * parallel updates are possible here. + */ + return pgdat->memtier; } static struct memory_tier *__get_memory_tier_from_id(int id) @@ -232,6 +228,32 @@ static struct memory_tier *__get_memory_tier_from_id(int id) return NULL; } +/* + * Called with memory_tier_lock held. Hence the device references cannot + * be dropped during this function. + */ +static void memtier_node_set(int node, struct memory_tier *memtier) +{ + pg_data_t *pgdat; + struct memory_tier *current_memtier; + + pgdat = NODE_DATA(node); + if (!pgdat) + return; + /* + * Make sure we mark the memtier NULL before we assign the new memory tier + * to the NUMA node. This makes sure that anybody looking at NODE_DATA + * finds a NULL memtier or the one which is still valid.
+ */ + current_memtier = rcu_dereference(pgdat->memtier); + rcu_assign_pointer(pgdat->memtier, NULL); + if (current_memtier) + node_clear(node, current_memtier->nodelist); + synchronize_rcu(); + node_set(node, memtier->nodelist); + rcu_assign_pointer(pgdat->memtier, memtier); +} + static int __node_create_and_set_memory_tier(int node, int tier) { int ret = 0; @@ -252,7 +274,7 @@ static int __node_create_and_set_memory_tier(int node, int tier) goto out; } } - node_set(node, memtier->nodelist); + memtier_node_set(node, memtier); out: return ret; } @@ -274,12 +296,10 @@ int node_create_and_set_memory_tier(int node, int tier) if (current_tier->dev.id == tier) goto out; - node_clear(node, current_tier->nodelist); - ret = __node_create_and_set_memory_tier(node, tier); if (ret) { /* reset it back to older tier */ - node_set(node, current_tier->nodelist); + memtier_node_set(node, current_tier); goto out; } @@ -304,7 +324,7 @@ static int __node_set_memory_tier(int node, int tier) ret = -EINVAL; goto out; } - node_set(node, memtier->nodelist); + memtier_node_set(node, memtier); out: return ret; } @@ -360,12 +380,10 @@ int node_reset_memory_tier(int node, int tier) if (!current_tier || current_tier->dev.id == tier) goto out; - node_clear(node, current_tier->nodelist); - ret = __node_set_memory_tier(node, tier); if (ret) { /* reset it back to older tier */ - node_set(node, current_tier->nodelist); + memtier_node_set(node, current_tier); goto out; } @@ -379,6 +397,34 @@ int node_reset_memory_tier(int node, int tier) return ret; } +/* + * lockless access to memory tier of a NUMA node. 
+ */ +struct memory_tier *node_get_memory_tier(int node) +{ + pg_data_t *pgdat; + struct memory_tier *memtier; + + pgdat = NODE_DATA(node); + if (!pgdat) + return NULL; + + rcu_read_lock(); + memtier = rcu_dereference(pgdat->memtier); + if (!memtier) + goto out; + + get_device(&memtier->dev); +out: + rcu_read_unlock(); + return memtier; +} + +void node_put_memory_tier(struct memory_tier *memtier) +{ + put_device(&memtier->dev); +} + /** * next_demotion_node() - Get the next node in the demotion path * @node: The starting node to lookup the next node @@ -641,7 +687,7 @@ static const struct attribute_group *memory_tier_attr_groups[] = { static int __init memory_tier_init(void) { - int ret; + int ret, node; struct memory_tier *memtier; ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups); @@ -660,7 +706,13 @@ static int __init memory_tier_init(void) __func__, PTR_ERR(memtier)); /* CPU only nodes are not part of memory tiers. */ - memtier->nodelist = node_states[N_MEMORY]; + for_each_node_state(node, N_MEMORY) { + /* + * Should be safe to do this early in the boot. 
+ */ + NODE_DATA(node)->memtier = memtier; + node_set(node, memtier->nodelist); + } migrate_on_reclaim_init(); return 0; From patchwork Fri Jun 10 13:52:26 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12877631 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 57A83C43334 for ; Fri, 10 Jun 2022 13:55:40 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id DB0D88D00AE; Fri, 10 Jun 2022 09:55:39 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id D61738D00AF; Fri, 10 Jun 2022 09:55:39 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id BD9F18D00AE; Fri, 10 Jun 2022 09:55:39 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id A9EA08D009C for ; Fri, 10 Jun 2022 09:55:39 -0400 (EDT) Received: from smtpin21.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 7D19B347B8 for ; Fri, 10 Jun 2022 13:55:39 +0000 (UTC) X-FDA: 79562473998.21.2B0F3BC Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf23.hostedemail.com (Postfix) with ESMTP id D38AA140066 for ; Fri, 10 Jun 2022 13:55:38 +0000 (UTC) Received: from pps.filterd (m0098409.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 25ADSwkM014931; Fri, 10 Jun 2022 13:54:24 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=FpxL9Ln0v69rC5KIlQJwCAYoGwImaAzWCPNQvJ60qXQ=; 
b=fdIxt0zxlEGhWffkJ34cWNH5xr9EFe2+k3oNYZwD4vh2vVfR8c4Y3UjTXcw+pjzuUz/C JTtvzr9lRGFeoYZ2OVgbF887Th06jUI2ci3bBeGZBy64Z7d6l3cL1raqTuaHP/WZokbX R0c0vLEig3nWevFtEoBEOjMNl9V2WbQ+zqqKNmTc7qvLRibRdoGvRKOkc/8HGOnoiCME 3PkiOOA7J0m9nDz58MWRi/VS9F81BhAEQ2qfhXlz4xBFOsDAI6FX2LXVuwoBAq8bIMG9 LWOMkHLq4d2fGWcsB+dd6VrcWT5wahURSiFzqJWVFSGyZzGB4yVLWu5L5kDDW8ePwYyU Rw== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm6vn0hny-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:54:24 +0000 Received: from m0098409.ppops.net (m0098409.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 25ADTVND016419; Fri, 10 Jun 2022 13:54:23 GMT Received: from ppma02dal.us.ibm.com (a.bd.3ea9.ip4.static.sl-reverse.com [169.62.189.10]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm6vn0hnm-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:54:23 +0000 Received: from pps.filterd (ppma02dal.us.ibm.com [127.0.0.1]) by ppma02dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 25ADZUgA014203; Fri, 10 Jun 2022 13:54:22 GMT Received: from b03cxnp08026.gho.boulder.ibm.com (b03cxnp08026.gho.boulder.ibm.com [9.17.130.18]) by ppma02dal.us.ibm.com with ESMTP id 3gfy1bb4nu-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:54:22 +0000 Received: from b03ledav003.gho.boulder.ibm.com (b03ledav003.gho.boulder.ibm.com [9.17.130.234]) by b03cxnp08026.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 25ADsLhV27787614 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 10 Jun 2022 13:54:21 GMT Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 15B9C6A054; Fri, 10 Jun 2022 13:54:21 +0000 (GMT) Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id A75976A051; Fri, 10 Jun 2022 
13:54:12 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.90.151]) by b03ledav003.gho.boulder.ibm.com (Postfix) with ESMTP; Fri, 10 Jun 2022 13:54:12 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K . V" Subject: [PATCH v6 10/13] mm/demotion: Demote pages according to allocation fallback order Date: Fri, 10 Jun 2022 19:22:26 +0530 Message-Id: <20220610135229.182859-11-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> References: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: K2OOSuMIPyM457UvPbCNfWNYppAmgXji X-Proofpoint-ORIG-GUID: uusJ_FgopAssqf6frHcE5ZCvfQ4kDUV8 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-06-10_06,2022-06-09_02,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1015 mlxlogscore=999 mlxscore=0 adultscore=0 spamscore=0 priorityscore=1501 phishscore=0 bulkscore=0 suspectscore=0 impostorscore=0 lowpriorityscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2206100056 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1654869339; a=rsa-sha256; cv=none; b=YGrZwkes68TqS4WijP92puTYpbgMLIULNZn8NYkMQwjhF/ZphLkBl/7cG8UKKIIzkMJP7A UumubfCW1T8E54n5P5V+EEgg9/hzYNZVn7kVMOFLVFgAuZrtWP38cJG9KlBImHFz06QpXZ Bz9qy5AxwrWrliaFtwJpkJkhNMl3X/U= ARC-Authentication-Results: i=1; imf23.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=fdIxt0zx; dmarc=pass (policy=none) 
header.from=ibm.com; spf=pass (imf23.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1654869339; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=FpxL9Ln0v69rC5KIlQJwCAYoGwImaAzWCPNQvJ60qXQ=; b=3qM13IS4RY8acyi2Oo+g6ekngKCGmDAefFPaYg/+hnzDsgrRiH7Zp5R/dCU2Pgw1HBOoOX xKbMo0azKRIZwA9TvojtOVgSj2U++OdF6qUPAsx97SWH2bW4lcWMv0C1tqfiEzJ/NH9rPH N0Ka0Yvn4cUsgHxkOz7oPftEqDHrnHg= X-Rspamd-Queue-Id: D38AA140066 X-Rspam-User: Authentication-Results: imf23.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=fdIxt0zx; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf23.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Stat-Signature: jhz7m5yrmnmfcna534x13gy5ewkshr9n X-Rspamd-Server: rspam02 X-HE-Tag: 1654869338-543506 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Jagdish Gediya Currently, a higher tier node can only be demoted to selected nodes on the next lower tier as defined by the demotion path. This strict, hard-coded demotion order does not work in all use cases (e.g. some use cases may want to allow cross-socket demotion to another node in the same demotion tier as a fallback when the preferred demotion node is out of space). 
This demotion order is also inconsistent with the page allocation fallback order when all the nodes in a higher tier are out of space: The page allocation can fall back to any node from any lower tier, whereas the demotion order doesn't allow that currently. This patch adds support to get all the allowed demotion targets for a memory tier. demote_page_list() function is now modified to utilize this allowed node mask as the fallback allocation mask. Signed-off-by: Jagdish Gediya Signed-off-by: Aneesh Kumar K.V move allowed mask to memory tier --- include/linux/memory-tiers.h | 9 ++++- mm/memory-tiers.c | 75 +++++++++++++++++++++++++++++++++--- mm/vmscan.c | 56 ++++++++++++++++++++------- 3 files changed, 120 insertions(+), 20 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 53f3e4c7cba8..47841379553c 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -3,11 +3,12 @@ #define _LINUX_MEMORY_TIERS_H #include +#include +#include #ifdef CONFIG_TIERED_MEMORY #include -#include #define MEMORY_TIER_HBM_GPU 0 #define MEMORY_TIER_DRAM 1 @@ -25,6 +26,7 @@ struct memory_tier { struct list_head list; struct device dev; nodemask_t nodelist; + nodemask_t lower_tier_mask; int rank; }; @@ -36,6 +38,7 @@ int node_get_memory_tier_id(int node); int node_reset_memory_tier(int node, int tier); struct memory_tier *node_get_memory_tier(int node); void node_put_memory_tier(struct memory_tier *memtier); +void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets); #else @@ -45,6 +48,10 @@ static inline int next_demotion_node(int node) return NUMA_NO_NODE; } +static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) +{ + *targets = NODE_MASK_NONE; +} #endif /* CONFIG_TIERED_MEMORY */ #endif diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 429aa864edb0..b2ed16dcfb03 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -425,6 +425,24 @@ void node_put_memory_tier(struct 
memory_tier *memtier) put_device(&memtier->dev); } +void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) +{ + struct memory_tier *memtier; + + /* + * pg_data_t.memtier updates include a synchronize_rcu() + * which ensures that we either find NULL or a valid memtier + * in NODE_DATA. Protect the access via rcu_read_lock(). + */ + rcu_read_lock(); + memtier = rcu_dereference(pgdat->memtier); + if (memtier) + *targets = memtier->lower_tier_mask; + else + *targets = NODE_MASK_NONE; + rcu_read_unlock(); +} + /** * next_demotion_node() - Get the next node in the demotion path * @node: The starting node to lookup the next node @@ -473,10 +491,18 @@ int next_demotion_node(int node) /* Disable reclaim-based migration. */ static void __disable_all_migrate_targets(void) { + pg_data_t *pgdat; int node; - for_each_node_state(node, N_MEMORY) + for_each_node_state(node, N_MEMORY) { node_demotion[node].preferred = NODE_MASK_NONE; + /* + * We are holding memory_tier_lock, so it is safe + * to access pgdat->memtier. + */ + pgdat = NODE_DATA(node); + pgdat->memtier->lower_tier_mask = NODE_MASK_NONE; + } } static void disable_all_migrate_targets(void) @@ -503,10 +529,26 @@ static void establish_migration_targets(void) struct demotion_nodes *nd; int target = NUMA_NO_NODE, node; int distance, best_distance; - nodemask_t used; - - if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION)) - return; + nodemask_t used, lower_tier = NODE_MASK_NONE; + + if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION)) { + pg_data_t *pgdat; + + for_each_node_state(node, N_MEMORY) { + /* + * We are holding memory_tier_lock, so it is safe + * to access pgdat->memtier.
+ */ + pgdat = NODE_DATA(node); + pgdat->memtier->lower_tier_mask = NODE_MASK_NONE; + } + /* + * Wait for read side to work with old values + * or see the updated NODE_MASK_NONE. + */ + synchronize_rcu(); + goto build_lower_tier_mask; + } disable_all_migrate_targets(); @@ -549,6 +591,29 @@ static void establish_migration_targets(void) } } while (1); } +build_lower_tier_mask: + /* + * Now build the lower_tier mask for each node collecting node mask from + * all memory tier below it. This allows us to fallback demotion page + * allocation to a set of nodes that is closer to the above selected + * preferred node. + */ + list_for_each_entry(memtier, &memory_tiers, list) + nodes_or(lower_tier, lower_tier, memtier->nodelist); + /* + * Remove nodes not yet in N_MEMORY. + */ + nodes_and(lower_tier, node_states[N_MEMORY], lower_tier); + + list_for_each_entry(memtier, &memory_tiers, list) { + /* + * Keep removing the current tier from the lower_tier nodes. + * This will remove all nodes in the current and above + * memory tiers from the lower_tier mask. + */ + nodes_andnot(lower_tier, lower_tier, memtier->nodelist); + memtier->lower_tier_mask = lower_tier; + } } /* diff --git a/mm/vmscan.c b/mm/vmscan.c index 3a8f78277f99..2b213248effa 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1460,19 +1460,32 @@ static void folio_check_dirty_writeback(struct folio *folio, mapping->a_ops->is_dirty_writeback(folio, dirty, writeback); } -static struct page *alloc_demote_page(struct page *page, unsigned long node) +static struct page *alloc_demote_page(struct page *page, unsigned long private) { - struct migration_target_control mtc = { - /* - * Allocate from 'node', or fail quickly and quietly. - * When this happens, 'page' will likely just be discarded - * instead of migrated.
- */ - .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | - __GFP_THISNODE | __GFP_NOWARN | - __GFP_NOMEMALLOC | GFP_NOWAIT, - .nid = node - }; + struct page *target_page; + nodemask_t *allowed_mask; + struct migration_target_control *mtc; + + mtc = (struct migration_target_control *)private; + + allowed_mask = mtc->nmask; + /* + * Make sure we allocate from the target node first, also trying to + * reclaim pages from the target node via kswapd if we are low on + * free memory on target node. If we don't do this and if we have low + * free memory on the target memtier, we would start allocating pages + * from higher memory tiers without even forcing a demotion of cold + * pages from the target memtier. This can result in the kernel placing + * hot pages in higher memory tiers. + */ + mtc->nmask = NULL; + mtc->gfp_mask |= __GFP_THISNODE; + target_page = alloc_migration_target(page, (unsigned long)mtc); + if (target_page) + return target_page; + + mtc->gfp_mask &= ~__GFP_THISNODE; + mtc->nmask = allowed_mask; - return alloc_migration_target(page, (unsigned long)&mtc); + return alloc_migration_target(page, (unsigned long)mtc); } @@ -1487,6 +1500,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages, { int target_nid = next_demotion_node(pgdat->node_id); unsigned int nr_succeeded; + nodemask_t allowed_mask; + + struct migration_target_control mtc = { + /* + * Allocate from 'node', or fail quickly and quietly. + * When this happens, 'page' will likely just be discarded + * instead of migrated.
+ */ + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | + __GFP_NOMEMALLOC | GFP_NOWAIT, + .nid = target_nid, + .nmask = &allowed_mask + }; if (list_empty(demote_pages)) return 0; @@ -1494,10 +1520,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages, if (target_nid == NUMA_NO_NODE) return 0; + node_get_allowed_targets(pgdat, &allowed_mask); + /* Demotion ignores all cpuset and mempolicy settings */ migrate_pages(demote_pages, alloc_demote_page, NULL, - target_nid, MIGRATE_ASYNC, MR_DEMOTION, - &nr_succeeded); + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, + &nr_succeeded); if (current_is_kswapd()) __count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded); From patchwork Fri Jun 10 13:52:27 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12877633 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C4907C43334 for ; Fri, 10 Jun 2022 13:56:06 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 5E8D08D00B0; Fri, 10 Jun 2022 09:56:06 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 597D28D009C; Fri, 10 Jun 2022 09:56:06 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 438D78D00B0; Fri, 10 Jun 2022 09:56:06 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 30D1D8D009C for ; Fri, 10 Jun 2022 09:56:06 -0400 (EDT) Received: from smtpin15.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 0C6436044C for ; Fri, 10 Jun 2022 13:56:06 +0000 (UTC) X-FDA: 79562475132.15.2715534 Received: from 
mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by imf21.hostedemail.com (Postfix) with ESMTP id AAB4C1C0033 for ; Fri, 10 Jun 2022 13:56:05 +0000 (UTC) Received: from pps.filterd (m0098413.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 25ADoGHn016506; Fri, 10 Jun 2022 13:54:32 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=n8T98L+a+X/tcAsJS6uEHxWawMoJAPBeztGnL9oF2oE=; b=mIaxfN/JRkBdPEFjqnnk/tfnWCAoiMJvmcySLXjIF0LV7Q+8NLHZtPlbnlYDOphFAAX6 YgNRF63G1qymg/Rfs+Va3qOjgI6AOlM68dsKqsVuhOmTtNQK75ioku8bU7V6f9SAW+P6 dEC1APNpnQc7buFFgLPfyIpYEpeSvAUFnCGd4Kvqkr1TIcecSx0sKXgo0icFzAo7AuQe 99gD4YDHeuzX5/0Kf48KStMynGgs3pX+f/s6A9HsZWf+svB0mI0hUVWNBquXuuZRLIDC hIfrHIzMMzm1hvBSDp4e56SQR0y/ThKOdMYnLA4rQ/zvgPW39FfVS6C6QvF3eNxov+1p tQ== Received: from pps.reinject (localhost [127.0.0.1]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm6q9gjjx-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:54:32 +0000 Received: from m0098413.ppops.net (m0098413.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 25ADpn5p023398; Fri, 10 Jun 2022 13:54:31 GMT Received: from ppma03wdc.us.ibm.com (ba.79.3fa9.ip4.static.sl-reverse.com [169.63.121.186]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm6q9gjjm-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:54:31 +0000 Received: from pps.filterd (ppma03wdc.us.ibm.com [127.0.0.1]) by ppma03wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 25ADaaaS001826; Fri, 10 Jun 2022 13:54:30 GMT Received: from b03cxnp08027.gho.boulder.ibm.com (b03cxnp08027.gho.boulder.ibm.com [9.17.130.19]) by ppma03wdc.us.ibm.com with ESMTP id 3gfy1a935k-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:54:30 
+0000 Received: from b03ledav003.gho.boulder.ibm.com (b03ledav003.gho.boulder.ibm.com [9.17.130.234]) by b03cxnp08027.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 25ADsTZ37340482 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 10 Jun 2022 13:54:29 GMT Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id A81C16A054; Fri, 10 Jun 2022 13:54:29 +0000 (GMT) Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id DCB666A04D; Fri, 10 Jun 2022 13:54:21 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.90.151]) by b03ledav003.gho.boulder.ibm.com (Postfix) with ESMTP; Fri, 10 Jun 2022 13:54:21 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K.V" Subject: [PATCH v6 11/13] mm/demotion: Update node_is_toptier to work with memory tiers Date: Fri, 10 Jun 2022 19:22:27 +0530 Message-Id: <20220610135229.182859-12-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> References: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: IfJBOki0lthPdqTpnEtpfbx5SEZ4TnTF X-Proofpoint-GUID: 0RXvcVzRo6lHsOe6QLXM6lpJ-Z_kEAP6 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-06-10_06,2022-06-09_02,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 suspectscore=0 impostorscore=0 phishscore=0 spamscore=0 mlxlogscore=999 adultscore=0 bulkscore=0 priorityscore=1501 
clxscore=1015 lowpriorityscore=0 mlxscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2206100056 ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1654869365; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=n8T98L+a+X/tcAsJS6uEHxWawMoJAPBeztGnL9oF2oE=; b=XlDLR+nrTmiBDfVlGp1VE1RQGNskzjI/IUvx4zeTtDnFbQGZzvv+HUD/7o4UYPc2B9hO2G nWCegzcbpQanujadEZM+dKcoYe3jwDCy8l/GkntdLihv086PRjhamXttLq4j+vc+nO2eXI lQCAQvHqvbYgwpVA76RPwJNiupGzAS8= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1654869365; a=rsa-sha256; cv=none; b=AP75P5tpEUeGyx6jOor45vZ50iLt/6uam6dV2ey9B5Z1DDt2VilMLx9BIE6EXxLe9HkRom Vj8tUdNbzPmh8PaDMHbADiFZR1Mz/Z5pO7tLW6ZszMuLtggpxZ76cxBFks2HZ/MognhNdK Tnryb5166+A4ZLcmhmvs/UZjZPN5YNc= ARC-Authentication-Results: i=1; imf21.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b="mIaxfN/J"; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf21.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspam-User: Authentication-Results: imf21.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b="mIaxfN/J"; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf21.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspamd-Server: rspam03 X-Stat-Signature: anarcf5iwmuirenr5bzrhhhicykqq5cp X-Rspamd-Queue-Id: AAB4C1C0033 X-HE-Tag: 1654869365-808238 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: With memory tiers support we can have memory only 
NUMA nodes in the top tier from which we want to avoid promotion tracking NUMA faults. Update node_is_toptier() to work with memory tiers. All NUMA nodes are by default top tier nodes. With lower memory tiers added, we consider all memory tiers above a memory tier having CPU NUMA nodes as top memory tiers. Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 6 +++++ include/linux/node.h | 5 ---- mm/huge_memory.c | 1 + mm/memory-tiers.c | 44 ++++++++++++++++++++++++++++++++++-- mm/migrate.c | 1 + mm/mprotect.c | 1 + 6 files changed, 51 insertions(+), 7 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 47841379553c..de4098f6d5d5 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -39,6 +39,7 @@ int node_reset_memory_tier(int node, int tier); struct memory_tier *node_get_memory_tier(int node); void node_put_memory_tier(struct memory_tier *memtier); void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets); +bool node_is_toptier(int node); #else @@ -52,6 +53,11 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target { *targets = NODE_MASK_NONE; } + +static inline bool node_is_toptier(int node) +{ + return true; +} #endif /* CONFIG_TIERED_MEMORY */ #endif diff --git a/include/linux/node.h b/include/linux/node.h index 40d641a8bfb0..9ec680dd607f 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -185,9 +185,4 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg, #define to_node(device) container_of(device, struct node, dev) -static inline bool node_is_toptier(int node) -{ - return node_state(node, N_CPU); -} - #endif /* _LINUX_NODE_H_ */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index a77c78a2b6b5..294873d4be2b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index
b2ed16dcfb03..0dae3114e22c 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -17,7 +17,7 @@ struct demotion_nodes { static void establish_migration_targets(void); static DEFINE_MUTEX(memory_tier_lock); static LIST_HEAD(memory_tiers); - +static int top_tier_rank; /* * node_demotion[] examples: * @@ -126,7 +126,7 @@ static void memory_tier_device_release(struct device *dev) if (tier->dev.id >= MAX_STATIC_MEMORY_TIERS) ida_free(&memtier_dev_id, tier->dev.id); - kfree(tier); + kfree_rcu(tier); } /* @@ -443,6 +443,31 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) rcu_read_unlock(); } +bool node_is_toptier(int node) +{ + bool toptier; + pg_data_t *pgdat; + struct memory_tier *memtier; + + pgdat = NODE_DATA(node); + if (!pgdat) + return false; + + rcu_read_lock(); + memtier = rcu_dereference(pgdat->memtier); + if (!memtier) { + toptier = true; + goto out; + } + if (memtier->rank >= top_tier_rank) + toptier = true; + else + toptier = false; +out: + rcu_read_unlock(); + return toptier; +} + /** * next_demotion_node() - Get the next node in the demotion path * @node: The starting node to lookup the next node @@ -592,6 +617,21 @@ static void establish_migration_targets(void) } while (1); } build_lower_tier_mask: + /* + * Promotion is allowed from a memory tier to a higher + * memory tier only if the memory tier doesn't include + * compute. We want to skip promotion from a memory tier + * if any node that is part of the memory tier has CPUs. + * Once we detect such a memory tier, we consider that tier + * as the top tier from which promotion is not allowed. + */ + list_for_each_entry_reverse(memtier, &memory_tiers, list) { + nodes_and(used, node_states[N_CPU], memtier->nodelist); + if (!nodes_empty(used)) { + top_tier_rank = memtier->rank; + break; + } + } /* * Now build the lower_tier mask for each node collecting node mask from * all memory tier below it.
This allows us to fallback demotion page diff --git a/mm/migrate.c b/mm/migrate.c index 0b554625a219..78615c48fc0f 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -50,6 +50,7 @@ #include #include #include +#include #include diff --git a/mm/mprotect.c b/mm/mprotect.c index ba5592655ee3..92a2fc0fa88b 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -31,6 +31,7 @@ #include #include #include +#include #include #include #include From patchwork Fri Jun 10 13:52:28 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12877629 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id E616CC433EF for ; Fri, 10 Jun 2022 13:54:51 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 80A8E8D00AC; Fri, 10 Jun 2022 09:54:51 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 7BB028D009C; Fri, 10 Jun 2022 09:54:51 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 6348A8D00AC; Fri, 10 Jun 2022 09:54:51 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 52AD98D009C for ; Fri, 10 Jun 2022 09:54:51 -0400 (EDT) Received: from smtpin25.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 1FD8631A41 for ; Fri, 10 Jun 2022 13:54:51 +0000 (UTC) X-FDA: 79562471982.25.79939C8 Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by imf29.hostedemail.com (Postfix) with ESMTP id 6089712007A for ; Fri, 10 Jun 2022 13:54:50 +0000 (UTC) Received: from pps.filterd (m0127361.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com 
(8.17.1.5/8.17.1.5) with ESMTP id 25ADDUbi026292; Fri, 10 Jun 2022 13:54:42 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=fbAp1PAXOP3RY42dNgrds/22uMoY8Gsa/tj4j5uKz/c=; b=kx1zOXSIq8yK8iHL2mJFFpD0WQVbcaRbOg+ezIcp7qEEfVOnb3PdQC2G2vyDWPRAQHEN 0KBkjgUIS0LZQeQeQal2EMffa/mdzouWIts5CuZWPtosUgEApcB9I/52Eux8Enls4+3H lBVAqpsw+yjADLEB/t4j+Gzv6vZLLC8fC+e+BIWjJKQbBcEHiEzqY89KScx9f8YVhaX8 rnE2teL1LoeSkOZv7UuPf+44zofvTBQdFC7dAGsVHUHeL3e8nKqYvonMDzMkW4oNwhjt ZnEusQoTKUgGWPOo8PGC14XGCo47aX6dujPAyLtg/elCswyrtJkmWPAzAOwRS+x97sbd Zw== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm6n6grnm-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:54:42 +0000 Received: from m0127361.ppops.net (m0127361.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 25ADjgxT036039; Fri, 10 Jun 2022 13:54:42 GMT Received: from ppma01dal.us.ibm.com (83.d6.3fa9.ip4.static.sl-reverse.com [169.63.214.131]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm6n6grn9-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:54:41 +0000 Received: from pps.filterd (ppma01dal.us.ibm.com [127.0.0.1]) by ppma01dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 25ADalU6022295; Fri, 10 Jun 2022 13:54:40 GMT Received: from b03cxnp07028.gho.boulder.ibm.com (b03cxnp07028.gho.boulder.ibm.com [9.17.130.15]) by ppma01dal.us.ibm.com with ESMTP id 3gfy1au75a-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:54:40 +0000 Received: from b03ledav003.gho.boulder.ibm.com (b03ledav003.gho.boulder.ibm.com [9.17.130.234]) by b03cxnp07028.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 25ADsdJY33423754 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 
verify=OK); Fri, 10 Jun 2022 13:54:39 GMT Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id E7DED6A057; Fri, 10 Jun 2022 13:54:38 +0000 (GMT) Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 7B62F6A04D; Fri, 10 Jun 2022 13:54:30 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.90.151]) by b03ledav003.gho.boulder.ibm.com (Postfix) with ESMTP; Fri, 10 Jun 2022 13:54:30 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K . V" Subject: [PATCH v6 12/13] mm/demotion: Add documentation for memory tiering Date: Fri, 10 Jun 2022 19:22:28 +0530 Message-Id: <20220610135229.182859-13-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> References: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: bPmUNp0Hp710LfCV2ZiXuO7EM0WxH_kN X-Proofpoint-GUID: 416FCzobTHHuQ9O6CtPI5utxMJcthDjx X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-06-10_06,2022-06-09_02,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 clxscore=1015 lowpriorityscore=0 mlxlogscore=999 suspectscore=0 phishscore=0 mlxscore=0 adultscore=0 malwarescore=0 bulkscore=0 impostorscore=0 priorityscore=1501 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2206100056 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1654869290; a=rsa-sha256; cv=none; 
b=iD//BPAVlzgdos6djp70TVfRXPuSX7WJkXFhtoM+pd31jZu/vqs3CakoeUVgfc0srM3X35 hyEgJpE3Y9tmYjWCR7b2xpFssvKoicNBy8e4Jt03qmdnK0/zYTuwYjlYhZUVue+m7GAmSh CrcjeKubOsa6ZruA07iR3HvMpQFWX/w= ARC-Authentication-Results: i=1; imf29.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=kx1zOXSI; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf29.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1654869290; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=fbAp1PAXOP3RY42dNgrds/22uMoY8Gsa/tj4j5uKz/c=; b=8MKH+MasyccQkaFfsqhqsn7AxMTxmYp1p00ymVodWPPyfUYzhJaNZ8gABMDW8sguS35/7s qKM5XPF+cfarDwct/kEYqVwPBnY9WabGnsu8kCJh7nW/BqYpFwxIXgPuhrinyzhCM/6Geq 1c9/cK4hF6KhKz7ZczGaTPy6i6QwzOU= X-Stat-Signature: fshrydfizk199miut4dnx1hkrcir5z3m X-Rspamd-Queue-Id: 6089712007A X-Rspam-User: X-Rspamd-Server: rspam10 Authentication-Results: imf29.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=kx1zOXSI; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf29.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-HE-Tag: 1654869290-383411 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Jagdish Gediya All N_MEMORY nodes are divided into 3 memory tiers with rank values MEMORY_RANK_HBM_GPU, MEMORY_RANK_DRAM and MEMORY_RANK_PMEM. By default, all nodes are assigned to the memory tier with rank value MEMORY_RANK_DRAM.
The demotion path for all N_MEMORY nodes is prepared based on the rank values of the memory tiers. This patch adds documentation introducing memory tiering, its sysfs interfaces and how demotion is performed based on memory tiers. Suggested-by: Wei Xu Signed-off-by: Jagdish Gediya Signed-off-by: Aneesh Kumar K.V --- Documentation/admin-guide/mm/index.rst | 1 + .../admin-guide/mm/memory-tiering.rst | 181 ++++++++++++++++++ 2 files changed, 182 insertions(+) create mode 100644 Documentation/admin-guide/mm/memory-tiering.rst diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index c21b5823f126..3f211cbca8c3 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -32,6 +32,7 @@ the Linux memory management. idle_page_tracking ksm memory-hotplug + memory-tiering nommu-mmap numa_memory_policy numaperf diff --git a/Documentation/admin-guide/mm/memory-tiering.rst b/Documentation/admin-guide/mm/memory-tiering.rst new file mode 100644 index 000000000000..c2c7e95c1098 --- /dev/null +++ b/Documentation/admin-guide/mm/memory-tiering.rst @@ -0,0 +1,181 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. _admin_guide_memory_tiering: + +============ +Memory tiers +============ + +This document describes explicit memory tiering support along with +demotion based on memory tiers. + +Introduction +============ + +Many systems have multiple types of memory devices, e.g. GPU, DRAM and +PMEM. The memory subsystem of these systems can be called a memory +tiering system because the performance of the different types of +memory is different. Memory tiers are defined based on the hardware +capabilities of memory nodes. Each memory tier is assigned a rank +value that determines the memory tier's position in the demotion order. + +The memory tier assignment of each node is independent of the +others. Moving a node from one tier to another tier doesn't affect +the tier assignment of any other node.
+ +Memory tiers are used to build the demotion targets for nodes: a node +can demote its pages to any node in any lower tier. + +Memory tier rank +================= + +Memory nodes are divided into the 3 types of memory tiers below, with rank values +assigned based on their hardware characteristics. + +MEMORY_RANK_HBM_GPU +MEMORY_RANK_DRAM +MEMORY_RANK_PMEM + +Memory tiers initialization and (re)assignments +=============================================== + +By default, all nodes are assigned to the memory tier with default rank +DEFAULT_MEMORY_RANK which is 1 (MEMORY_RANK_DRAM). The memory tier of a +memory node can be modified either through sysfs or from a driver. On +hotplug, the memory tier with the default rank is assigned to the memory node. + +Sysfs interfaces +================ + +Nodes belonging to a specific tier can be read from +/sys/devices/system/memtier/memtierN/nodelist (Read-Only) + +Where N is 0 - 2. + +Example 1: +For a system where node 0 is a CPU + DRAM node, node 1 is an HBM node and +node 2 is a PMEM node, an ideal tier layout will be + +$ cat /sys/devices/system/memtier/memtier0/nodelist +1 +$ cat /sys/devices/system/memtier/memtier1/nodelist +0 +$ cat /sys/devices/system/memtier/memtier2/nodelist +2 + +Example 2: +For a system where nodes 0 & 1 are CPU + DRAM nodes and nodes 2 & 3 are PMEM +nodes. + +$ cat /sys/devices/system/memtier/memtier0/nodelist +cat: /sys/devices/system/memtier/memtier0/nodelist: No such file or +directory +$ cat /sys/devices/system/memtier/memtier1/nodelist +0-1 +$ cat /sys/devices/system/memtier/memtier2/nodelist +2-3 + +The default memory tier can be read from +/sys/devices/system/memtier/default_tier (Read-Only) + +e.g. +$ cat /sys/devices/system/memtier/default_tier +memtier1 + +The max memory tier can be read from +/sys/devices/system/memtier/max_tier (Read-Only) + +e.g.
+$ cat /sys/devices/system/memtier/max_tier +3 + +An individual node's memory tier can be read or set using +/sys/devices/system/node/nodeN/memtier (Read-Write) + +where N = node id + +When this interface is written, the node is moved from its old memory tier +to the new memory tier and demotion targets for all N_MEMORY nodes are +built again. + +For example 1 mentioned above, +$ cat /sys/devices/system/node/node0/memtier +1 +$ cat /sys/devices/system/node/node1/memtier +0 +$ cat /sys/devices/system/node/node2/memtier +2 + +Creation of memory tiers from userspace +/sys/devices/system/memtier/create_tier_from_rank (Read-Write) + +Additional memory tiers can be created by writing a rank value to this file. +This creates a new memory tier with the specified rank value and an empty nodelist. + +Demotion +======== + +In a system with DRAM and persistent memory, once DRAM +fills up, reclaim will start and some of the DRAM contents will be +thrown out even if there is space in persistent memory. +Consequently allocations will, at some point, start falling over to the slower +persistent memory. + +That has two nasty properties. First, the newer allocations can end up in +the slower persistent memory. Second, reclaimed data in DRAM are just +discarded even if there are gobs of space in persistent memory that could +be used. + +Instead of a page being discarded during reclaim, it can be moved to +persistent memory. Allowing page migration during reclaim enables +these systems to migrate pages from fast (higher) tiers to slow (lower) +tiers when the fast (higher) tier is under pressure. + + +Enable/Disable demotion +----------------------- + +By default demotion is disabled; it can be enabled/disabled using +the sysfs interface below, + +$ echo 0/1 or false/true > /sys/kernel/mm/numa/demotion_enabled + +Preferred and allowed demotion nodes +------------------------------------ + +Preferred nodes for a specific N_MEMORY node are the best nodes +from the next possible lower memory tier.
Allowed nodes for any +node are all the nodes available in all possible lower memory +tiers. + +Example: + +For a system where nodes 0 & 1 are CPU + DRAM nodes and nodes 2 & 3 are PMEM +nodes, + +node distances: +node 0 1 2 3 + 0 10 20 30 40 + 1 20 10 40 30 + 2 30 40 10 40 + 3 40 30 40 10 + +memory_tiers[0] = +memory_tiers[1] = 0-1 +memory_tiers[2] = 2-3 + +node_demotion[0].preferred = 2 +node_demotion[0].allowed = 2, 3 +node_demotion[1].preferred = 3 +node_demotion[1].allowed = 3, 2 +node_demotion[2].preferred = +node_demotion[2].allowed = +node_demotion[3].preferred = +node_demotion[3].allowed = + +Memory allocation for demotion +------------------------------ + +If a page needs to be demoted from a node, the kernel first tries +to allocate a new page from the node's preferred node and falls back to +the node's allowed targets in allocation fallback order. From patchwork Fri Jun 10 13:52:29 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12877632 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9DA1DC433EF for ; Fri, 10 Jun 2022 13:55:49 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 2DB448D00AF; Fri, 10 Jun 2022 09:55:49 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 28BBF8D009C; Fri, 10 Jun 2022 09:55:49 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 104C98D00AF; Fri, 10 Jun 2022 09:55:49 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 002368D009C for ; Fri, 10 Jun 2022 09:55:48 -0400 (EDT) Received: from smtpin16.hostedemail.com (a10.router.float.18 [10.200.18.1]) by
unirelay11.hostedemail.com (Postfix) with ESMTP id D1A91811F3 for ; Fri, 10 Jun 2022 13:55:48 +0000 (UTC) X-FDA: 79562474376.16.E29E880 Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by imf07.hostedemail.com (Postfix) with ESMTP id 74E6540005 for ; Fri, 10 Jun 2022 13:55:48 +0000 (UTC) Received: from pps.filterd (m0098413.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 25ADoHxW016649; Fri, 10 Jun 2022 13:55:41 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=a7bDqNWFTJyacCiGWS/s5MOUukAxREUpgwMRPd1dTmg=; b=gmx3Xm3F/q8k/GeF3AEMUEoM3pq1O0S6TerX3Vr15BOvghnr4Y+wYZIUVzp2vX6mVw5i 6rbAVQRbhoVgtIOo1W7JEzAuLc7VEriDN+SxtYaQu/OVJV9kM6Q+5ZhFMWWw0LP3KCUi J2ghLkvWRClg17GMTraJpfBj77FjhCd2dS1l1X7XeyOCJtUZiKX9z9Xu4MRW6qfGKQZX +V4t4gN7N1c2SSaHpHS+sP9kM/oeLJ+ElKvTstwBJioDjMmEZE61gHBa8TVlJvv2bcXx sK1983zDNb6Pm0Hrv3vohfYQixUIfzJcSMQid9EidUH6eZzMDZMvusXKQcKhwMfcDn4I Cw== Received: from pps.reinject (localhost [127.0.0.1]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm6q9gkaf-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:55:36 +0000 Received: from m0098413.ppops.net (m0098413.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 25ADqjab026257; Fri, 10 Jun 2022 13:55:35 GMT Received: from ppma04wdc.us.ibm.com (1a.90.2fa9.ip4.static.sl-reverse.com [169.47.144.26]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3gm6q9gk9q-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:55:35 +0000 Received: from pps.filterd (ppma04wdc.us.ibm.com [127.0.0.1]) by ppma04wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 25ADZXLr004853; Fri, 10 Jun 2022 13:54:49 GMT Received: from b03cxnp08025.gho.boulder.ibm.com (b03cxnp08025.gho.boulder.ibm.com [9.17.130.17]) by 
ppma04wdc.us.ibm.com with ESMTP id 3gfy1a922m-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 10 Jun 2022 13:54:49 +0000 Received: from b03ledav003.gho.boulder.ibm.com (b03ledav003.gho.boulder.ibm.com [9.17.130.234]) by b03cxnp08025.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 25ADsmFT36438402 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 10 Jun 2022 13:54:48 GMT Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 900806A04D; Fri, 10 Jun 2022 13:54:48 +0000 (GMT) Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 975486A051; Fri, 10 Jun 2022 13:54:39 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.90.151]) by b03ledav003.gho.boulder.ibm.com (Postfix) with ESMTP; Fri, 10 Jun 2022 13:54:39 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K.V" Subject: [PATCH v6 13/13] mm/demotion: Add sysfs ABI documentation Date: Fri, 10 Jun 2022 19:22:29 +0530 Message-Id: <20220610135229.182859-14-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> References: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: 96LtkCGvMN52ZkQucdAutDgdVy8ShBGH X-Proofpoint-GUID: Togfug4tun8xnidrVKBf_HDl7swp2yS7 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-06-10_06,2022-06-09_02,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound 
score=0 suspectscore=0 impostorscore=0 phishscore=0 spamscore=0 mlxlogscore=999 adultscore=0 bulkscore=0 priorityscore=1501 clxscore=1015 lowpriorityscore=0 mlxscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2206100056 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1654869348; a=rsa-sha256; cv=none; b=3bPx7Y70HaZgzGtjhtb0t5KUPIUNrT8BSD05Bv4mRiGhYswkzfDM/+PyzNjo3molNHMScN gcXFETy78J+7qD6xwu1Sv04L79MH5617Wq9RN5HGihXB9/U/I0TcxwEi8qtSJ/3fKxFLes 387yOm5/eTFUL9+6ECQxzlTYF1t+fuQ= ARC-Authentication-Results: i=1; imf07.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=gmx3Xm3F; spf=pass (imf07.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1654869348; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=a7bDqNWFTJyacCiGWS/s5MOUukAxREUpgwMRPd1dTmg=; b=QITSzXBdAvvgFWLKcy8iTFwKWE5RLufhW7pAnBMNPWjqmSQKkPZNji4zm0flXDduzz+QOk BcrJNNCu/beI/8gTLol1aacJiASYSkEnfATbAmjQFno5YB5lBEisyIo4tx2IaOl2P/LRKW cPDg7BDdUl5Ct9bbQWhxC1xYonmw8xY= X-Rspamd-Server: rspam04 X-Rspamd-Queue-Id: 74E6540005 X-Rspam-User: Authentication-Results: imf07.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=gmx3Xm3F; spf=pass (imf07.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com X-Stat-Signature: 6i4zs9nfu4yr4sw3p8pk7cyhyunceycr X-HE-Tag: 1654869348-552653 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: 
owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Add sysfs ABI documentation. Signed-off-by: Wei Xu Signed-off-by: Aneesh Kumar K.V --- .../ABI/testing/sysfs-kernel-mm-memory-tiers | 87 +++++++++++++++++++ 1 file changed, 87 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers b/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers new file mode 100644 index 000000000000..b41d2977b0a5 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers @@ -0,0 +1,87 @@ +What: /sys/devices/system/memtier/ +Date: June 2022 +Contact: Linux memory management mailing list +Description: Interface for tiered memory + + This is the directory containing the information about memory tiers. + + Each memory tier has its own subdirectory. + + The order of memory tiers is determined by their rank values, not by + their memtier device names. A higher rank value means a higher tier. + +What: /sys/devices/system/memtier/default_tier +Date: June 2022 +Contact: Linux memory management mailing list +Description: Default memory tier + + The default memory tier to which memory would get added via hotplug + if the NUMA node is not part of any memory tier + +What: /sys/devices/system/memtier/max_tier +Date: June 2022 +Contact: Linux memory management mailing list +Description: Maximum number of memory tiers supported + + The max memory tier device ID we can create. Users can create memory + tiers in range [0 - max_tier) + +What: /sys/devices/system/memtier/create_tier_from_rank +Date: June 2022 +Contact: Linux memory management mailing list +Description: Interface to create memory tiers from userspace + + Writing to this file with a rank value results in creation of + a new memory tier with the specified rank value. This is used + by userspace to create new memory tiers. 
+
+What:		/sys/devices/system/memtier/memtierN/
+Date:		June 2022
+Contact:	Linux memory management mailing list
+Description:	Directory with details of a specific memory tier
+
+		This is the directory containing the information about a particular
+		memory tier, memtierN, where N is the memtier device ID (e.g. 0, 1).
+
+		The memtier device ID number itself is just an identifier and has no
+		special meaning, i.e. memtier device ID numbers do not determine the
+		order of memory tiers.
+
+What:		/sys/devices/system/memtier/memtierN/rank
+Date:		June 2022
+Contact:	Linux memory management mailing list
+Description:	Memory tier rank value
+
+
+		When read, list the "rank" value associated with memtierN.
+
+		"Rank" is an opaque value. Its absolute value doesn't have any
+		special meaning. But the rank values of different memtiers can be
+		compared with each other to determine the memory tier order.
+
+		For example, if we have 3 memtiers: memtier0, memtier1, memtier2, and
+		their rank values are 100, 10, 50, then the memory tier order is:
+		memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
+		and memtier1 is the lowest tier.
+
+		The rank value of each memtier should be unique.
+
+What:		/sys/devices/system/memtier/memtierN/nodelist
+Date:		June 2022
+Contact:	Linux memory management mailing list
+Description:	Memory tier nodelist
+
+
+		When read, list the memory nodes in the specified tier.
+
+What:		/sys/devices/system/node/nodeN/memtier
+Date:		June 2022
+Contact:	Linux memory management mailing list
+Description:	Memory tier details for node N
+
+		When read, list the device ID of the memory tier that the node belongs
+		to. Its value is empty for a CPU-only NUMA node.
+
+		When written, the kernel moves the node into the specified memory
+		tier if the move is allowed. The tier assignments of all other
+		nodes are not affected.
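
Illustration only, not part of the patch: the rank ordering rule described
for memtierN/rank can be sketched in userspace as follows. This is a minimal
sketch that assumes the sysfs layout proposed in this series
(<base>/memtierN/rank holding one integer); the helper names read_ranks and
tier_order are hypothetical, and the interface may not exist in this form on
a running kernel.

```python
from pathlib import Path


def read_ranks(base):
    """Map each memtierN directory under `base` to its integer rank.

    `base` is assumed to follow the layout proposed in this series,
    i.e. <base>/memtierN/rank; this helper is hypothetical.
    """
    ranks = {}
    for tier_dir in Path(base).glob("memtier[0-9]*"):
        rank_file = tier_dir / "rank"
        if rank_file.is_file():
            ranks[tier_dir.name] = int(rank_file.read_text().strip())
    return ranks


def tier_order(ranks):
    """Return memtier names from highest tier to lowest.

    A higher rank value means a higher tier; device IDs play no role.
    """
    return sorted(ranks, key=lambda name: ranks[name], reverse=True)


if __name__ == "__main__":
    # Rank values from the example in the ABI text: 100, 10, 50.
    example = {"memtier0": 100, "memtier1": 10, "memtier2": 50}
    print(" -> ".join(tier_order(example)))
```

With the rank values from the ABI example this prints
memtier0 -> memtier2 -> memtier1. On a kernel carrying this series, the
ranks would presumably be collected with
read_ranks("/sys/devices/system/memtier").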