From patchwork Fri Aug 12 05:57:00 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12941914
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
 Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
 Jonathan Cameron, Alistair Popple, Dan Williams, Johannes Weiner,
 jvgediya.oss@gmail.com, Bharata B Rao, "Aneesh Kumar K.V"
Subject: [PATCH v14 01/10] mm/demotion: Add support for explicit memory tiers
Date: Fri, 12 Aug 2022 11:27:00 +0530
Message-Id: <20220812055710.357820-2-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.37.1
In-Reply-To: <20220812055710.357820-1-aneesh.kumar@linux.ibm.com>
References: <20220812055710.357820-1-aneesh.kumar@linux.ibm.com>
Sender: owner-linux-mm@kvack.org
Precedence: bulk

In the current kernel, memory tiers are defined implicitly via a demotion
path relationship between NUMA nodes, which is created during kernel
initialization and updated when a NUMA node is hot-added or hot-removed.
The current implementation puts all nodes with CPUs into the highest tier
and builds the tier hierarchy by establishing the per-node demotion
targets based on the distances between nodes.

This current memory tier kernel implementation needs to be improved for
several important use cases:

The current tier initialization code always initializes each memory-only
NUMA node into a lower tier. But a memory-only NUMA node may have a
high performance memory device (e.g.
a DRAM-backed memory-only node on a virtual machine) that should be put
into a higher tier.

The current tier hierarchy always puts CPU nodes into the top tier. But on
a system with HBM or GPU devices, the memory-only NUMA nodes mapping these
devices should be in the top tier, and DRAM nodes with CPUs are better
placed into the next lower tier.

With the current kernel, a higher tier node can only be demoted to the
nodes with the shortest distance on the next lower tier, as defined by the
demotion path, not to any other node from any lower tier. This strict
demotion order does not work in all use cases (e.g. some use cases may
want to allow cross-socket demotion to another node in the same demotion
tier as a fallback when the preferred demotion node is out of space).
This demotion order is also inconsistent with the page allocation
fallback order when all the nodes in a higher tier are out of space: the
page allocation can fall back to any node from any lower tier, whereas
the demotion order doesn't allow that.

This patch series addresses the above by defining memory tiers explicitly.

The Linux kernel presents memory devices as NUMA nodes, and each memory
device is of a specific type. The memory type of a device is represented
by its abstract distance. A memory tier corresponds to a range of abstract
distances. This allows for classifying memory devices with a specific
performance range into a memory tier.

This patch configures the range/chunk size to be 128. The default DRAM
abstract distance is 512. We can have 4 memory tiers below the default
DRAM abstract distance, with abstract distance ranges 0 - 127, 128 - 255,
256 - 383, and 384 - 511. Faster memory devices can be placed in these
faster (higher) memory tiers. Slower memory devices like persistent memory
will have an abstract distance higher than the default DRAM level.
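
As a rough illustration of the range arithmetic described above, here is a
standalone userspace sketch. It is not part of this patch: tier_start(),
tier_end() and check_tier_ranges() are hypothetical helpers introduced
only for illustration, while the three macros mirror the ones this patch
adds to include/linux/memory-tiers.h. Rounding an abstract distance down
to a multiple of the chunk size gives the start of the tier that covers it:

```c
#include <assert.h>

/* Mirrors the macros this patch adds to include/linux/memory-tiers.h. */
#define MEMTIER_CHUNK_BITS	7
#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
#define MEMTIER_ADISTANCE_DRAM	(4 * MEMTIER_CHUNK_SIZE)

/*
 * Hypothetical helper (not in the patch): start of the tier covering a
 * non-negative abstract distance, i.e. the distance rounded down to a
 * multiple of the (power-of-two) chunk size.
 */
static int tier_start(int adistance)
{
	return adistance & ~(MEMTIER_CHUNK_SIZE - 1);
}

/* Hypothetical helper: last abstract distance covered by the same tier. */
static int tier_end(int adistance)
{
	return tier_start(adistance) + MEMTIER_CHUNK_SIZE - 1;
}

/* Checks the four ranges below default DRAM listed in the text above. */
static void check_tier_ranges(void)
{
	assert(MEMTIER_ADISTANCE_DRAM == 512);
	assert(tier_start(0) == 0 && tier_end(0) == 127);
	assert(tier_start(128) == 128 && tier_end(128) == 255);
	assert(tier_start(256) == 256 && tier_end(256) == 383);
	assert(tier_start(511) == 384 && tier_end(511) == 511);
	/* Default DRAM starts its own tier, 512 - 639. */
	assert(tier_start(MEMTIER_ADISTANCE_DRAM) == 512);
}
```

Under these assumptions, a device with abstract distance 300 would share
the 256 - 383 tier, and a slower device at 700 would land in a tier
starting at 640, below the default DRAM tier.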

Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h |  15 ++++
 mm/Makefile                  |   1 +
 mm/memory-tiers.c            | 129 +++++++++++++++++++++++++++++++++++
 3 files changed, 145 insertions(+)
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
new file mode 100644
index 000000000000..bc7c1b799bef
--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+/*
+ * Each tier cover a abstrace distance chunk size of 128
+ */
+#define MEMTIER_CHUNK_BITS	7
+#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
+/*
+ * Smaller abstract distance value imply faster(higher) memory tiers.
+ */
+#define MEMTIER_ADISTANCE_DRAM	(4 * MEMTIER_CHUNK_SIZE)
+
+#endif /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/Makefile b/mm/Makefile
index 9a564f836403..488f604e77e0 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST) += memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
new file mode 100644
index 000000000000..1f494e69776a
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+#include
+#include
+
+struct memory_tier {
+	/* hierarchy of memory tiers */
+	struct list_head list;
+	/* list of all memory types part of this tier */
+	struct list_head memory_types;
+	/*
+	 * start value of abstract distance. memory tier maps
+	 * an abstract distance range,
+	 * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE
+	 */
+	int adistance_start;
+};
+
+struct memory_dev_type {
+	/* list of memory types that are part of same tier as this type */
+	struct list_head tier_sibiling;
+	/* abstract distance for this specific memory type */
+	int adistance;
+	/* Nodes of same abstract distance */
+	nodemask_t nodes;
+	struct memory_tier *memtier;
+};
+
+static DEFINE_MUTEX(memory_tier_lock);
+static LIST_HEAD(memory_tiers);
+static struct memory_dev_type *node_memory_types[MAX_NUMNODES];
+/*
+ * For now we can have 4 faster memory tiers with smaller adistance
+ * than default DRAM tier.
+ */
+static struct memory_dev_type default_dram_type = {
+	.adistance = MEMTIER_ADISTANCE_DRAM,
+	.tier_sibiling = LIST_HEAD_INIT(default_dram_type.tier_sibiling),
+};
+
+static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
+{
+	bool found_slot = false;
+	struct memory_tier *memtier, *new_memtier;
+	int adistance = memtype->adistance;
+	unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	/*
+	 * If the memtype is already part of a memory tier,
+	 * just return that.
+	 */
+	if (memtype->memtier)
+		return memtype->memtier;
+
+	adistance = round_down(adistance, memtier_adistance_chunk_size);
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		if (adistance == memtier->adistance_start) {
+			memtype->memtier = memtier;
+			list_add(&memtype->tier_sibiling, &memtier->memory_types);
+			return memtier;
+		} else if (adistance < memtier->adistance_start) {
+			found_slot = true;
+			break;
+		}
+	}
+
+	new_memtier = kmalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!new_memtier)
+		return ERR_PTR(-ENOMEM);
+
+	new_memtier->adistance_start = adistance;
+	INIT_LIST_HEAD(&new_memtier->list);
+	INIT_LIST_HEAD(&new_memtier->memory_types);
+	if (found_slot)
+		list_add_tail(&new_memtier->list, &memtier->list);
+	else
+		list_add_tail(&new_memtier->list, &memory_tiers);
+	memtype->memtier = new_memtier;
+	list_add(&memtype->tier_sibiling, &new_memtier->memory_types);
+	return new_memtier;
+}
+
+static struct memory_tier *set_node_memory_tier(int node)
+{
+	struct memory_tier *memtier;
+	struct memory_dev_type *memtype;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	if (!node_state(node, N_MEMORY))
+		return ERR_PTR(-EINVAL);
+
+	if (!node_memory_types[node])
+		node_memory_types[node] = &default_dram_type;
+
+	memtype = node_memory_types[node];
+	node_set(node, memtype->nodes);
+	memtier = find_create_memory_tier(memtype);
+	return memtier;
+}
+
+static int __init memory_tier_init(void)
+{
+	int node;
+	struct memory_tier *memtier;
+
+	mutex_lock(&memory_tier_lock);
+	/*
+	 * Look at all the existing N_MEMORY nodes and add them to
+	 * default memory tier or to a tier if we already have memory
+	 * types assigned.
+	 */
+	for_each_node_state(node, N_MEMORY) {
+		memtier = set_node_memory_tier(node);
+		if (IS_ERR(memtier))
+			/*
+			 * Continue with memtiers we are able to setup
+			 */
+			break;
+	}
+	mutex_unlock(&memory_tier_lock);
+
+	return 0;
+}
+subsys_initcall(memory_tier_init);

From patchwork Fri Aug 12 05:57:01 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12941912
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
 Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
 Jonathan Cameron, Alistair Popple, Dan Williams, Johannes Weiner,
 jvgediya.oss@gmail.com, Bharata B Rao, "Aneesh Kumar K.V"
Subject: [PATCH v14 02/10] mm/demotion: Move memory demotion related code
Date: Fri, 12 Aug 2022 11:27:01 +0530
Message-Id: <20220812055710.357820-3-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.37.1
In-Reply-To: <20220812055710.357820-1-aneesh.kumar@linux.ibm.com>
References: <20220812055710.357820-1-aneesh.kumar@linux.ibm.com>
Sender: owner-linux-mm@kvack.org
Precedence: bulk

This moves the memory demotion related code to mm/memory-tiers.c.
No functional change in this patch.
Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 8 +++++ include/linux/migrate.h | 2 -- mm/memory-tiers.c | 64 ++++++++++++++++++++++++++++++++++++ mm/migrate.c | 60 +-------------------------------- mm/vmscan.c | 1 + 5 files changed, 74 insertions(+), 61 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index bc7c1b799bef..9fdd9572fdf9 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -12,4 +12,12 @@ */ #define MEMTIER_ADISTANCE_DRAM (4 * MEMTIER_CHUNK_SIZE) +#ifdef CONFIG_NUMA +#include +extern bool numa_demotion_enabled; + +#else + +#define numa_demotion_enabled false +#endif /* CONFIG_NUMA */ #endif /* _LINUX_MEMORY_TIERS_H */ diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 22c0a0cf5e0c..96f8c84413fe 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -103,7 +103,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA) extern void set_migration_target_nodes(void); extern void migrate_on_reclaim_init(void); -extern bool numa_demotion_enabled; extern int next_demotion_node(int node); #else static inline void set_migration_target_nodes(void) {} @@ -112,7 +111,6 @@ static inline int next_demotion_node(int node) { return NUMA_NO_NODE; } -#define numa_demotion_enabled false #endif #ifdef CONFIG_COMPACTION diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 1f494e69776a..f3dc3318d931 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -3,6 +3,8 @@ #include #include #include +#include +#include #include struct memory_tier { @@ -127,3 +129,65 @@ static int __init memory_tier_init(void) return 0; } subsys_initcall(memory_tier_init); + +bool numa_demotion_enabled = false; + +#ifdef CONFIG_MIGRATION +#ifdef CONFIG_SYSFS +static ssize_t numa_demotion_enabled_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%s\n", 
+ numa_demotion_enabled ? "true" : "false"); +} + +static ssize_t numa_demotion_enabled_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + ssize_t ret; + + ret = kstrtobool(buf, &numa_demotion_enabled); + if (ret) + return ret; + + return count; +} + +static struct kobj_attribute numa_demotion_enabled_attr = + __ATTR(demotion_enabled, 0644, numa_demotion_enabled_show, + numa_demotion_enabled_store); + +static struct attribute *numa_attrs[] = { + &numa_demotion_enabled_attr.attr, + NULL, +}; + +static const struct attribute_group numa_attr_group = { + .attrs = numa_attrs, +}; + +static int __init numa_init_sysfs(void) +{ + int err; + struct kobject *numa_kobj; + + numa_kobj = kobject_create_and_add("numa", mm_kobj); + if (!numa_kobj) { + pr_err("failed to create numa kobject\n"); + return -ENOMEM; + } + err = sysfs_create_group(numa_kobj, &numa_attr_group); + if (err) { + pr_err("failed to register numa group\n"); + goto delete_obj; + } + return 0; + +delete_obj: + kobject_put(numa_kobj); + return err; +} +subsys_initcall(numa_init_sysfs); +#endif /* CONFIG_SYSFS */ +#endif diff --git a/mm/migrate.c b/mm/migrate.c index 6a1597c92261..5d7fb417edbf 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2562,64 +2562,6 @@ void __init migrate_on_reclaim_init(void) set_migration_target_nodes(); cpus_read_unlock(); } +#endif /* CONFIG_NUMA */ -bool numa_demotion_enabled = false; - -#ifdef CONFIG_SYSFS -static ssize_t numa_demotion_enabled_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - return sysfs_emit(buf, "%s\n", - numa_demotion_enabled ? 
"true" : "false"); -} - -static ssize_t numa_demotion_enabled_store(struct kobject *kobj, - struct kobj_attribute *attr, - const char *buf, size_t count) -{ - ssize_t ret; - - ret = kstrtobool(buf, &numa_demotion_enabled); - if (ret) - return ret; - - return count; -} - -static struct kobj_attribute numa_demotion_enabled_attr = - __ATTR(demotion_enabled, 0644, numa_demotion_enabled_show, - numa_demotion_enabled_store); - -static struct attribute *numa_attrs[] = { - &numa_demotion_enabled_attr.attr, - NULL, -}; - -static const struct attribute_group numa_attr_group = { - .attrs = numa_attrs, -}; - -static int __init numa_init_sysfs(void) -{ - int err; - struct kobject *numa_kobj; - numa_kobj = kobject_create_and_add("numa", mm_kobj); - if (!numa_kobj) { - pr_err("failed to create numa kobject\n"); - return -ENOMEM; - } - err = sysfs_create_group(numa_kobj, &numa_attr_group); - if (err) { - pr_err("failed to register numa group\n"); - goto delete_obj; - } - return 0; - -delete_obj: - kobject_put(numa_kobj); - return err; -} -subsys_initcall(numa_init_sysfs); -#endif /* CONFIG_SYSFS */ -#endif /* CONFIG_NUMA */ diff --git a/mm/vmscan.c b/mm/vmscan.c index b2b1431352dc..224de380ac88 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -49,6 +49,7 @@ #include #include #include +#include #include #include From patchwork Fri Aug 12 05:57:02 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12941913 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1956CC282E7 for ; Fri, 12 Aug 2022 05:57:56 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 8B3DD8E0005; Fri, 12 Aug 2022 01:57:55 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 83E2B8E0001; Fri, 12 Aug 2022 01:57:55 -0400 
(EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 61B838E0005; Fri, 12 Aug 2022 01:57:55 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 511788E0001 for ; Fri, 12 Aug 2022 01:57:55 -0400 (EDT) Received: from smtpin02.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 6DA5A1C6ECF for ; Fri, 12 Aug 2022 05:57:54 +0000 (UTC) X-FDA: 79789884468.02.D4C1D79 Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by imf10.hostedemail.com (Postfix) with ESMTP id E9838C0188 for ; Fri, 12 Aug 2022 05:57:53 +0000 (UTC) Received: from pps.filterd (m0098419.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 27C5j8OU015534; Fri, 12 Aug 2022 05:57:44 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=nW3GgvEjDYwVNglSmbqBQ4XObxtKCvF3A9NIZtVz+ZA=; b=mEI7ptY3EwtLVOmRJYeTT9eCyntxPQl7fBQjzFEDVFrEm5tpvkS0WH016XNjrfiXS/4m hxrHFhSyUV7xy3s3yw8IWqRkeetOxujT0IGTg1n5LHnzc1k0QFaIPw+VxHfRsAhRXOvG BZA/gs5eQLwwPFaYkcFvZAXUxnfNc7XdnJ64AqYxboniBjxoqOxU3BFsu6H3DXtnyQWM yLRn2JKqLEvoFviJC5OCkMVa4yW2hrN1MaMicXeND1ORfoR79U85F175qHII58mSaQpz vbSgBNFmRcMMtOIyfn1KII1cVJ4hlZRnG9sThnTlI8dM2uG09Q01JnI18iQWk990HU6E pQ== Received: from pps.reinject (localhost [127.0.0.1]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3hwh05897m-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 12 Aug 2022 05:57:44 +0000 Received: from m0098419.ppops.net (m0098419.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 27C5pcZc010930; Fri, 12 Aug 2022 05:57:43 GMT Received: from ppma01wdc.us.ibm.com 
(fd.55.37a9.ip4.static.sl-reverse.com [169.55.85.253]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3hwh058976-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 12 Aug 2022 05:57:43 +0000 Received: from pps.filterd (ppma01wdc.us.ibm.com [127.0.0.1]) by ppma01wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 27C5Zx0g029144; Fri, 12 Aug 2022 05:57:42 GMT Received: from b03cxnp08027.gho.boulder.ibm.com (b03cxnp08027.gho.boulder.ibm.com [9.17.130.19]) by ppma01wdc.us.ibm.com with ESMTP id 3huwvqqf4w-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 12 Aug 2022 05:57:42 +0000 Received: from b03ledav004.gho.boulder.ibm.com (b03ledav004.gho.boulder.ibm.com [9.17.130.235]) by b03cxnp08027.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 27C5vfpL40305066 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 12 Aug 2022 05:57:41 GMT Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 9FF2D78064; Fri, 12 Aug 2022 05:57:41 +0000 (GMT) Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 27E6A7805E; Fri, 12 Aug 2022 05:57:36 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.116.179]) by b03ledav004.gho.boulder.ibm.com (Postfix) with ESMTP; Fri, 12 Aug 2022 05:57:35 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, Bharata B Rao , "Aneesh Kumar K.V" Subject: [PATCH v14 03/10] mm/demotion: Add hotplug callbacks to handle new numa node onlined Date: Fri, 12 Aug 2022 11:27:02 +0530 Message-Id: <20220812055710.357820-4-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.1 In-Reply-To: 
<20220812055710.357820-1-aneesh.kumar@linux.ibm.com> References: <20220812055710.357820-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: zb4r_fvljeXydlg66XRY-dQoVDUkNAUJ X-Proofpoint-GUID: 3aVFs9fNWbqvaAl5zUgX2_6xmozWdORk X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-12_04,2022-08-11_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 impostorscore=0 phishscore=0 lowpriorityscore=0 clxscore=1015 mlxscore=0 bulkscore=0 priorityscore=1501 adultscore=0 suspectscore=0 spamscore=0 mlxlogscore=999 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2207270000 definitions=main-2208120015 ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1660283874; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=nW3GgvEjDYwVNglSmbqBQ4XObxtKCvF3A9NIZtVz+ZA=; b=XRLoPa2hKSGZbBJ41EZE8CJfQMbGCfuVx4PzDbWgr7728+JxCvs8T97eeYjY/20YIEvF5L AatI5I/BjNrj8qG9SZyhnqTnMGq6g6hzfMx0WeK7xDUgARv4nNRizsT6ki/ebion2NIlzQ exGsqBWg/0sZWd1Z6mck4UOZc93/vl8= ARC-Authentication-Results: i=1; imf10.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=mEI7ptY3; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf10.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1660283874; a=rsa-sha256; cv=none; b=ehwCnTe1/UIvjNB3pdgFiXEIMYwx5331UKj99Z58CGaEz95YvpexTnnDihQt6mUDD7BA42 xlic40wWV9wVaiSw9nNgneoYWIgKbxhoI2+S4vhbgnUCkLEA2fS5r43UOD0aKd1CQfm1QX DB9tgB9IE0PtQv7QPzencESZg5g7AIQ= X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: E9838C0188 
X-Rspam-User: X-Stat-Signature: nhmgbr3do1mwrzjhfxdy6dfhn1w9xsj9 Authentication-Results: imf10.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=mEI7ptY3; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf10.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-HE-Tag: 1660283873-11081 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: If a newly onlined NUMA node doesn't have an abstract distance assigned, the kernel adds the NUMA node to the default memory tier. Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 1 + mm/memory-tiers.c | 68 ++++++++++++++++++++++++++++++++++++ 2 files changed, 69 insertions(+) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 9fdd9572fdf9..cc89876899a6 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -11,6 +11,7 @@ * Smaller abstract distance value imply faster(higher) memory tiers. 
*/ #define MEMTIER_ADISTANCE_DRAM (4 * MEMTIER_CHUNK_SIZE) +#define MEMTIER_HOTPLUG_PRIO 100 #ifdef CONFIG_NUMA #include diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index f3dc3318d931..05f05395468a 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -5,6 +5,7 @@ #include #include #include +#include #include struct memory_tier { @@ -105,6 +106,72 @@ static struct memory_tier *set_node_memory_tier(int node) return memtier; } +static struct memory_tier *__node_get_memory_tier(int node) +{ + struct memory_dev_type *memtype; + + memtype = node_memory_types[node]; + if (memtype && node_isset(node, memtype->nodes)) + return memtype->memtier; + return NULL; +} + +static void destroy_memory_tier(struct memory_tier *memtier) +{ + list_del(&memtier->list); + kfree(memtier); +} + +static bool clear_node_memory_tier(int node) +{ + bool cleared = false; + struct memory_tier *memtier; + + memtier = __node_get_memory_tier(node); + if (memtier) { + struct memory_dev_type *memtype; + + memtype = node_memory_types[node]; + node_clear(node, memtype->nodes); + if (nodes_empty(memtype->nodes)) { + list_del(&memtype->tier_sibiling); + memtype->memtier = NULL; + if (list_empty(&memtier->memory_types)) + destroy_memory_tier(memtier); + } + cleared = true; + } + return cleared; +} + +static int __meminit memtier_hotplug_callback(struct notifier_block *self, + unsigned long action, void *_arg) +{ + struct memory_notify *arg = _arg; + + /* + * Only update the node migration order when a node is + * changing status, like online->offline. 
+ */ + if (arg->status_change_nid < 0) + return notifier_from_errno(0); + + switch (action) { + case MEM_OFFLINE: + mutex_lock(&memory_tier_lock); + clear_node_memory_tier(arg->status_change_nid); + mutex_unlock(&memory_tier_lock); + break; + case MEM_ONLINE: + mutex_lock(&memory_tier_lock); + set_node_memory_tier(arg->status_change_nid); + mutex_unlock(&memory_tier_lock); + break; + } + + return notifier_from_errno(0); +} + static int __init memory_tier_init(void) { int node; @@ -126,6 +193,7 @@ static int __init memory_tier_init(void) } mutex_unlock(&memory_tier_lock); + hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRIO); return 0; } subsys_initcall(memory_tier_init); From patchwork Fri Aug 12 05:57:03 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12941915 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id ECB78C00140 for ; Fri, 12 Aug 2022 05:58:14 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 88B6E8E0007; Fri, 12 Aug 2022 01:58:14 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 8145B8E0001; Fri, 12 Aug 2022 01:58:14 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 68DDF8E0007; Fri, 12 Aug 2022 01:58:14 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 582D98E0001 for ; Fri, 12 Aug 2022 01:58:14 -0400 (EDT) Received: from smtpin29.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id 1B3AF141A44 for ; Fri, 12 Aug 2022 05:58:14 +0000 (UTC) X-FDA: 79789885308.29.DDDC478 Received: from 
mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by imf04.hostedemail.com (Postfix) with ESMTP id A791940182 for ; Fri, 12 Aug 2022 05:58:13 +0000 (UTC) Received: from pps.filterd (m0098421.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 27C5kiZ2017849; Fri, 12 Aug 2022 05:57:50 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=e4xek80MAEoZ3xcv9+JTaXI9AFJduVTHtY7Vpz3D/kQ=; b=WuCFWxAoQYsk020J7Blkk2AIQTr11RvJ5vL0LmU25DsQiNx0liqd1q+ShZ5p+LLeKJN2 lzAK0BGDu2JSb7u5q1ymFNsgNjFsIAtsonQSmqhgkKK7wKfrYQN8U/tkhfESEVXi3/zG EyAzo3AqP00t9yauDBSlbtYYZb6IqtaxMUrIFdFhHkDJSbw3fID1ybYOewHt9n9tGW5q 6DgJsycZYZfRLVVwKJPVwPV78awW4zkZyqukK1BHjDidPeom0yOEEPjHmMNGxGYEYhJL Raj/3hM32J7HvlUgSuUMZbJtcNy1JLQ+Cgx1UgSwt0pTko+M97PoESgGKJpMuzpG4Oha Fg== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3hwh0rr7a9-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 12 Aug 2022 05:57:49 +0000 Received: from m0098421.ppops.net (m0098421.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 27C5vnDo025340; Fri, 12 Aug 2022 05:57:49 GMT Received: from ppma01wdc.us.ibm.com (fd.55.37a9.ip4.static.sl-reverse.com [169.55.85.253]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3hwh0rr7a2-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 12 Aug 2022 05:57:49 +0000 Received: from pps.filterd (ppma01wdc.us.ibm.com [127.0.0.1]) by ppma01wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 27C5Zx0h029144; Fri, 12 Aug 2022 05:57:48 GMT Received: from b03cxnp08025.gho.boulder.ibm.com (b03cxnp08025.gho.boulder.ibm.com [9.17.130.17]) by ppma01wdc.us.ibm.com with ESMTP id 3huwvqqf57-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 12 Aug 2022 05:57:48 
+0000 Received: from b03ledav004.gho.boulder.ibm.com (b03ledav004.gho.boulder.ibm.com [9.17.130.235]) by b03cxnp08025.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 27C5vlrP52101562 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 12 Aug 2022 05:57:47 GMT Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id B2F717805C; Fri, 12 Aug 2022 05:57:47 +0000 (GMT) Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 2F58B7805E; Fri, 12 Aug 2022 05:57:42 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.116.179]) by b03ledav004.gho.boulder.ibm.com (Postfix) with ESMTP; Fri, 12 Aug 2022 05:57:41 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, Bharata B Rao , "Aneesh Kumar K.V" Subject: [PATCH v14 04/10] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE Date: Fri, 12 Aug 2022 11:27:03 +0530 Message-Id: <20220812055710.357820-5-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.1 In-Reply-To: <20220812055710.357820-1-aneesh.kumar@linux.ibm.com> References: <20220812055710.357820-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: nyrLuR72e4jHmp6XvZkr4LJWXAoRQhmK X-Proofpoint-ORIG-GUID: GLUJdk83DgjjHh04OLZnEbTFRdVIJSMH X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-12_04,2022-08-11_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 impostorscore=0 malwarescore=0 bulkscore=0 adultscore=0 lowpriorityscore=0 clxscore=1015 mlxscore=0 mlxlogscore=999 spamscore=0 
priorityscore=1501 phishscore=0 suspectscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2207270000 definitions=main-2208120015 ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1660283893; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=e4xek80MAEoZ3xcv9+JTaXI9AFJduVTHtY7Vpz3D/kQ=; b=CMVDKdMnLaVzd8pDnrguF5KlWdoJmhuMS6jm3G7JFOf5YMHpMaiP089QVZ1M0PyE0CK/CX ZyoO9wb82EFBDv8CB+J3hxW/eUv6Ci79IKNawve1QiXtkfuO/crHDIoQJcPaxNlmM5O/PX Al4vlwrrzer3+2yJyDzgvalq97TTFp4= ARC-Authentication-Results: i=1; imf04.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=WuCFWxAo; spf=pass (imf04.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1660283893; a=rsa-sha256; cv=none; b=XdGNaL1iKSfxSlZ9/MlKlX8XOeK/U4dG0cv5Yj7HJZf7rq8/PcNzFP13jzzKkTt/jqOCdo SLVh+3YS4vc2Z0yFskpk7lAFHqpKb8uXswE0Gq5CVPJhc8HmzyIpF5248q+ijusAuJ/kiz 3I1GiyicNSZagerrI+VzS0QETmelZ4s= X-Stat-Signature: eyyrqrxawu4fgiydwmndr31gc955rqtg X-Rspamd-Queue-Id: A791940182 Authentication-Results: imf04.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=WuCFWxAo; spf=pass (imf04.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com X-Rspam-User: X-Rspamd-Server: rspam06 X-HE-Tag: 1660283893-398085 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: By default, all nodes are assigned to the default memory tier which is 
the memory tier designated for nodes with DRAM. Set the dax kmem device node's tier to a slower memory tier by setting its abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE. Low-level drivers like papr_scm or ACPI NFIT can initialize the memory device type to a more accurate value based on device tree details or HMAT. If the kernel doesn't find the memory type initialized, a default slower memory type is assigned by the kmem driver. Signed-off-by: Aneesh Kumar K.V --- drivers/dax/kmem.c | 42 +++++++++++++++-- include/linux/memory-tiers.h | 42 ++++++++++++++++- mm/memory-tiers.c | 91 +++++++++++++++++++++++++++--------- 3 files changed, 149 insertions(+), 26 deletions(-) diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index a37622060fff..d88814f1c414 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -11,9 +11,17 @@ #include #include #include +#include #include "dax-private.h" #include "bus.h" +/* + * Default abstract distance assigned to the NUMA node onlined + * by DAX/kmem if the low level platform driver didn't initialize + * one for this NUMA node. + */ +#define MEMTIER_DEFAULT_DAX_ADISTANCE (MEMTIER_ADISTANCE_DRAM * 2) + /* Memory resource name used for add_memory_driver_managed(). */ static const char *kmem_name; /* Set if any memory will remain added when the driver will be unloaded. 
*/ @@ -41,6 +49,7 @@ struct dax_kmem_data { struct resource *res[]; }; +static struct memory_dev_type *dax_slowmem_type; static int dev_dax_kmem_probe(struct dev_dax *dev_dax) { struct device *dev = &dev_dax->dev; @@ -79,11 +88,13 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) return -EINVAL; } + init_node_memory_type(numa_node, dax_slowmem_type); + + rc = -ENOMEM; data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL); if (!data) - return -ENOMEM; + goto err_dax_kmem_data; - rc = -ENOMEM; data->res_name = kstrdup(dev_name(dev), GFP_KERNEL); if (!data->res_name) goto err_res_name; @@ -155,6 +166,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) kfree(data->res_name); err_res_name: kfree(data); +err_dax_kmem_data: + clear_node_memory_type(numa_node, dax_slowmem_type); return rc; } @@ -162,6 +175,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) static void dev_dax_kmem_remove(struct dev_dax *dev_dax) { int i, success = 0; + int node = dev_dax->target_node; struct device *dev = &dev_dax->dev; struct dax_kmem_data *data = dev_get_drvdata(dev); @@ -198,6 +212,14 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax) kfree(data->res_name); kfree(data); dev_set_drvdata(dev, NULL); + /* + * Clear the memtype association on successful unplug. + * If not, we have memory blocks left which can be + * offlined/onlined later. We need to keep memory_dev_type + * for that. This implies this reference will be around + * till next reboot. 
+ */ + clear_node_memory_type(node, dax_slowmem_type); } } #else @@ -228,9 +250,22 @@ static int __init dax_kmem_init(void) if (!kmem_name) return -ENOMEM; + dax_slowmem_type = alloc_memory_type(MEMTIER_DEFAULT_DAX_ADISTANCE); + if (IS_ERR(dax_slowmem_type)) { + rc = PTR_ERR(dax_slowmem_type); + goto err_dax_slowmem_type; + } + rc = dax_driver_register(&device_dax_kmem_driver); if (rc) - kfree_const(kmem_name); + goto error_dax_driver; + + return rc; + +error_dax_driver: + destroy_memory_type(dax_slowmem_type); +err_dax_slowmem_type: + kfree_const(kmem_name); return rc; } @@ -239,6 +274,7 @@ static void __exit dax_kmem_exit(void) dax_driver_unregister(&device_dax_kmem_driver); if (!any_hotremove_failed) kfree_const(kmem_name); + destroy_memory_type(dax_slowmem_type); } MODULE_AUTHOR("Intel Corporation"); diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index cc89876899a6..0c739508517a 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -2,6 +2,9 @@ #ifndef _LINUX_MEMORY_TIERS_H #define _LINUX_MEMORY_TIERS_H +#include +#include +#include /* * Each tier cover a abstrace distance chunk size of 128 */ @@ -13,12 +16,49 @@ #define MEMTIER_ADISTANCE_DRAM (4 * MEMTIER_CHUNK_SIZE) #define MEMTIER_HOTPLUG_PRIO 100 +struct memory_tier; +struct memory_dev_type { + /* list of memory types that are part of same tier as this type */ + struct list_head tier_sibiling; + /* abstract distance for this specific memory type */ + int adistance; + /* Nodes of same abstract distance */ + nodemask_t nodes; + struct kref kref; + struct memory_tier *memtier; +}; + #ifdef CONFIG_NUMA -#include extern bool numa_demotion_enabled; +struct memory_dev_type *alloc_memory_type(int adistance); +void destroy_memory_type(struct memory_dev_type *memtype); +void init_node_memory_type(int node, struct memory_dev_type *default_type); +void clear_node_memory_type(int node, struct memory_dev_type *memtype); #else #define numa_demotion_enabled false +/* + * 
CONFIG_NUMA implementation returns non NULL error. + */ +static inline struct memory_dev_type *alloc_memory_type(int adistance) +{ + return NULL; +} + +static inline void destroy_memory_type(struct memory_dev_type *memtype) +{ + +} + +static inline void init_node_memory_type(int node, struct memory_dev_type *default_type) +{ + +} + +static inline void clear_node_memory_type(int node, struct memory_dev_type *memtype) +{ + +} #endif /* CONFIG_NUMA */ #endif /* _LINUX_MEMORY_TIERS_H */ diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 05f05395468a..e52ccbcb2b27 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -1,6 +1,4 @@ // SPDX-License-Identifier: GPL-2.0 -#include -#include #include #include #include @@ -21,27 +19,10 @@ struct memory_tier { int adistance_start; }; -struct memory_dev_type { - /* list of memory types that are part of same tier as this type */ - struct list_head tier_sibiling; - /* abstract distance for this specific memory type */ - int adistance; - /* Nodes of same abstract distance */ - nodemask_t nodes; - struct memory_tier *memtier; -}; - static DEFINE_MUTEX(memory_tier_lock); static LIST_HEAD(memory_tiers); static struct memory_dev_type *node_memory_types[MAX_NUMNODES]; -/* - * For now we can have 4 faster memory tiers with smaller adistance - * than default DRAM tier. 
- */ -static struct memory_dev_type default_dram_type = { - .adistance = MEMTIER_ADISTANCE_DRAM, - .tier_sibiling = LIST_HEAD_INIT(default_dram_type.tier_sibiling), -}; +static struct memory_dev_type *default_dram_type; static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype) { @@ -87,6 +68,14 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty return new_memtier; } +static inline void __init_node_memory_type(int node, struct memory_dev_type *default_type) +{ + if (!node_memory_types[node]) { + node_memory_types[node] = default_type; + kref_get(&default_type->kref); + } +} + static struct memory_tier *set_node_memory_tier(int node) { struct memory_tier *memtier; @@ -97,8 +86,7 @@ static struct memory_tier *set_node_memory_tier(int node) if (!node_state(node, N_MEMORY)) return ERR_PTR(-EINVAL); - if (!node_memory_types[node]) - node_memory_types[node] = &default_dram_type; + __init_node_memory_type(node, default_dram_type); memtype = node_memory_types[node]; node_set(node, memtype->nodes); @@ -144,6 +132,57 @@ static bool clear_node_memory_tier(int node) return cleared; } +static void release_memtype(struct kref *kref) +{ + struct memory_dev_type *memtype; + + memtype = container_of(kref, struct memory_dev_type, kref); + kfree(memtype); +} + +struct memory_dev_type *alloc_memory_type(int adistance) +{ + struct memory_dev_type *memtype; + + memtype = kmalloc(sizeof(*memtype), GFP_KERNEL); + if (!memtype) + return ERR_PTR(-ENOMEM); + + memtype->adistance = adistance; + INIT_LIST_HEAD(&memtype->tier_sibiling); + memtype->nodes = NODE_MASK_NONE; + memtype->memtier = NULL; + kref_init(&memtype->kref); + return memtype; +} +EXPORT_SYMBOL_GPL(alloc_memory_type); + +void destroy_memory_type(struct memory_dev_type *memtype) +{ + kref_put(&memtype->kref, release_memtype); +} +EXPORT_SYMBOL_GPL(destroy_memory_type); + +void init_node_memory_type(int node, struct memory_dev_type *default_type) +{ + + 
mutex_lock(&memory_tier_lock); + __init_node_memory_type(node, default_type); + mutex_unlock(&memory_tier_lock); +} +EXPORT_SYMBOL_GPL(init_node_memory_type); + +void clear_node_memory_type(int node, struct memory_dev_type *memtype) +{ + mutex_lock(&memory_tier_lock); + if (node_memory_types[node] == memtype) { + node_memory_types[node] = NULL; + kref_put(&memtype->kref, release_memtype); + } + mutex_unlock(&memory_tier_lock); +} +EXPORT_SYMBOL_GPL(clear_node_memory_type); + static int __meminit memtier_hotplug_callback(struct notifier_block *self, unsigned long action, void *_arg) { @@ -178,6 +217,14 @@ static int __init memory_tier_init(void) struct memory_tier *memtier; mutex_lock(&memory_tier_lock); + /* + * For now we can have 4 faster memory tiers with smaller adistance + * than default DRAM tier. + */ + default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM); + if (IS_ERR(default_dram_type)) + panic("%s() failed to allocate default DRAM tier\n", __func__); + /* * Look at all the existing N_MEMORY nodes and add them to * default memory tier or to a tier if we already have memory From patchwork Fri Aug 12 05:57:04 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12941921 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C76F4C00140 for ; Fri, 12 Aug 2022 06:01:46 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 6429D8E0002; Fri, 12 Aug 2022 02:01:46 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 5F2398E0001; Fri, 12 Aug 2022 02:01:46 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 46BFB8E0002; Fri, 12 Aug 2022 02:01:46 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from 
relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 2FD818E0001 for ; Fri, 12 Aug 2022 02:01:46 -0400 (EDT) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id 14CC641A59 for ; Fri, 12 Aug 2022 06:01:46 +0000 (UTC) X-FDA: 79789894212.05.9994CF1 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf14.hostedemail.com (Postfix) with ESMTP id DCB69100194 for ; Fri, 12 Aug 2022 06:01:33 +0000 (UTC) Received: from pps.filterd (m0098409.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 27C5gOFZ025323; Fri, 12 Aug 2022 06:01:22 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=Tu4cCUzkZb1B4Vz1FOwo5B6Z+NbTsayHHoAepZHD8SM=; b=lCpdraBQLBgORl+HPY0V1uUYkC+jqaeDLhxhhdFJVp9CRIJKHY/2aX2YEthndGKSo3aH u7e8Jr3XPwbymB7H5RTjy2OEYaISGGvgrsX2wyp/Vt7urmV1AraK/YUbo1DeMrvGYe+U V1DVc3ngX3c4FbBjTmZhcrbmgjeNrltasKR5Ic+f2iUxIlu9CCV37kvuHeTqFuHM1Af0 V1c6Ejvf/JAbAC1CL3wXaanPkBxl+USCvnomNnqpTtWdZEvSDsaemcFUrMef3VrKtQgJ 53/VTJ4lkkWPg5IdDJbMKZV3IXjbEv+ynF0atATje7B4QMc85obj8QrM1+xvMMvcIVpg 4g== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3hwgxu0e2q-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 12 Aug 2022 06:01:22 +0000 Received: from m0098409.ppops.net (m0098409.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 27C5mcSP016394; Fri, 12 Aug 2022 06:01:21 GMT Received: from ppma03dal.us.ibm.com (b.bd.3ea9.ip4.static.sl-reverse.com [169.62.189.11]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3hwgxu0e22-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 12 Aug 2022 06:01:21 +0000 Received: from 
pps.filterd (ppma03dal.us.ibm.com [127.0.0.1]) by ppma03dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 27C5ZPnc021376; Fri, 12 Aug 2022 05:57:55 GMT Received: from b03cxnp08025.gho.boulder.ibm.com (b03cxnp08025.gho.boulder.ibm.com [9.17.130.17]) by ppma03dal.us.ibm.com with ESMTP id 3huwvkufq4-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 12 Aug 2022 05:57:55 +0000 Received: from b03ledav004.gho.boulder.ibm.com (b03ledav004.gho.boulder.ibm.com [9.17.130.235]) by b03cxnp08025.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 27C5vsvt41353530 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 12 Aug 2022 05:57:54 GMT Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 3366A7805C; Fri, 12 Aug 2022 05:57:54 +0000 (GMT) Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 432F27805F; Fri, 12 Aug 2022 05:57:48 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.116.179]) by b03ledav004.gho.boulder.ibm.com (Postfix) with ESMTP; Fri, 12 Aug 2022 05:57:47 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, Bharata B Rao , "Aneesh Kumar K.V" Subject: [PATCH v14 05/10] mm/demotion: Build demotion targets based on explicit memory tiers Date: Fri, 12 Aug 2022 11:27:04 +0530 Message-Id: <20220812055710.357820-6-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.1 In-Reply-To: <20220812055710.357820-1-aneesh.kumar@linux.ibm.com> References: <20220812055710.357820-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: x8oiO0BwIz9UmsoKX6PrNprdI2DsG_L8 X-Proofpoint-GUID: 
xHaStVtyKaJZ9dTacgBrcgm6F8z5JV8l X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-12_04,2022-08-11_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 lowpriorityscore=0 spamscore=0 bulkscore=0 clxscore=1015 phishscore=0 priorityscore=1501 impostorscore=0 suspectscore=0 adultscore=0 mlxscore=0 malwarescore=0 mlxlogscore=999 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2207270000 definitions=main-2208120015 X-Rspam-User: Authentication-Results: imf14.hostedemail.com; dkim=temperror ("DNS error when getting key") header.d=ibm.com header.s=pp1 header.b=lCpdraBQ; dmarc=temperror reason="query timed out" header.from=ibm.com (policy=temperror); spf=temperror (imf14.hostedemail.com: error in processing during lookup of aneesh.kumar@linux.ibm.com: DNS error) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Stat-Signature: kph8yb9xmy91isiir71ab4txid35eaqa X-Rspamd-Server: rspam11 X-Rspamd-Queue-Id: DCB69100194 X-HE-Tag: 1660284093-616979 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This patch switches the demotion target building logic to use memory tiers instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the default memory tier, and additional memory tiers will be added by drivers like dax kmem. This patch builds the demotion targets for a NUMA node by looking at all memory tiers below the tier to which the NUMA node belongs. The closest node in the immediately following memory tier is used as a demotion target. Since we are now only building demotion targets for N_MEMORY NUMA nodes, the CPU hotplug calls are removed in this patch. 
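The selection rule described above can be modelled outside the kernel. What follows is only a minimal userspace sketch, not the code in this patch: the names sketch_topology, sketch_demotion_targets() and sketch_example1() are hypothetical, tiers are represented as a plain per-node index rather than a list of struct memory_tier, and the preferred set is a bitmask instead of a nodemask_t. It is meant to show the rule itself: step to the next populated slower tier, then keep every node in that tier that shares the smallest NUMA distance from the source node.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_MAX_NODES 8

/*
 * Hypothetical userspace model, not kernel code: tier_of[n] is the index
 * of the memory tier that node n belongs to (0 = fastest), and dist[a][b]
 * is the NUMA distance between nodes a and b.
 */
struct sketch_topology {
	int nr_nodes;
	int tier_of[SKETCH_MAX_NODES];
	int dist[SKETCH_MAX_NODES][SKETCH_MAX_NODES];
};

/*
 * Return a bitmask of preferred demotion targets for @node: every node in
 * the next populated slower tier that shares the smallest distance from
 * @node. Returns 0 when @node already sits in the slowest populated tier.
 */
static unsigned int sketch_demotion_targets(const struct sketch_topology *t,
					    int node)
{
	int next_tier = INT_MAX, best = INT_MAX;
	unsigned int preferred = 0;
	int n;

	assert(t->nr_nodes <= SKETCH_MAX_NODES);
	assert(node >= 0 && node < t->nr_nodes);

	/* Find the next populated tier that is slower than the node's own. */
	for (n = 0; n < t->nr_nodes; n++)
		if (t->tier_of[n] > t->tier_of[node] && t->tier_of[n] < next_tier)
			next_tier = t->tier_of[n];
	if (next_tier == INT_MAX)
		return 0;

	/* Keep only the nodes of that tier at the minimum distance. */
	for (n = 0; n < t->nr_nodes; n++) {
		if (t->tier_of[n] != next_tier)
			continue;
		if (t->dist[node][n] < best) {
			best = t->dist[node][n];
			preferred = 0;
		}
		if (t->dist[node][n] == best)
			preferred |= 1u << n;
	}
	return preferred;
}

/*
 * Topology of "Example 1" in the node_demotion[] comment of this patch:
 * nodes 0 and 1 are CPU + DRAM nodes, nodes 2 and 3 are PMEM nodes.
 */
static struct sketch_topology sketch_example1(void)
{
	struct sketch_topology t = {
		.nr_nodes = 4,
		.tier_of = { 0, 0, 1, 1 },
		.dist = {
			{ 10, 20, 30, 40 },
			{ 20, 10, 40, 30 },
			{ 30, 40, 10, 40 },
			{ 40, 30, 40, 10 },
		},
	};
	return t;
}
```

With the Example 1 topology, the sketch selects node 2 for node 0 and node 3 for node 1, and returns an empty set for the PMEM nodes, which agrees with the node_demotion[] values listed in the patch comment. When several nodes tie on distance, all of them stay in the mask, mirroring the patch's choice to pick randomly among the preferred nodes at demotion time.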
Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 13 ++ include/linux/migrate.h | 13 -- mm/memory-tiers.c | 238 +++++++++++++++++++-- mm/migrate.c | 394 ----------------------------------- mm/vmstat.c | 4 - 5 files changed, 239 insertions(+), 423 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 0c739508517a..d0490ea4e35b 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -34,6 +34,14 @@ struct memory_dev_type *alloc_memory_type(int adistance); void destroy_memory_type(struct memory_dev_type *memtype); void init_node_memory_type(int node, struct memory_dev_type *default_type); void clear_node_memory_type(int node, struct memory_dev_type *memtype); +#ifdef CONFIG_MIGRATION +int next_demotion_node(int node); +#else +static inline int next_demotion_node(int node) +{ + return NUMA_NO_NODE; +} +#endif #else @@ -60,5 +68,10 @@ static inline void clear_node_memory_type(int node, struct memory_dev_type *memt { } + +static inline int next_demotion_node(int node) +{ + return NUMA_NO_NODE; +} #endif /* CONFIG_NUMA */ #endif /* _LINUX_MEMORY_TIERS_H */ diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 96f8c84413fe..704a04f5a074 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -100,19 +100,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #endif /* CONFIG_MIGRATION */ -#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA) -extern void set_migration_target_nodes(void); -extern void migrate_on_reclaim_init(void); -extern int next_demotion_node(int node); -#else -static inline void set_migration_target_nodes(void) {} -static inline void migrate_on_reclaim_init(void) {} -static inline int next_demotion_node(int node) -{ - return NUMA_NO_NODE; -} -#endif - #ifdef CONFIG_COMPACTION bool PageMovable(struct page *page); void __SetPageMovable(struct page *page, const struct movable_operations *ops); diff --git 
a/mm/memory-tiers.c b/mm/memory-tiers.c index e52ccbcb2b27..41a0bc06d169 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -6,6 +6,8 @@ #include #include +#include "internal.h" + struct memory_tier { /* hierarchy of memory tiers */ struct list_head list; @@ -19,10 +21,74 @@ struct memory_tier { int adistance_start; }; +struct demotion_nodes { + nodemask_t preferred; +}; + static DEFINE_MUTEX(memory_tier_lock); static LIST_HEAD(memory_tiers); static struct memory_dev_type *node_memory_types[MAX_NUMNODES]; static struct memory_dev_type *default_dram_type; +#ifdef CONFIG_MIGRATION +/* + * node_demotion[] examples: + * + * Example 1: + * + * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes. + * + * node distances: + * node 0 1 2 3 + * 0 10 20 30 40 + * 1 20 10 40 30 + * 2 30 40 10 40 + * 3 40 30 40 10 + * + * memory_tiers0 = 0-1 + * memory_tiers1 = 2-3 + * + * node_demotion[0].preferred = 2 + * node_demotion[1].preferred = 3 + * node_demotion[2].preferred = + * node_demotion[3].preferred = + * + * Example 2: + * + * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node. + * + * node distances: + * node 0 1 2 + * 0 10 20 30 + * 1 20 10 30 + * 2 30 30 10 + * + * memory_tiers0 = 0-2 + * + * node_demotion[0].preferred = + * node_demotion[1].preferred = + * node_demotion[2].preferred = + * + * Example 3: + * + * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node. 
+ * + * node distances: + * node 0 1 2 + * 0 10 20 30 + * 1 20 10 40 + * 2 30 40 10 + * + * memory_tiers0 = 1 + * memory_tiers1 = 0 + * memory_tiers2 = 2 + * + * node_demotion[0].preferred = 2 + * node_demotion[1].preferred = 0 + * node_demotion[2].preferred = + * + */ +static struct demotion_nodes *node_demotion __read_mostly; +#endif /* CONFIG_MIGRATION */ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype) { @@ -68,6 +134,154 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty return new_memtier; } +static struct memory_tier *__node_get_memory_tier(int node) +{ + struct memory_dev_type *memtype; + + memtype = node_memory_types[node]; + if (memtype && node_isset(node, memtype->nodes)) + return memtype->memtier; + return NULL; +} + +#ifdef CONFIG_MIGRATION +/** + * next_demotion_node() - Get the next node in the demotion path + * @node: The starting node to lookup the next node + * + * Return: node id for next memory node in the demotion path hierarchy + * from @node; NUMA_NO_NODE if @node is terminal. This does not keep + * @node online or guarantee that it *continues* to be the next demotion + * target. + */ +int next_demotion_node(int node) +{ + struct demotion_nodes *nd; + int target; + + if (!node_demotion) + return NUMA_NO_NODE; + + nd = &node_demotion[node]; + + /* + * node_demotion[] is updated without excluding this + * function from running. + * + * Make sure to use RCU over entire code blocks if + * node_demotion[] reads need to be consistent. + */ + rcu_read_lock(); + /* + * If there are multiple target nodes, just select one + * target node randomly. + * + * In addition, we can also use round-robin to select + * target node, but we should introduce another variable + * for node_demotion[] to record last selected target node, + * that may cause cache ping-pong due to the changing of + * last target node. Or introducing per-cpu data to avoid + * caching issue, which seems more complicated. 
So selecting + * target node randomly seems better until now. + */ + target = node_random(&nd->preferred); + rcu_read_unlock(); + + return target; +} + +static void disable_all_demotion_targets(void) +{ + int node; + + for_each_node_state(node, N_MEMORY) + node_demotion[node].preferred = NODE_MASK_NONE; + /* + * Ensure that the "disable" is visible across the system. + * Readers will see either a combination of before+disable + * state or disable+after. They will never see before and + * after state together. + */ + synchronize_rcu(); +} + +static __always_inline nodemask_t get_memtier_nodemask(struct memory_tier *memtier) +{ + nodemask_t nodes = NODE_MASK_NONE; + struct memory_dev_type *memtype; + + list_for_each_entry(memtype, &memtier->memory_types, tier_sibiling) + nodes_or(nodes, nodes, memtype->nodes); + + return nodes; +} + +/* + * Find an automatic demotion target for all memory + * nodes. Failing here is OK. It might just indicate + * being at the end of a chain. + */ +static void establish_demotion_targets(void) +{ + struct memory_tier *memtier; + struct demotion_nodes *nd; + int target = NUMA_NO_NODE, node; + int distance, best_distance; + nodemask_t tier_nodes; + + lockdep_assert_held_once(&memory_tier_lock); + + if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION)) + return; + + disable_all_demotion_targets(); + + for_each_node_state(node, N_MEMORY) { + best_distance = -1; + nd = &node_demotion[node]; + + memtier = __node_get_memory_tier(node); + if (!memtier || list_is_last(&memtier->list, &memory_tiers)) + continue; + /* + * Get the lower memtier to find the demotion node list. + */ + memtier = list_next_entry(memtier, list); + tier_nodes = get_memtier_nodemask(memtier); + /* + * find_next_best_node, use 'used' nodemask as a skip list. + * Add all memory nodes except the selected memory tier + * nodelist to skip list so that we find the best node from the + * memtier nodelist. 
+ */ + nodes_andnot(tier_nodes, node_states[N_MEMORY], tier_nodes); + + /* + * Find all the nodes in the memory tier node list of same best distance. + * add them to the preferred mask. We randomly select between nodes + * in the preferred mask when allocating pages during demotion. + */ + do { + target = find_next_best_node(node, &tier_nodes); + if (target == NUMA_NO_NODE) + break; + + distance = node_distance(node, target); + if (distance == best_distance || best_distance == -1) { + best_distance = distance; + node_set(target, nd->preferred); + } else { + break; + } + } while (1); + } +} + +#else +static inline void disable_all_demotion_targets(void) {} +static inline void establish_demotion_targets(void) {} +#endif /* CONFIG_MIGRATION */ + static inline void __init_node_memory_type(int node, struct memory_dev_type *default_type) { if (!node_memory_types[node]) { @@ -94,16 +308,6 @@ static struct memory_tier *set_node_memory_tier(int node) return memtier; } -static struct memory_tier *__node_get_memory_tier(int node) -{ - struct memory_dev_type *memtype; - - memtype = node_memory_types[node]; - if (memtype && node_isset(node, memtype->nodes)) - return memtype->memtier; - return NULL; -} - static void destroy_memory_tier(struct memory_tier *memtier) { list_del(&memtier->list); @@ -186,6 +390,7 @@ EXPORT_SYMBOL_GPL(clear_node_memory_type); static int __meminit memtier_hotplug_callback(struct notifier_block *self, unsigned long action, void *_arg) { + struct memory_tier *memtier; struct memory_notify *arg = _arg; /* @@ -198,12 +403,15 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self, switch (action) { case MEM_OFFLINE: mutex_lock(&memory_tier_lock); - clear_node_memory_tier(arg->status_change_nid); + if (clear_node_memory_tier(arg->status_change_nid)) + establish_demotion_targets(); mutex_unlock(&memory_tier_lock); break; case MEM_ONLINE: mutex_lock(&memory_tier_lock); - set_node_memory_tier(arg->status_change_nid); + memtier = 
set_node_memory_tier(arg->status_change_nid); + if (!IS_ERR(memtier)) + establish_demotion_targets(); mutex_unlock(&memory_tier_lock); break; } @@ -216,6 +424,11 @@ static int __init memory_tier_init(void) int node; struct memory_tier *memtier; +#ifdef CONFIG_MIGRATION + node_demotion = kcalloc(nr_node_ids, sizeof(struct demotion_nodes), + GFP_KERNEL); + WARN_ON(!node_demotion); +#endif mutex_lock(&memory_tier_lock); /* * For now we can have 4 faster memory tiers with smaller adistance @@ -238,6 +451,7 @@ static int __init memory_tier_init(void) */ break; } + establish_demotion_targets(); mutex_unlock(&memory_tier_lock); hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRIO); diff --git a/mm/migrate.c b/mm/migrate.c index 5d7fb417edbf..ea86594f4bc5 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2170,398 +2170,4 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, return 0; } #endif /* CONFIG_NUMA_BALANCING */ - -/* - * node_demotion[] example: - * - * Consider a system with two sockets. Each socket has - * three classes of memory attached: fast, medium and slow. - * Each memory class is placed in its own NUMA node. The - * CPUs are placed in the node with the "fast" memory. The - * 6 NUMA nodes (0-5) might be split among the sockets like - * this: - * - * Socket A: 0, 1, 2 - * Socket B: 3, 4, 5 - * - * When Node 0 fills up, its memory should be migrated to - * Node 1. When Node 1 fills up, it should be migrated to - * Node 2. 
The migration path start on the nodes with the - * processors (since allocations default to this node) and - * fast memory, progress through medium and end with the - * slow memory: - * - * 0 -> 1 -> 2 -> stop - * 3 -> 4 -> 5 -> stop - * - * This is represented in the node_demotion[] like this: - * - * { nr=1, nodes[0]=1 }, // Node 0 migrates to 1 - * { nr=1, nodes[0]=2 }, // Node 1 migrates to 2 - * { nr=0, nodes[0]=-1 }, // Node 2 does not migrate - * { nr=1, nodes[0]=4 }, // Node 3 migrates to 4 - * { nr=1, nodes[0]=5 }, // Node 4 migrates to 5 - * { nr=0, nodes[0]=-1 }, // Node 5 does not migrate - * - * Moreover some systems may have multiple slow memory nodes. - * Suppose a system has one socket with 3 memory nodes, node 0 - * is fast memory type, and node 1/2 both are slow memory - * type, and the distance between fast memory node and slow - * memory node is same. So the migration path should be: - * - * 0 -> 1/2 -> stop - * - * This is represented in the node_demotion[] like this: - * { nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2 - * { nr=0, nodes[0]=-1, }, // Node 1 dose not migrate - * { nr=0, nodes[0]=-1, }, // Node 2 does not migrate - */ - -/* - * Writes to this array occur without locking. Cycles are - * not allowed: Node X demotes to Y which demotes to X... - * - * If multiple reads are performed, a single rcu_read_lock() - * must be held over all reads to ensure that no cycles are - * observed. 
- */ -#define DEFAULT_DEMOTION_TARGET_NODES 15 - -#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES -#define DEMOTION_TARGET_NODES (MAX_NUMNODES - 1) -#else -#define DEMOTION_TARGET_NODES DEFAULT_DEMOTION_TARGET_NODES -#endif - -struct demotion_nodes { - unsigned short nr; - short nodes[DEMOTION_TARGET_NODES]; -}; - -static struct demotion_nodes *node_demotion __read_mostly; - -/** - * next_demotion_node() - Get the next node in the demotion path - * @node: The starting node to lookup the next node - * - * Return: node id for next memory node in the demotion path hierarchy - * from @node; NUMA_NO_NODE if @node is terminal. This does not keep - * @node online or guarantee that it *continues* to be the next demotion - * target. - */ -int next_demotion_node(int node) -{ - struct demotion_nodes *nd; - unsigned short target_nr, index; - int target; - - if (!node_demotion) - return NUMA_NO_NODE; - - nd = &node_demotion[node]; - - /* - * node_demotion[] is updated without excluding this - * function from running. RCU doesn't provide any - * compiler barriers, so the READ_ONCE() is required - * to avoid compiler reordering or read merging. - * - * Make sure to use RCU over entire code blocks if - * node_demotion[] reads need to be consistent. - */ - rcu_read_lock(); - target_nr = READ_ONCE(nd->nr); - - switch (target_nr) { - case 0: - target = NUMA_NO_NODE; - goto out; - case 1: - index = 0; - break; - default: - /* - * If there are multiple target nodes, just select one - * target node randomly. - * - * In addition, we can also use round-robin to select - * target node, but we should introduce another variable - * for node_demotion[] to record last selected target node, - * that may cause cache ping-pong due to the changing of - * last target node. Or introducing per-cpu data to avoid - * caching issue, which seems more complicated. So selecting - * target node randomly seems better until now. 
- */ - index = get_random_int() % target_nr; - break; - } - - target = READ_ONCE(nd->nodes[index]); - -out: - rcu_read_unlock(); - return target; -} - -/* Disable reclaim-based migration. */ -static void __disable_all_migrate_targets(void) -{ - int node, i; - - if (!node_demotion) - return; - - for_each_online_node(node) { - node_demotion[node].nr = 0; - for (i = 0; i < DEMOTION_TARGET_NODES; i++) - node_demotion[node].nodes[i] = NUMA_NO_NODE; - } -} - -static void disable_all_migrate_targets(void) -{ - __disable_all_migrate_targets(); - - /* - * Ensure that the "disable" is visible across the system. - * Readers will see either a combination of before+disable - * state or disable+after. They will never see before and - * after state together. - * - * The before+after state together might have cycles and - * could cause readers to do things like loop until this - * function finishes. This ensures they can only see a - * single "bad" read and would, for instance, only loop - * once. - */ - synchronize_rcu(); -} - -/* - * Find an automatic demotion target for 'node'. - * Failing here is OK. It might just indicate - * being at the end of a chain. - */ -static int establish_migrate_target(int node, nodemask_t *used, - int best_distance) -{ - int migration_target, index, val; - struct demotion_nodes *nd; - - if (!node_demotion) - return NUMA_NO_NODE; - - nd = &node_demotion[node]; - - migration_target = find_next_best_node(node, used); - if (migration_target == NUMA_NO_NODE) - return NUMA_NO_NODE; - - /* - * If the node has been set a migration target node before, - * which means it's the best distance between them. Still - * check if this node can be demoted to other target nodes - * if they have a same best distance. 
- */ - if (best_distance != -1) { - val = node_distance(node, migration_target); - if (val > best_distance) - goto out_clear; - } - - index = nd->nr; - if (WARN_ONCE(index >= DEMOTION_TARGET_NODES, - "Exceeds maximum demotion target nodes\n")) - goto out_clear; - - nd->nodes[index] = migration_target; - nd->nr++; - - return migration_target; -out_clear: - node_clear(migration_target, *used); - return NUMA_NO_NODE; -} - -/* - * When memory fills up on a node, memory contents can be - * automatically migrated to another node instead of - * discarded at reclaim. - * - * Establish a "migration path" which will start at nodes - * with CPUs and will follow the priorities used to build the - * page allocator zonelists. - * - * The difference here is that cycles must be avoided. If - * node0 migrates to node1, then neither node1, nor anything - * node1 migrates to can migrate to node0. Also one node can - * be migrated to multiple nodes if the target nodes all have - * a same best-distance against the source node. - * - * This function can run simultaneously with readers of - * node_demotion[]. However, it can not run simultaneously - * with itself. Exclusion is provided by memory hotplug events - * being single-threaded. - */ -static void __set_migration_target_nodes(void) -{ - nodemask_t next_pass; - nodemask_t this_pass; - nodemask_t used_targets = NODE_MASK_NONE; - int node, best_distance; - - /* - * Avoid any oddities like cycles that could occur - * from changes in the topology. This will leave - * a momentary gap when migration is disabled. - */ - disable_all_migrate_targets(); - - /* - * Allocations go close to CPUs, first. Assume that - * the migration path starts at the nodes with CPUs. - */ - next_pass = node_states[N_CPU]; -again: - this_pass = next_pass; - next_pass = NODE_MASK_NONE; - /* - * To avoid cycles in the migration "graph", ensure - * that migration sources are not future targets by - * setting them in 'used_targets'. 
Do this only - * once per pass so that multiple source nodes can - * share a target node. - * - * 'used_targets' will become unavailable in future - * passes. This limits some opportunities for - * multiple source nodes to share a destination. - */ - nodes_or(used_targets, used_targets, this_pass); - - for_each_node_mask(node, this_pass) { - best_distance = -1; - - /* - * Try to set up the migration path for the node, and the target - * migration nodes can be multiple, so doing a loop to find all - * the target nodes if they all have a best node distance. - */ - do { - int target_node = - establish_migrate_target(node, &used_targets, - best_distance); - - if (target_node == NUMA_NO_NODE) - break; - - if (best_distance == -1) - best_distance = node_distance(node, target_node); - - /* - * Visit targets from this pass in the next pass. - * Eventually, every node will have been part of - * a pass, and will become set in 'used_targets'. - */ - node_set(target_node, next_pass); - } while (1); - } - /* - * 'next_pass' contains nodes which became migration - * targets in this pass. Make additional passes until - * no more migrations targets are available. - */ - if (!nodes_empty(next_pass)) - goto again; -} - -/* - * For callers that do not hold get_online_mems() already. - */ -void set_migration_target_nodes(void) -{ - get_online_mems(); - __set_migration_target_nodes(); - put_online_mems(); -} - -/* - * This leaves migrate-on-reclaim transiently disabled between - * the MEM_GOING_OFFLINE and MEM_OFFLINE events. This runs - * whether reclaim-based migration is enabled or not, which - * ensures that the user can turn reclaim-based migration at - * any time without needing to recalculate migration targets. - * - * These callbacks already hold get_online_mems(). That is why - * __set_migration_target_nodes() can be used as opposed to - * set_migration_target_nodes(). 
- */ -#ifdef CONFIG_MEMORY_HOTPLUG -static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, - unsigned long action, void *_arg) -{ - struct memory_notify *arg = _arg; - - /* - * Only update the node migration order when a node is - * changing status, like online->offline. This avoids - * the overhead of synchronize_rcu() in most cases. - */ - if (arg->status_change_nid < 0) - return notifier_from_errno(0); - - switch (action) { - case MEM_GOING_OFFLINE: - /* - * Make sure there are not transient states where - * an offline node is a migration target. This - * will leave migration disabled until the offline - * completes and the MEM_OFFLINE case below runs. - */ - disable_all_migrate_targets(); - break; - case MEM_OFFLINE: - case MEM_ONLINE: - /* - * Recalculate the target nodes once the node - * reaches its final state (online or offline). - */ - __set_migration_target_nodes(); - break; - case MEM_CANCEL_OFFLINE: - /* - * MEM_GOING_OFFLINE disabled all the migration - * targets. Reenable them. - */ - __set_migration_target_nodes(); - break; - case MEM_GOING_ONLINE: - case MEM_CANCEL_ONLINE: - break; - } - - return notifier_from_errno(0); -} -#endif - -void __init migrate_on_reclaim_init(void) -{ - node_demotion = kcalloc(nr_node_ids, - sizeof(struct demotion_nodes), - GFP_KERNEL); - WARN_ON(!node_demotion); -#ifdef CONFIG_MEMORY_HOTPLUG - hotplug_memory_notifier(migrate_on_reclaim_callback, 100); -#endif - /* - * At this point, all numa nodes with memory/CPus have their state - * properly set, so we can build the demotion order now. - * Let us hold the cpu_hotplug lock just, as we could possibily have - * CPU hotplug events during boot. 
- */ - cpus_read_lock(); - set_migration_target_nodes(); - cpus_read_unlock(); -} #endif /* CONFIG_NUMA */ - - diff --git a/mm/vmstat.c b/mm/vmstat.c index 373d2730fcf2..35c6ff97cf29 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -28,7 +28,6 @@ #include #include #include -#include #include "internal.h" @@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu) if (!node_state(cpu_to_node(cpu), N_CPU)) { node_set_state(cpu_to_node(cpu), N_CPU); - set_migration_target_nodes(); } return 0; @@ -2085,7 +2083,6 @@ static int vmstat_cpu_dead(unsigned int cpu) return 0; node_clear_state(node, N_CPU); - set_migration_target_nodes(); return 0; } @@ -2118,7 +2115,6 @@ void __init init_mm_internals(void) start_shepherd_timer(); #endif - migrate_on_reclaim_init(); #ifdef CONFIG_PROC_FS proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op); proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op); From patchwork Fri Aug 12 05:57:05 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12941917 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C7C2AC25B0F for ; Fri, 12 Aug 2022 05:58:24 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 3856A8E0009; Fri, 12 Aug 2022 01:58:24 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 30CC08E0001; Fri, 12 Aug 2022 01:58:24 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 0253B8E0009; Fri, 12 Aug 2022 01:58:23 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id E367E8E0001 for ; Fri, 12 Aug 2022 01:58:23 -0400 (EDT) Received: from 
skywalker.ibmuc.com (unknown [9.43.116.179]) by b03ledav004.gho.boulder.ibm.com (Postfix) with ESMTP; Fri, 12 Aug 2022 05:57:54 +0000 (GMT)
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen, Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams, Johannes Weiner, jvgediya.oss@gmail.com, Bharata B Rao, "Aneesh Kumar K.V"
Subject: [PATCH v14 06/10] mm/demotion: Add pg_data_t member to track node memory tier details
Date: Fri, 12 Aug 2022 11:27:05 +0530
Message-Id: <20220812055710.357820-7-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.37.1
In-Reply-To: <20220812055710.357820-1-aneesh.kumar@linux.ibm.com>
References: <20220812055710.357820-1-aneesh.kumar@linux.ibm.com>
MIME-Version: 1.0
Also update the various helpers to use NODE_DATA()->memtier. Since the node-specific memtier can change when a NUMA node is reassigned to a different memory tier, accessing NODE_DATA()->memtier must happen under an RCU read lock or memory_tier_lock.
Signed-off-by: Aneesh Kumar K.V --- include/linux/mmzone.h | 3 +++ mm/memory-tiers.c | 40 +++++++++++++++++++++++++++++++++++----- 2 files changed, 38 insertions(+), 5 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index e24b40c52468..7d78133fe8dd 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1012,6 +1012,9 @@ typedef struct pglist_data { /* Per-node vmstats */ struct per_cpu_nodestat __percpu *per_cpu_nodestats; atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS]; +#ifdef CONFIG_NUMA + struct memory_tier __rcu *memtier; +#endif } pg_data_t; #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages) diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 41a0bc06d169..315b9fe14c48 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -4,6 +4,7 @@ #include #include #include +#include #include #include "internal.h" @@ -136,12 +137,18 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty static struct memory_tier *__node_get_memory_tier(int node) { - struct memory_dev_type *memtype; + pg_data_t *pgdat; - memtype = node_memory_types[node]; - if (memtype && node_isset(node, memtype->nodes)) - return memtype->memtier; - return NULL; + pgdat = NODE_DATA(node); + if (!pgdat) + return NULL; + /* + * Since we hold memory_tier_lock, we can avoid + * RCU read locks when accessing the details. No + * parallel updates are possible here.
+ */ + return rcu_dereference_check(pgdat->memtier, + lockdep_is_held(&memory_tier_lock)); } #ifdef CONFIG_MIGRATION @@ -294,6 +301,8 @@ static struct memory_tier *set_node_memory_tier(int node) { struct memory_tier *memtier; struct memory_dev_type *memtype; + pg_data_t *pgdat = NODE_DATA(node); + lockdep_assert_held_once(&memory_tier_lock); @@ -305,24 +314,45 @@ static struct memory_tier *set_node_memory_tier(int node) memtype = node_memory_types[node]; node_set(node, memtype->nodes); memtier = find_create_memory_tier(memtype); + if (!IS_ERR(memtier)) + rcu_assign_pointer(pgdat->memtier, memtier); return memtier; } static void destroy_memory_tier(struct memory_tier *memtier) { list_del(&memtier->list); + /* + * synchronize_rcu in clear_node_memory_tier makes sure + * we don't have rcu access to this memory tier. + */ kfree(memtier); } static bool clear_node_memory_tier(int node) { bool cleared = false; + pg_data_t *pgdat; struct memory_tier *memtier; + pgdat = NODE_DATA(node); + if (!pgdat) + return false; + + /* + * Make sure that anybody looking at NODE_DATA who finds + * a valid memtier finds memory_dev_types with nodes still + * linked to the memtier. We achieve this by waiting for + * rcu read section to finish using synchronize_rcu. 
+ * This also enables us to free the destroyed memory tier + * with kfree instead of kfree_rcu + */ memtier = __node_get_memory_tier(node); if (memtier) { struct memory_dev_type *memtype; + rcu_assign_pointer(pgdat->memtier, NULL); + synchronize_rcu(); memtype = node_memory_types[node]; node_clear(node, memtype->nodes); if (nodes_empty(memtype->nodes)) { From patchwork Fri Aug 12 05:57:06 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12941916 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0D206C19F2D for ; Fri, 12 Aug 2022 05:58:24 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 9EB4A8E0008; Fri, 12 Aug 2022 01:58:23 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 973AD8E0001; Fri, 12 Aug 2022 01:58:23 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 814698E0008; Fri, 12 Aug 2022 01:58:23 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id 7024A8E0001 for ; Fri, 12 Aug 2022 01:58:23 -0400 (EDT) Received: from smtpin01.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id 4D067120EE4 for ; Fri, 12 Aug 2022 05:58:23 +0000 (UTC) X-FDA: 79789885686.01.81E7793 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf29.hostedemail.com (Postfix) with ESMTP id CD93C120176 for ; Fri, 12 Aug 2022 05:58:22 +0000 (UTC) Received: from pps.filterd (m0098404.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 27C50Ae5015871; Fri, 12 Aug 2022 05:58:10 GMT 
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen, Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams, Johannes Weiner, jvgediya.oss@gmail.com, Bharata B Rao, "Aneesh Kumar K.V"
Subject: [PATCH v14 07/10] mm/demotion: Drop memtier from memtype
Date: Fri, 12 Aug 2022 11:27:06 +0530
Message-Id: <20220812055710.357820-8-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.37.1
In-Reply-To: <20220812055710.357820-1-aneesh.kumar@linux.ibm.com>
References: <20220812055710.357820-1-aneesh.kumar@linux.ibm.com>
MIME-Version: 1.0

Now that we track node-specific memtier in pg_data_t, we can drop memtier from memtype.
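For illustration only, the lookup that takes the place of the cached memtype->memtier pointer can be sketched in userspace C as below. This is a minimal sketch, not the kernel code: struct tier, the singly linked list, and the helper names round_down_to() and lookup_tier() are hypothetical stand-ins. The patch itself walks memory_tiers with list_for_each_entry() and rounds with round_down() and memtier_adistance_chunk_size.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct memory_tier; only the field used here. */
struct tier {
	int adistance_start;	/* first abstract distance covered by this tier */
	struct tier *next;	/* simplified list; the kernel uses struct list_head */
};

/* Round v down to a multiple of chunk (assumes v >= 0 and chunk > 0). */
static int round_down_to(int v, int chunk)
{
	return v - (v % chunk);
}

/*
 * Find the tier whose range starts at the rounded-down adistance.
 * Returns NULL when no such tier exists, which corresponds to the
 * WARN_ON()/ERR_PTR(-EINVAL) path for an already-linked memtype.
 */
static struct tier *lookup_tier(struct tier *head, int adistance, int chunk)
{
	int start = round_down_to(adistance, chunk);
	struct tier *t;

	for (t = head; t; t = t->next)
		if (t->adistance_start == start)
			return t;
	return NULL;
}
```

The tier is thus derived from the memtype's adistance on demand instead of being stored in the memtype, so there is no second copy to keep in sync with pg_data_t.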
Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h |  1 -
 mm/memory-tiers.c            | 16 +++++++++-------
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index d0490ea4e35b..dd86323d2ba0 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -25,7 +25,6 @@ struct memory_dev_type {
 	/* Nodes of same abstract distance */
 	nodemask_t nodes;
 	struct kref kref;
-	struct memory_tier *memtier;
 };

 #ifdef CONFIG_NUMA
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 315b9fe14c48..9d53d4c14a5e 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -100,17 +100,22 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty

 	lockdep_assert_held_once(&memory_tier_lock);

+	adistance = round_down(adistance, memtier_adistance_chunk_size);
 	/*
 	 * If the memtype is already part of a memory tier,
 	 * just return that.
 	 */
-	if (memtype->memtier)
-		return memtype->memtier;
+	if (!list_empty(&memtype->tier_sibiling)) {
+		list_for_each_entry(memtier, &memory_tiers, list) {
+			if (adistance == memtier->adistance_start)
+				return memtier;
+		}
+		WARN_ON(1);
+		return ERR_PTR(-EINVAL);
+	}

-	adistance = round_down(adistance, memtier_adistance_chunk_size);
 	list_for_each_entry(memtier, &memory_tiers, list) {
 		if (adistance == memtier->adistance_start) {
-			memtype->memtier = memtier;
 			list_add(&memtype->tier_sibiling, &memtier->memory_types);
 			return memtier;
 		} else if (adistance < memtier->adistance_start) {
@@ -130,7 +135,6 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty
 		list_add_tail(&new_memtier->list, &memtier->list);
 	else
 		list_add_tail(&new_memtier->list, &memory_tiers);
-	memtype->memtier = new_memtier;
 	list_add(&memtype->tier_sibiling, &new_memtier->memory_types);
 	return new_memtier;
 }
@@ -357,7 +361,6 @@ static bool clear_node_memory_tier(int node)
 		node_clear(node, memtype->nodes);
 		if (nodes_empty(memtype->nodes)) {
 			list_del(&memtype->tier_sibiling);
-			memtype->memtier = NULL;
 			if (list_empty(&memtier->memory_types))
 				destroy_memory_tier(memtier);
 		}
@@ -385,7 +388,6 @@ struct memory_dev_type *alloc_memory_type(int adistance)
 	memtype->adistance = adistance;
 	INIT_LIST_HEAD(&memtype->tier_sibiling);
 	memtype->nodes = NODE_MASK_NONE;
-	memtype->memtier = NULL;
 	kref_init(&memtype->kref);
 	return memtype;
 }

From patchwork Fri Aug 12 05:57:07 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12941919
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss@gmail.com, Bharata B Rao,
	"Aneesh Kumar K . V"
Subject: [PATCH v14 08/10] mm/demotion: Demote pages according to allocation fallback order
Date: Fri, 12 Aug 2022 11:27:07 +0530
Message-Id: <20220812055710.357820-9-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.37.1
In-Reply-To: <20220812055710.357820-1-aneesh.kumar@linux.ibm.com>
References: <20220812055710.357820-1-aneesh.kumar@linux.ibm.com>
MIME-Version: 1.0

From: Jagdish Gediya

Currently, a higher tier node can only be demoted to selected nodes on the
next lower tier as defined by the demotion path. This strict demotion order
does not work in all use cases (e.g. some use cases may want to allow
cross-socket demotion to another node in the same demotion tier as a
fallback when the preferred demotion node is out of space).

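The fallback case mentioned above can be pictured with a small userspace
sketch. It is only an illustration under stated assumptions, not kernel code:
toy_nodemask_t, struct toy_nodes, and both toy_pick_* helpers are hypothetical
names, and "free pages" is reduced to a plain counter per node. The strict
variant gives up when the preferred node is full, while the fallback variant
goes on to try the other allowed nodes.

```c
#include <assert.h>
#include <stdint.h>

#define NO_NODE (-1)	/* stand-in for the kernel's NUMA_NO_NODE */

/* Hypothetical mask type: bit n set means node n is an allowed target. */
typedef uint64_t toy_nodemask_t;

/* Hypothetical per-node free page counters (not a kernel structure). */
struct toy_nodes {
	int free_pages[64];
};

/* Strict order: only the preferred node is ever tried. */
static int toy_pick_strict(const struct toy_nodes *n, int preferred)
{
	assert(preferred >= 0 && preferred < 64);
	return n->free_pages[preferred] > 0 ? preferred : NO_NODE;
}

/* Fallback order: preferred node first, then any other allowed node. */
static int toy_pick_with_fallback(const struct toy_nodes *n, int preferred,
				  toy_nodemask_t allowed)
{
	int node;

	assert(preferred >= 0 && preferred < 64);
	if (n->free_pages[preferred] > 0)
		return preferred;

	for (node = 0; node < 64; node++) {
		if ((allowed & (UINT64_C(1) << node)) && n->free_pages[node] > 0)
			return node;
	}
	return NO_NODE;
}
```

With node 2 preferred but full and node 3 allowed and free, the strict helper
returns NO_NODE while the fallback helper returns 3.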
This demotion order is also inconsistent with the page allocation fallback
order when all the nodes in a higher tier are out of space: The page
allocation can fall back to any node from any lower tier, whereas the
demotion order doesn't allow that currently.

This patch adds support to get all the allowed demotion targets for a
memory tier. demote_page_list() function is now modified to utilize this
allowed node mask as the fallback allocation mask.

Signed-off-by: Jagdish Gediya
Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h | 12 ++++++++
 mm/memory-tiers.c            | 51 +++++++++++++++++++++++++++++--
 mm/vmscan.c                  | 58 ++++++++++++++++++++++++++----------
 3 files changed, 103 insertions(+), 18 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index dd86323d2ba0..6fdff436c205 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -5,6 +5,7 @@
 #include
 #include
 #include
+#include
 /*
  * Each tier cover a abstrace distance chunk size of 128
  */
@@ -35,11 +36,17 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
 void clear_node_memory_type(int node, struct memory_dev_type *memtype);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
+void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
 #else
 static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
+
+static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	*targets = NODE_MASK_NONE;
+}
 #endif

 #else
@@ -72,5 +79,10 @@ static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
+
+static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	*targets = NODE_MASK_NONE;
+}
 #endif /* CONFIG_NUMA */
 #endif /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 9d53d4c14a5e..2db4b4116a28 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -4,7 +4,6 @@
 #include
 #include
 #include
-#include
 #include

 #include "internal.h"
@@ -20,6 +19,8 @@ struct memory_tier {
 	 * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE
 	 */
 	int adistance_start;
+	/* All the nodes that are part of all the lower memory tiers. */
+	nodemask_t lower_tier_mask;
 };

 struct demotion_nodes {
@@ -156,6 +157,24 @@ static struct memory_tier *__node_get_memory_tier(int node)
 }

 #ifdef CONFIG_MIGRATION
+void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	struct memory_tier *memtier;
+
+	/*
+	 * pg_data_t.memtier updates includes a synchronize_rcu()
+	 * which ensures that we either find NULL or a valid memtier
+	 * in NODE_DATA. protect the access via rcu_read_lock();
+	 */
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (memtier)
+		*targets = memtier->lower_tier_mask;
+	else
+		*targets = NODE_MASK_NONE;
+	rcu_read_unlock();
+}
+
 /**
  * next_demotion_node() - Get the next node in the demotion path
  * @node: The starting node to lookup the next node
@@ -203,10 +222,19 @@ int next_demotion_node(int node)

 static void disable_all_demotion_targets(void)
 {
+	struct memory_tier *memtier;
 	int node;

-	for_each_node_state(node, N_MEMORY)
+	for_each_node_state(node, N_MEMORY) {
 		node_demotion[node].preferred = NODE_MASK_NONE;
+		/*
+		 * We are holding memory_tier_lock, it is safe
+		 * to access pgda->memtier.
+		 */
+		memtier = __node_get_memory_tier(node);
+		if (memtier)
+			memtier->lower_tier_mask = NODE_MASK_NONE;
+	}
 	/*
 	 * Ensure that the "disable" is visible across the system.
 	 * Readers will see either a combination of before+disable
@@ -238,7 +266,7 @@ static void establish_demotion_targets(void)
 	struct demotion_nodes *nd;
 	int target = NUMA_NO_NODE, node;
 	int distance, best_distance;
-	nodemask_t tier_nodes;
+	nodemask_t tier_nodes, lower_tier;

 	lockdep_assert_held_once(&memory_tier_lock);

@@ -286,6 +314,23 @@ static void establish_demotion_targets(void)
 			}
 		} while (1);
 	}
+	/*
+	 * Now build the lower_tier mask for each node collecting node mask from
+	 * all memory tier below it. This allows us to fallback demotion page
+	 * allocation to a set of nodes that is closer the above selected
+	 * perferred node.
+	 */
+	lower_tier = node_states[N_MEMORY];
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		/*
+		 * Keep removing current tier from lower_tier nodes,
+		 * This will remove all nodes in current and above
+		 * memory tier from the lower_tier mask.
+		 */
+		tier_nodes = get_memtier_nodemask(memtier);
+		nodes_andnot(lower_tier, lower_tier, tier_nodes);
+		memtier->lower_tier_mask = lower_tier;
+	}
 }

 #else
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 224de380ac88..500b9054be18 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1521,21 +1521,34 @@ static void folio_check_dirty_writeback(struct folio *folio,
 		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }

-static struct page *alloc_demote_page(struct page *page, unsigned long node)
+static struct page *alloc_demote_page(struct page *page, unsigned long private)
 {
-	struct migration_target_control mtc = {
-		/*
-		 * Allocate from 'node', or fail quickly and quietly.
-		 * When this happens, 'page' will likely just be discarded
-		 * instead of migrated.
-		 */
-		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
-			    __GFP_THISNODE | __GFP_NOWARN |
-			    __GFP_NOMEMALLOC | GFP_NOWAIT,
-		.nid = node
-	};
+	struct page *target_page;
+	nodemask_t *allowed_mask;
+	struct migration_target_control *mtc;
+
+	mtc = (struct migration_target_control *)private;
+
+	allowed_mask = mtc->nmask;
+	/*
+	 * make sure we allocate from the target node first also trying to
+	 * demote or reclaim pages from the target node via kswapd if we are
+	 * low on free memory on target node. If we don't do this and if
+	 * we have free memory on the slower(lower) memtier, we would start
+	 * allocating pages from slower(lower) memory tiers without even forcing
+	 * a demotion of cold pages from the target memtier. This can result
+	 * in the kernel placing hot pages in slower(lower) memory tiers.
+	 */
+	mtc->nmask = NULL;
+	mtc->gfp_mask |= __GFP_THISNODE;
+	target_page = alloc_migration_target(page, (unsigned long)mtc);
+	if (target_page)
+		return target_page;

-	return alloc_migration_target(page, (unsigned long)&mtc);
+	mtc->gfp_mask &= ~__GFP_THISNODE;
+	mtc->nmask = allowed_mask;
+
+	return alloc_migration_target(page, (unsigned long)mtc);
 }

 /*
@@ -1548,6 +1561,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
+	nodemask_t allowed_mask;
+
+	struct migration_target_control mtc = {
+		/*
+		 * Allocate from 'node', or fail quickly and quietly.
+		 * When this happens, 'page' will likely just be discarded
+		 * instead of migrated.
+		 */
+		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
+			__GFP_NOMEMALLOC | GFP_NOWAIT,
+		.nid = target_nid,
+		.nmask = &allowed_mask
+	};

 	if (list_empty(demote_pages))
 		return 0;
@@ -1555,10 +1581,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 	if (target_nid == NUMA_NO_NODE)
 		return 0;

+	node_get_allowed_targets(pgdat, &allowed_mask);
+
 	/* Demotion ignores all cpuset and mempolicy settings */
 	migrate_pages(demote_pages, alloc_demote_page, NULL,
-			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
-			    &nr_succeeded);
+		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
+		      &nr_succeeded);

 	if (current_is_kswapd())
 		__count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);

From patchwork Fri Aug 12 05:57:08 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12941918
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss@gmail.com, Bharata B Rao,
	"Aneesh Kumar K.V"
Subject: [PATCH v14 09/10] mm/demotion: Update node_is_toptier to work with memory tiers
Date: Fri, 12 Aug 2022 11:27:08 +0530
Message-Id: <20220812055710.357820-10-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.37.1
In-Reply-To: <20220812055710.357820-1-aneesh.kumar@linux.ibm.com>
References: <20220812055710.357820-1-aneesh.kumar@linux.ibm.com>
MIME-Version: 1.0

With memory tier support we can have memory only NUMA nodes in the top tier from which
we want to avoid promotion tracking NUMA faults. Update node_is_toptier to work
with memory tiers. All NUMA nodes are by default top tier nodes. With
lower(slower) memory tiers added we consider all memory tiers above a memory
tier having CPU NUMA nodes as a top memory tier.

Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h | 11 +++++++++
 include/linux/node.h         |  5 ----
 mm/huge_memory.c             |  1 +
 mm/memory-tiers.c            | 46 ++++++++++++++++++++++++++++++++++++
 mm/migrate.c                 |  1 +
 mm/mprotect.c                |  1 +
 6 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 6fdff436c205..9198d69afaa9 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -37,6 +37,7 @@ void clear_node_memory_type(int node, struct memory_dev_type *memtype);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
+bool node_is_toptier(int node);
 #else
 static inline int next_demotion_node(int node)
 {
@@ -47,6 +48,11 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
 {
 	*targets = NODE_MASK_NONE;
 }
+
+static inline bool node_is_toptier(int node)
+{
+	return true;
+}
 #endif

 #else
@@ -84,5 +90,10 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
 {
 	*targets = NODE_MASK_NONE;
 }
+
+static inline bool node_is_toptier(int node)
+{
+	return true;
+}
 #endif /* CONFIG_NUMA */
 #endif /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/node.h b/include/linux/node.h
index 40d641a8bfb0..9ec680dd607f 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -185,9 +185,4 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,

 #define to_node(device) container_of(device, struct node, dev)

-static inline bool node_is_toptier(int node)
-{
-	return node_state(node, N_CPU);
-}
-
 #endif /* _LINUX_NODE_H_ */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8a7c1b344abe..1e9357576f2d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -36,6 +36,7 @@
 #include
 #include
 #include
+#include

 #include
 #include
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 2db4b4116a28..165ebbbac30d 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -32,6 +32,7 @@ static LIST_HEAD(memory_tiers);
 static struct memory_dev_type *node_memory_types[MAX_NUMNODES];
 static struct memory_dev_type *default_dram_type;
 #ifdef CONFIG_MIGRATION
+static int top_tier_adistance;
 /*
  * node_demotion[] examples:
  *
@@ -157,6 +158,31 @@ static struct memory_tier *__node_get_memory_tier(int node)
 }

 #ifdef CONFIG_MIGRATION
+bool node_is_toptier(int node)
+{
+	bool toptier;
+	pg_data_t *pgdat;
+	struct memory_tier *memtier;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return false;
+
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (!memtier) {
+		toptier = true;
+		goto out;
+	}
+	if (memtier->adistance_start < top_tier_adistance)
+		toptier = true;
+	else
+		toptier = false;
+out:
+	rcu_read_unlock();
+	return toptier;
+}
+
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
 {
 	struct memory_tier *memtier;
@@ -314,6 +340,26 @@ static void establish_demotion_targets(void)
 			}
 		} while (1);
 	}
+	/*
+	 * Promotion is allowed from a memory tier to higher
+	 * memory tier only if the memory tier doesn't include
+	 * compute. We want to skip promotion from a memory tier,
+	 * if any node that is part of the memory tier have CPUs.
+	 * Once we detect such a memory tier, we consider that tier
+	 * as top tiper from which promotion is not allowed.
+	 */
+	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
+		tier_nodes = get_memtier_nodemask(memtier);
+		nodes_and(tier_nodes, node_states[N_CPU], tier_nodes);
+		if (!nodes_empty(tier_nodes)) {
+			/*
+			 * abstract distance below the max value of this memtier
+			 * is considered toptier.
+			 */
+			top_tier_adistance = memtier->adistance_start + MEMTIER_CHUNK_SIZE;
+			break;
+		}
+	}
 	/*
 	 * Now build the lower_tier mask for each node collecting node mask from
 	 * all memory tier below it. This allows us to fallback demotion page
diff --git a/mm/migrate.c b/mm/migrate.c
index ea86594f4bc5..55e7718cfe45 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -50,6 +50,7 @@
 #include
 #include
 #include
+#include

 #include

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 3a23dde73723..61cd80831b04 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -31,6 +31,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include

From patchwork Fri Aug 12 05:57:09 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12941920
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss@gmail.com, Bharata B Rao,
	"Aneesh Kumar K.V"
Subject: [PATCH v14 10/10] lib/nodemask: Optimize node_random for nodemask with single NUMA node
Date: Fri, 12 Aug 2022 11:27:09 +0530
Message-Id: <20220812055710.357820-11-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.37.1
In-Reply-To: <20220812055710.357820-1-aneesh.kumar@linux.ibm.com>
References: <20220812055710.357820-1-aneesh.kumar@linux.ibm.com>
MIME-Version: 1.0

The most common case for certain node_random usage (demotion nodemask) is
with nodemask weight 1.
We can avoid calling get_random_int() in that case and always return the only
node set in the nodemask.

A simple test as below:

  before = rdtsc_ordered();
  for (i = 0; i < 100; i++) {
          rand = node_random(&nmask);
  }
  after = rdtsc_ordered();

Without fix after - before : 16438
With fix after - before : 816

Signed-off-by: Aneesh Kumar K.V
---
 include/linux/nodemask.h | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 4b71a96190a8..ac5b6a371be5 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -504,12 +504,21 @@ static inline int num_node_state(enum node_states state)
 static inline int node_random(const nodemask_t *maskp)
 {
 #if defined(CONFIG_NUMA) && (MAX_NUMNODES > 1)
-	int w, bit = NUMA_NO_NODE;
+	int w, bit;
 
 	w = nodes_weight(*maskp);
-	if (w)
+	switch (w) {
+	case 0:
+		bit = NUMA_NO_NODE;
+		break;
+	case 1:
+		bit = first_node(*maskp);
+		break;
+	default:
 		bit = bitmap_ord_to_pos(maskp->bits,
-			get_random_int() % w, MAX_NUMNODES);
+					get_random_int() % w, MAX_NUMNODES);
+		break;
+	}
 	return bit;
 #else
 	return 0;