From patchwork Fri May 27 12:25:22 2022
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12863320
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes, Aneesh Kumar K.V
Subject: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers
Date: Fri, 27 May 2022 17:55:22 +0530
Message-Id: <20220527122528.129445-2-aneesh.kumar@linux.ibm.com>
In-Reply-To: <20220527122528.129445-1-aneesh.kumar@linux.ibm.com>
References: <20220527122528.129445-1-aneesh.kumar@linux.ibm.com>

From: Jagdish Gediya

In the current kernel, memory tiers are defined implicitly via a demotion path relationship between NUMA nodes, which is created during kernel initialization and updated when a NUMA node is hot-added or hot-removed. The current implementation puts all nodes with CPUs into the top tier, and builds the tier hierarchy tier-by-tier by establishing the per-node demotion targets based on the distances between nodes.

This current memory tier kernel interface needs to be improved for several important use cases:

The current tier initialization code always initializes each memory-only NUMA node into a lower tier. But a memory-only NUMA node may have a high performance memory device (e.g. a DRAM device attached via CXL.mem, or a DRAM-backed memory-only node on a virtual machine) and should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top tier.
But on a system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices should be in the top tier, and DRAM nodes with CPUs are better placed into the next lower tier.

With the current kernel, a higher-tier node can only be demoted to selected nodes on the next lower tier, as defined by the demotion path, not to any other node from any lower tier. This strict, hard-coded demotion order does not work in all use cases (e.g. some use cases may want to allow cross-socket demotion to another node in the same demotion tier as a fallback when the preferred demotion node is out of space). This demotion order is also inconsistent with the page allocation fallback order when all the nodes in a higher tier are out of space: the page allocation can fall back to any node from any lower tier, whereas the demotion order doesn't allow that.

The current kernel also doesn't provide any interfaces for userspace to learn about the memory tier hierarchy in order to optimize its memory allocations.

This patch series addresses the above by defining memory tiers explicitly.

This patch adds the sysfs interface below, which is read-only and can be used to read the nodes available in a specific tier:

/sys/devices/system/memtier/memtierN/nodelist

Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the lowest tier. The absolute value of a tier id number has no specific meaning; what matters is the relative order of the tier id numbers.

All the tiered memory code is guarded by CONFIG_TIERED_MEMORY. The default number of memory tiers is MAX_MEMORY_TIERS (3). All nodes are by default assigned to DEFAULT_MEMORY_TIER (1).

The default memory tier can be read from:

/sys/devices/system/memtier/default_tier

The maximum number of memory tiers can be read from:

/sys/devices/system/memtier/max_tiers

This patch implements the RFC spec sent by Wei Xu at [1].
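As an illustration only (not part of the patch), on a hypothetical machine where nodes 0-1 hold DRAM and sit in the default tier, the interface described above could be exercised as follows; the node list and node numbers are assumptions, while the other values follow from the defaults in this patch:

  $ cat /sys/devices/system/memtier/max_tiers
  3
  $ cat /sys/devices/system/memtier/default_tier
  1
  $ cat /sys/devices/system/memtier/memtier1/nodelist
  0-1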
[1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/ Signed-off-by: Jagdish Gediya Signed-off-by: Aneesh Kumar K.V --- include/linux/migrate.h | 38 ++++++++---- mm/Kconfig | 11 ++++ mm/migrate.c | 134 ++++++++++++++++++++++++++++++++++++++++ 3 files changed, 170 insertions(+), 13 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 90e75d5a54d6..0ec653623565 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -47,17 +47,8 @@ void folio_migrate_copy(struct folio *newfolio, struct folio *folio); int folio_migrate_mapping(struct address_space *mapping, struct folio *newfolio, struct folio *folio, int extra_count); -extern bool numa_demotion_enabled; -extern void migrate_on_reclaim_init(void); -#ifdef CONFIG_HOTPLUG_CPU -extern void set_migration_target_nodes(void); -#else -static inline void set_migration_target_nodes(void) {} -#endif #else -static inline void set_migration_target_nodes(void) {} - static inline void putback_movable_pages(struct list_head *l) {} static inline int migrate_pages(struct list_head *l, new_page_t new, free_page_t free, unsigned long private, enum migrate_mode mode, @@ -82,7 +73,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, return -ENOSYS; } -#define numa_demotion_enabled false #endif /* CONFIG_MIGRATION */ #ifdef CONFIG_COMPACTION @@ -172,15 +162,37 @@ struct migrate_vma { int migrate_vma_setup(struct migrate_vma *args); void migrate_vma_pages(struct migrate_vma *migrate); void migrate_vma_finalize(struct migrate_vma *migrate); -int next_demotion_node(int node); +#endif /* CONFIG_MIGRATION */ + +#ifdef CONFIG_TIERED_MEMORY + +extern bool numa_demotion_enabled; +#define DEFAULT_MEMORY_TIER 1 + +enum memory_tier_type { + MEMORY_TIER_HBM_GPU, + MEMORY_TIER_DRAM, + MEMORY_TIER_PMEM, + MAX_MEMORY_TIERS +}; -#else /* CONFIG_MIGRATION disabled: */ +int next_demotion_node(int node); +extern void migrate_on_reclaim_init(void); +#ifdef CONFIG_HOTPLUG_CPU +extern void set_migration_target_nodes(void); +#else +static inline void set_migration_target_nodes(void) {} +#endif +#else +#define numa_demotion_enabled false static inline int next_demotion_node(int node) { return NUMA_NO_NODE; } -#endif /* CONFIG_MIGRATION */ +static inline void set_migration_target_nodes(void) {} +static inline void migrate_on_reclaim_init(void) {} +#endif /* CONFIG_TIERED_MEMORY */ #endif /* _LINUX_MIGRATE_H */ diff --git a/mm/Kconfig b/mm/Kconfig index 034d87953600..7bfbddef46ed 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -258,6 +258,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION config ARCH_ENABLE_THP_MIGRATION bool +config TIERED_MEMORY + bool "Support for explicit memory tiers" + def_bool y + depends on MIGRATION && NUMA + help + Support to split nodes into memory tiers explicitly and + to demote pages on reclaim to lower tiers. This option + also exposes sysfs interface to read nodes available in + specific tier and to move specific node among different + possible tiers. 
+ config HUGETLB_PAGE_SIZE_VARIABLE def_bool n help diff --git a/mm/migrate.c b/mm/migrate.c index 6c31ee1e1c9b..f28ee93fb017 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2118,6 +2118,113 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, #endif /* CONFIG_NUMA_BALANCING */ #endif /* CONFIG_NUMA */ +#ifdef CONFIG_TIERED_MEMORY + +struct memory_tier { + struct device dev; + nodemask_t nodelist; +}; + +#define to_memory_tier(device) container_of(device, struct memory_tier, dev) + +static struct bus_type memory_tier_subsys = { + .name = "memtier", + .dev_name = "memtier", +}; + +static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS]; + +static ssize_t nodelist_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + int tier = dev->id; + + return sysfs_emit(buf, "%*pbl\n", + nodemask_pr_args(&memory_tiers[tier]->nodelist)); + +} +static DEVICE_ATTR_RO(nodelist); + +static struct attribute *memory_tier_dev_attrs[] = { + &dev_attr_nodelist.attr, + NULL +}; + +static const struct attribute_group memory_tier_dev_group = { + .attrs = memory_tier_dev_attrs, +}; + +static const struct attribute_group *memory_tier_dev_groups[] = { + &memory_tier_dev_group, + NULL +}; + +static void memory_tier_device_release(struct device *dev) +{ + struct memory_tier *tier = to_memory_tier(dev); + + kfree(tier); +} + +static int register_memory_tier(int tier) +{ + int error; + + memory_tiers[tier] = kzalloc(sizeof(struct memory_tier), GFP_KERNEL); + if (!memory_tiers[tier]) + return -ENOMEM; + + memory_tiers[tier]->dev.id = tier; + memory_tiers[tier]->dev.bus = &memory_tier_subsys; + memory_tiers[tier]->dev.release = memory_tier_device_release; + memory_tiers[tier]->dev.groups = memory_tier_dev_groups; + error = device_register(&memory_tiers[tier]->dev); + + if (error) { + put_device(&memory_tiers[tier]->dev); + memory_tiers[tier] = NULL; + } + + return error; +} + +static void unregister_memory_tier(int tier) +{ + device_unregister(&memory_tiers[tier]->dev); + memory_tiers[tier] = NULL; +} + +static ssize_t +max_tiers_show(struct device *dev, struct device_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS); +} + +static DEVICE_ATTR_RO(max_tiers); + +static ssize_t +default_tier_show(struct device *dev, struct device_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%d\n", DEFAULT_MEMORY_TIER); +} + +static DEVICE_ATTR_RO(default_tier); + +static struct attribute *memoty_tier_attrs[] = { + &dev_attr_max_tiers.attr, + &dev_attr_default_tier.attr, + NULL +}; + +static const struct attribute_group memory_tier_attr_group = { + .attrs = memoty_tier_attrs, +}; + +static const struct attribute_group *memory_tier_attr_groups[] = { + &memory_tier_attr_group, + NULL, +}; + /* * node_demotion[] example: * @@ -2569,3 +2676,30 @@ static int __init numa_init_sysfs(void) } subsys_initcall(numa_init_sysfs); #endif + +static int __init memory_tier_init(void) +{ + int ret; + + ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups); + if (ret) + panic("%s() failed to register subsystem: %d\n", __func__, ret); + + /* + * Register only default memory tier to hide all empty + * memory tier from sysfs. + */ + ret = register_memory_tier(DEFAULT_MEMORY_TIER); + if (ret) + panic("%s() failed to register memory tier: %d\n", __func__, ret); + + /* + * CPU only nodes are not part of memoty tiers. 
+ */ + memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY]; + + return 0; +} +subsys_initcall(memory_tier_init); + +#endif /* CONFIG_TIERED_MEMORY */

From patchwork Fri May 27 12:25:23 2022
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12863321
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes, Aneesh Kumar K.V
Subject: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs
Date: Fri, 27 May 2022 17:55:23 +0530
Message-Id: <20220527122528.129445-3-aneesh.kumar@linux.ibm.com>
In-Reply-To: <20220527122528.129445-1-aneesh.kumar@linux.ibm.com>
References: <20220527122528.129445-1-aneesh.kumar@linux.ibm.com>

From: Jagdish Gediya

Add support to read/write the memory tier index for a NUMA node:

/sys/devices/system/node/nodeN/memtier

where N = node id.

When read, it lists the memory tier that the node belongs to. When written, the kernel moves the node into the specified memory tier; the tier assignment of all other nodes is not affected. If the memory tier does not exist, writing to the above file creates the tier and assigns the NUMA node to that tier.

The mutex memory_tier_lock is introduced to protect memory tier related changes, since they can happen both via sysfs and on hotplug events.
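For example (illustrative only, the node number is an assumption), moving a memory-only node 2 from the default tier to tier 2 via the new per-node attribute could look like this:

  $ cat /sys/devices/system/node/node2/memtier
  1
  $ echo 2 > /sys/devices/system/node/node2/memtier
  $ cat /sys/devices/system/node/node2/memtier
  2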
Signed-off-by: Jagdish Gediya Signed-off-by: Aneesh Kumar K.V --- drivers/base/node.c | 35 ++++++++++++++ include/linux/migrate.h | 4 +- mm/migrate.c | 103 ++++++++++++++++++++++++++++++++++++++++ 3 files changed, 141 insertions(+), 1 deletion(-) diff --git a/drivers/base/node.c b/drivers/base/node.c index ec8bb24a5a22..cf4a58446d8c 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -20,6 +20,7 @@ #include #include #include +#include static struct bus_type node_subsys = { .name = "node", @@ -560,11 +561,45 @@ static ssize_t node_read_distance(struct device *dev, } static DEVICE_ATTR(distance, 0444, node_read_distance, NULL); +#ifdef CONFIG_TIERED_MEMORY +static ssize_t memtier_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + int node = dev->id; + + return sysfs_emit(buf, "%d\n", node_get_memory_tier(node)); +} + +static ssize_t memtier_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + unsigned long tier; + int node = dev->id; + + int ret = kstrtoul(buf, 10, &tier); + if (ret) + return ret; + + ret = node_reset_memory_tier(node, tier); + if (ret) + return ret; + + return count; +} + +static DEVICE_ATTR_RW(memtier); +#endif + static struct attribute *node_dev_attrs[] = { &dev_attr_meminfo.attr, &dev_attr_numastat.attr, &dev_attr_distance.attr, &dev_attr_vmstat.attr, +#ifdef CONFIG_TIERED_MEMORY + &dev_attr_memtier.attr, +#endif NULL }; diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 0ec653623565..d37d1d5dee82 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -177,13 +177,15 @@ enum memory_tier_type { }; int next_demotion_node(int node); - extern void migrate_on_reclaim_init(void); #ifdef CONFIG_HOTPLUG_CPU extern void set_migration_target_nodes(void); #else static inline void set_migration_target_nodes(void) {} #endif +int node_get_memory_tier(int node); +int node_set_memory_tier(int node, int tier); +int node_reset_memory_tier(int node, int tier); #else #define numa_demotion_enabled false static inline int next_demotion_node(int node) diff --git a/mm/migrate.c b/mm/migrate.c index f28ee93fb017..304559ba3372 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2132,6 +2132,7 @@ static struct bus_type memory_tier_subsys = { .dev_name = "memtier", }; +DEFINE_MUTEX(memory_tier_lock); static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS]; static ssize_t nodelist_show(struct device *dev, @@ -2225,6 +2226,108 @@ static const struct attribute_group *memory_tier_attr_groups[] = { NULL, }; +static int __node_get_memory_tier(int node) +{ + int tier; + + for (tier = 0; tier < MAX_MEMORY_TIERS; tier++) { + if (memory_tiers[tier] && node_isset(node, memory_tiers[tier]->nodelist)) + return tier; + } + + return -1; +} + +int node_get_memory_tier(int node) +{ + int tier; + + /* + * Make sure memory tier is not unregistered + * while it is being read. + */ + mutex_lock(&memory_tier_lock); + + tier = __node_get_memory_tier(node); + + mutex_unlock(&memory_tier_lock); + + return tier; +} + +int __node_set_memory_tier(int node, int tier) +{ + int ret = 0; + /* + * As register_memory_tier() for new tier can fail, + * try it before modifying existing tier. register + * tier makes tier visible in sysfs. 
+ */ + if (!memory_tiers[tier]) { + ret = register_memory_tier(tier); + if (ret) { + goto out; + } + } + + node_set(node, memory_tiers[tier]->nodelist); + +out: + return ret; +} + +int node_reset_memory_tier(int node, int tier) +{ + int current_tier, ret = 0; + + mutex_lock(&memory_tier_lock); + + current_tier = __node_get_memory_tier(node); + if (current_tier == tier) + goto out; + + if (current_tier != -1 ) + node_clear(node, memory_tiers[current_tier]->nodelist); + + ret = __node_set_memory_tier(node, tier); + + if (!ret) { + if (nodes_empty(memory_tiers[current_tier]->nodelist)) + unregister_memory_tier(current_tier); + } else { + /* reset it back to older tier */ + ret = __node_set_memory_tier(node, current_tier); + } +out: + mutex_unlock(&memory_tier_lock); + + return ret; +} + +int node_set_memory_tier(int node, int tier) +{ + int current_tier, ret = 0; + + if (tier >= MAX_MEMORY_TIERS) + return -EINVAL; + + mutex_lock(&memory_tier_lock); + current_tier = __node_get_memory_tier(node); + /* + * if node is already part of the tier proceed with the + * current tier value, because we might want to establish + * new migration paths now. The node might be added to a tier + * before it was made part of N_MEMORY, hence estabilish_migration_targets + * will have skipped this node. + */ + if (current_tier != -1) + tier = current_tier; + ret = __node_set_memory_tier(node, tier); + mutex_unlock(&memory_tier_lock); + + return ret; +} + /* * node_demotion[] example: *

From patchwork Fri May 27 12:25:24 2022
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12863322
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes, Aneesh Kumar K.V
Subject: [RFC PATCH v4 3/7] mm/demotion: Build demotion targets based on explicit memory tiers
Date: Fri, 27 May 2022 17:55:24 +0530
Message-Id: <20220527122528.129445-4-aneesh.kumar@linux.ibm.com>
In-Reply-To: <20220527122528.129445-1-aneesh.kumar@linux.ibm.com>
References: <20220527122528.129445-1-aneesh.kumar@linux.ibm.com>

From: Jagdish Gediya

This patch switches the demotion target building logic to use memory tiers instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the default tier 1, and additional memory tiers will be added by drivers like dax kmem.

This patch builds the demotion targets for a NUMA node by looking at all memory tiers below the tier to which the NUMA node belongs. The closest node in the immediately following memory tier is used as a demotion target.

Since we now only build demotion targets for N_MEMORY NUMA nodes, the CPU hotplug calls are removed in this patch.
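As a worked illustration of the logic described above (an assumed topology, shown in the same style as the node_demotion[] examples added by this patch, not taken from the patch itself):

  Node 0 is a CPU + DRAM node, node 1 is a PMEM node.

  node distances:
  node   0    1
     0  10   20
     1  20   10

  memory_tiers[0] =
  memory_tiers[1] = 0
  memory_tiers[2] = 1

  node_demotion[0].preferred = 1
  node_demotion[1].preferred =

Node 0 sits in tier 1, the next populated lower tier is tier 2, so node 1 becomes its preferred demotion target; node 1 has no populated tier below it and therefore gets no target.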
Signed-off-by: Jagdish Gediya Signed-off-by: Aneesh Kumar K.V Reported-by: kernel test robot --- include/linux/migrate.h | 8 - mm/migrate.c | 460 +++++++++++++++------------------------- mm/vmstat.c | 5 - 3 files changed, 172 insertions(+), 301 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index d37d1d5dee82..cbef71a499c1 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -177,12 +177,6 @@ enum memory_tier_type { }; int next_demotion_node(int node); -extern void migrate_on_reclaim_init(void); -#ifdef CONFIG_HOTPLUG_CPU -extern void set_migration_target_nodes(void); -#else -static inline void set_migration_target_nodes(void) {} -#endif int node_get_memory_tier(int node); int node_set_memory_tier(int node, int tier); int node_reset_memory_tier(int node, int tier); @@ -193,8 +187,6 @@ static inline int next_demotion_node(int node) return NUMA_NO_NODE; } -static inline void set_migration_target_nodes(void) {} -static inline void migrate_on_reclaim_init(void) {} #endif /* CONFIG_TIERED_MEMORY */ #endif /* _LINUX_MIGRATE_H */ diff --git a/mm/migrate.c b/mm/migrate.c index 304559ba3372..d819a64db5b1 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2125,6 +2125,10 @@ struct memory_tier { nodemask_t nodelist; }; +struct demotion_nodes { + nodemask_t preferred; +}; + #define to_memory_tier(device) container_of(device, struct memory_tier, dev) static struct bus_type memory_tier_subsys = { @@ -2132,9 +2136,73 @@ static struct bus_type memory_tier_subsys = { .dev_name = "memtier", }; +static void establish_migration_targets(void); + DEFINE_MUTEX(memory_tier_lock); static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS]; +/* + * node_demotion[] examples: + * + * Example 1: + * + * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes. + * + * node distances: + * node 0 1 2 3 + * 0 10 20 30 40 + * 1 20 10 40 30 + * 2 30 40 10 40 + * 3 40 30 40 10 + * + * memory_tiers[0] = + * memory_tiers[1] = 0-1 + * memory_tiers[2] = 2-3 + * + * node_demotion[0].preferred = 2 + * node_demotion[1].preferred = 3 + * node_demotion[2].preferred = + * node_demotion[3].preferred = + * + * Example 2: + * + * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node. + * + * node distances: + * node 0 1 2 + * 0 10 20 30 + * 1 20 10 30 + * 2 30 30 10 + * + * memory_tiers[0] = + * memory_tiers[1] = 0-2 + * memory_tiers[2] = + * + * node_demotion[0].preferred = + * node_demotion[1].preferred = + * node_demotion[2].preferred = + * + * Example 3: + * + * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node. + * + * node distances: + * node 0 1 2 + * 0 10 20 30 + * 1 20 10 40 + * 2 30 40 10 + * + * memory_tiers[0] = 1 + * memory_tiers[1] = 0 + * memory_tiers[2] = 2 + * + * node_demotion[0].preferred = 2 + * node_demotion[1].preferred = 0 + * node_demotion[2].preferred = + * + */ +static struct demotion_nodes *node_demotion __read_mostly; + static ssize_t nodelist_show(struct device *dev, struct device_attribute *attr, char *buf) { @@ -2238,6 +2306,28 @@ static int __node_get_memory_tier(int node) return -1; } +static void node_remove_from_memory_tier(int node) +{ + int tier; + + mutex_lock(&memory_tier_lock); + + tier = __node_get_memory_tier(node); + + /* + * Remove node from tier, if tier becomes + * empty then unregister it to make it invisible + * in sysfs. 
+ */ + node_clear(node, memory_tiers[tier]->nodelist); + if (nodes_empty(memory_tiers[tier]->nodelist)) + unregister_memory_tier(tier); + + establish_migration_targets(); + + mutex_unlock(&memory_tier_lock); +} + int node_get_memory_tier(int node) { int tier; @@ -2271,6 +2361,7 @@ int __node_set_memory_tier(int node, int tier) } node_set(node, memory_tiers[tier]->nodelist); + establish_migration_targets(); out: return ret; @@ -2328,75 +2419,6 @@ int node_set_memory_tier(int node, int tier) return ret; } -/* - * node_demotion[] example: - * - * Consider a system with two sockets. Each socket has - * three classes of memory attached: fast, medium and slow. - * Each memory class is placed in its own NUMA node. The - * CPUs are placed in the node with the "fast" memory. The - * 6 NUMA nodes (0-5) might be split among the sockets like - * this: - * - * Socket A: 0, 1, 2 - * Socket B: 3, 4, 5 - * - * When Node 0 fills up, its memory should be migrated to - * Node 1. When Node 1 fills up, it should be migrated to - * Node 2. The migration path start on the nodes with the - * processors (since allocations default to this node) and - * fast memory, progress through medium and end with the - * slow memory: - * - * 0 -> 1 -> 2 -> stop - * 3 -> 4 -> 5 -> stop - * - * This is represented in the node_demotion[] like this: - * - * { nr=1, nodes[0]=1 }, // Node 0 migrates to 1 - * { nr=1, nodes[0]=2 }, // Node 1 migrates to 2 - * { nr=0, nodes[0]=-1 }, // Node 2 does not migrate - * { nr=1, nodes[0]=4 }, // Node 3 migrates to 4 - * { nr=1, nodes[0]=5 }, // Node 4 migrates to 5 - * { nr=0, nodes[0]=-1 }, // Node 5 does not migrate - * - * Moreover some systems may have multiple slow memory nodes. - * Suppose a system has one socket with 3 memory nodes, node 0 - * is fast memory type, and node 1/2 both are slow memory - * type, and the distance between fast memory node and slow - * memory node is same. So the migration path should be: - * - * 0 -> 1/2 -> stop - * - * This is represented in the node_demotion[] like this: - * { nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2 - * { nr=0, nodes[0]=-1, }, // Node 1 dose not migrate - * { nr=0, nodes[0]=-1, }, // Node 2 does not migrate - */ - -/* - * Writes to this array occur without locking. Cycles are - * not allowed: Node X demotes to Y which demotes to X... - * - * If multiple reads are performed, a single rcu_read_lock() - * must be held over all reads to ensure that no cycles are - * observed. - */ -#define DEFAULT_DEMOTION_TARGET_NODES 15 - -#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES -#define DEMOTION_TARGET_NODES (MAX_NUMNODES - 1) -#else -#define DEMOTION_TARGET_NODES DEFAULT_DEMOTION_TARGET_NODES -#endif - -struct demotion_nodes { - unsigned short nr; - short nodes[DEMOTION_TARGET_NODES]; -}; - -static struct demotion_nodes *node_demotion __read_mostly; - /** * next_demotion_node() - Get the next node in the demotion path * @node: The starting node to lookup the next node @@ -2409,8 +2431,7 @@ static struct demotion_nodes *node_demotion __read_mostly; int next_demotion_node(int node) { struct demotion_nodes *nd; - unsigned short target_nr, index; - int target; + int target, nnodes, i; if (!node_demotion) return NUMA_NO_NODE; @@ -2419,61 +2440,46 @@ int next_demotion_node(int node) /* * node_demotion[] is updated without excluding this - * function from running. RCU doesn't provide any - * compiler barriers, so the READ_ONCE() is required - * to avoid compiler reordering or read merging. + * function from running. 
* * Make sure to use RCU over entire code blocks if * node_demotion[] reads need to be consistent. */ rcu_read_lock(); - target_nr = READ_ONCE(nd->nr); - switch (target_nr) { - case 0: - target = NUMA_NO_NODE; - goto out; - case 1: - index = 0; - break; - default: - /* - * If there are multiple target nodes, just select one - * target node randomly. - * - * In addition, we can also use round-robin to select - * target node, but we should introduce another variable - * for node_demotion[] to record last selected target node, - * that may cause cache ping-pong due to the changing of - * last target node. Or introducing per-cpu data to avoid - * caching issue, which seems more complicated. So selecting - * target node randomly seems better until now. - */ - index = get_random_int() % target_nr; - break; - } + nnodes = nodes_weight(nd->preferred); + if (!nnodes) + return NUMA_NO_NODE; - target = READ_ONCE(nd->nodes[index]); + /* + * If there are multiple target nodes, just select one + * target node randomly. + * + * In addition, we can also use round-robin to select + * target node, but we should introduce another variable + * for node_demotion[] to record last selected target node, + * that may cause cache ping-pong due to the changing of + * last target node. Or introducing per-cpu data to avoid + * caching issue, which seems more complicated. So selecting + * target node randomly seems better until now. + */ + nnodes = get_random_int() % nnodes; + target = first_node(nd->preferred); + for (i = 0; i < nnodes; i++) + target = next_node(target, nd->preferred); -out: rcu_read_unlock(); + return target; } -#if defined(CONFIG_HOTPLUG_CPU) /* Disable reclaim-based migration. */ static void __disable_all_migrate_targets(void) { - int node, i; + int node; - if (!node_demotion) - return; - - for_each_online_node(node) { - node_demotion[node].nr = 0; - for (i = 0; i < DEMOTION_TARGET_NODES; i++) - node_demotion[node].nodes[i] = NUMA_NO_NODE; - } + for_each_node_mask(node, node_states[N_MEMORY]) + node_demotion[node].preferred = NODE_MASK_NONE; } static void disable_all_migrate_targets(void) @@ -2485,173 +2491,70 @@ static void disable_all_migrate_targets(void) * Readers will see either a combination of before+disable * state or disable+after. They will never see before and * after state together. - * - * The before+after state together might have cycles and - * could cause readers to do things like loop until this - * function finishes. This ensures they can only see a - * single "bad" read and would, for instance, only loop - * once. */ synchronize_rcu(); } /* - * Find an automatic demotion target for 'node'. - * Failing here is OK. It might just indicate - * being at the end of a chain. - */ -static int establish_migrate_target(int node, nodemask_t *used, - int best_distance) +* Find an automatic demotion target for all memory +* nodes. Failing here is OK. It might just indicate +* being at the end of a chain. +*/ +static void establish_migration_targets(void) { - int migration_target, index, val; struct demotion_nodes *nd; + int tier, target = NUMA_NO_NODE, node; + int distance, best_distance; + nodemask_t used; if (!node_demotion) - return NUMA_NO_NODE; - - nd = &node_demotion[node]; - - migration_target = find_next_best_node(node, used); - if (migration_target == NUMA_NO_NODE) - return NUMA_NO_NODE; - - /* - * If the node has been set a migration target node before, - * which means it's the best distance between them. 
Still - * check if this node can be demoted to other target nodes - * if they have a same best distance. - */ - if (best_distance != -1) { - val = node_distance(node, migration_target); - if (val > best_distance) - goto out_clear; - } - - index = nd->nr; - if (WARN_ONCE(index >= DEMOTION_TARGET_NODES, - "Exceeds maximum demotion target nodes\n")) - goto out_clear; - - nd->nodes[index] = migration_target; - nd->nr++; + return; - return migration_target; -out_clear: - node_clear(migration_target, *used); - return NUMA_NO_NODE; -} + disable_all_migrate_targets(); -/* - * When memory fills up on a node, memory contents can be - * automatically migrated to another node instead of - * discarded at reclaim. - * - * Establish a "migration path" which will start at nodes - * with CPUs and will follow the priorities used to build the - * page allocator zonelists. - * - * The difference here is that cycles must be avoided. If - * node0 migrates to node1, then neither node1, nor anything - * node1 migrates to can migrate to node0. Also one node can - * be migrated to multiple nodes if the target nodes all have - * a same best-distance against the source node. - * - * This function can run simultaneously with readers of - * node_demotion[]. However, it can not run simultaneously - * with itself. Exclusion is provided by memory hotplug events - * being single-threaded. - */ -static void __set_migration_target_nodes(void) -{ - nodemask_t next_pass = NODE_MASK_NONE; - nodemask_t this_pass = NODE_MASK_NONE; - nodemask_t used_targets = NODE_MASK_NONE; - int node, best_distance; + for_each_node_mask(node, node_states[N_MEMORY]) { + best_distance = -1; + nd = &node_demotion[node]; - /* - * Avoid any oddities like cycles that could occur - * from changes in the topology. This will leave - * a momentary gap when migration is disabled. - */ - disable_all_migrate_targets(); + tier = __node_get_memory_tier(node); + /* + * Find next tier to demote. + */ + while (++tier < MAX_MEMORY_TIERS) { + if (memory_tiers[tier]) + break; + } - /* - * Allocations go close to CPUs, first. Assume that - * the migration path starts at the nodes with CPUs. - */ - next_pass = node_states[N_CPU]; -again: - this_pass = next_pass; - next_pass = NODE_MASK_NONE; - /* - * To avoid cycles in the migration "graph", ensure - * that migration sources are not future targets by - * setting them in 'used_targets'. Do this only - * once per pass so that multiple source nodes can - * share a target node. - * - * 'used_targets' will become unavailable in future - * passes. This limits some opportunities for - * multiple source nodes to share a destination. - */ - nodes_or(used_targets, used_targets, this_pass); + if (tier >= MAX_MEMORY_TIERS) + continue; - for_each_node_mask(node, this_pass) { - best_distance = -1; + nodes_andnot(used, node_states[N_MEMORY], memory_tiers[tier]->nodelist); /* - * Try to set up the migration path for the node, and the target - * migration nodes can be multiple, so doing a loop to find all - * the target nodes if they all have a best node distance. + * Find all the nodes in the memory tier node list of same best distance. + * add add them to the preferred mask. We randomly select between nodes + * in the preferred mask when allocating pages during demotion. 
*/ do { - int target_node = - establish_migrate_target(node, &used_targets, - best_distance); - - if (target_node == NUMA_NO_NODE) + target = find_next_best_node(node, &used); + if (target == NUMA_NO_NODE) break; - if (best_distance == -1) - best_distance = node_distance(node, target_node); - - /* - * Visit targets from this pass in the next pass. - * Eventually, every node will have been part of - * a pass, and will become set in 'used_targets'. - */ - node_set(target_node, next_pass); + distance = node_distance(node, target); + if (distance == best_distance || best_distance == -1) { + best_distance = distance; + node_set(target, nd->preferred); + } else { + break; + } } while (1); } - /* - * 'next_pass' contains nodes which became migration - * targets in this pass. Make additional passes until - * no more migrations targets are available. - */ - if (!nodes_empty(next_pass)) - goto again; } /* - * For callers that do not hold get_online_mems() already. - */ -void set_migration_target_nodes(void) -{ - get_online_mems(); - __set_migration_target_nodes(); - put_online_mems(); -} - -/* - * This leaves migrate-on-reclaim transiently disabled between - * the MEM_GOING_OFFLINE and MEM_OFFLINE events. This runs - * whether reclaim-based migration is enabled or not, which - * ensures that the user can turn reclaim-based migration at - * any time without needing to recalculate migration targets. - * - * These callbacks already hold get_online_mems(). That is why - * __set_migration_target_nodes() can be used as opposed to - * set_migration_target_nodes(). + * This runs whether reclaim-based migration is enabled or not, + * which ensures that the user can turn reclaim-based migration + * at any time without needing to recalculate migration targets. */ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, unsigned long action, void *_arg) @@ -2660,64 +2563,44 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, /* * Only update the node migration order when a node is - * changing status, like online->offline. This avoids - * the overhead of synchronize_rcu() in most cases. + * changing status, like online->offline. */ if (arg->status_change_nid < 0) return notifier_from_errno(0); switch (action) { - case MEM_GOING_OFFLINE: - /* - * Make sure there are not transient states where - * an offline node is a migration target. This - * will leave migration disabled until the offline - * completes and the MEM_OFFLINE case below runs. - */ - disable_all_migrate_targets(); - break; case MEM_OFFLINE: - case MEM_ONLINE: /* - * Recalculate the target nodes once the node - * reaches its final state (online or offline). + * In case we are moving out of N_MEMORY. Keep the node + * in the memory tier so that when we bring memory online, + * they appear in the right memory tier. We still need + * to rebuild the demotion order. */ - __set_migration_target_nodes(); + mutex_lock(&memory_tier_lock); + establish_migration_targets(); + mutex_unlock(&memory_tier_lock); break; - case MEM_CANCEL_OFFLINE: + case MEM_ONLINE: /* - * MEM_GOING_OFFLINE disabled all the migration - * targets. Reenable them. + * We ignore the error here, if the node already have the tier + * registered, we will continue to use that for the new memory + * we are adding here. 
*/ - __set_migration_target_nodes(); - break; - case MEM_GOING_ONLINE: - case MEM_CANCEL_ONLINE: + node_set_memory_tier(arg->status_change_nid, DEFAULT_MEMORY_TIER); break; } return notifier_from_errno(0); } -void __init migrate_on_reclaim_init(void) +static void __init migrate_on_reclaim_init(void) { - node_demotion = kmalloc_array(nr_node_ids, - sizeof(struct demotion_nodes), - GFP_KERNEL); + node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes), + GFP_KERNEL); WARN_ON(!node_demotion); hotplug_memory_notifier(migrate_on_reclaim_callback, 100); - /* - * At this point, all numa nodes with memory/CPus have their state - * properly set, so we can build the demotion order now. - * Let us hold the cpu_hotplug lock just, as we could possibily have - * CPU hotplug events during boot. - */ - cpus_read_lock(); - set_migration_target_nodes(); - cpus_read_unlock(); } -#endif /* CONFIG_HOTPLUG_CPU */ bool numa_demotion_enabled = false; @@ -2800,6 +2683,7 @@ static int __init memory_tier_init(void) * CPU only nodes are not part of memoty tiers. */ memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY]; + migrate_on_reclaim_init(); return 0; } diff --git a/mm/vmstat.c b/mm/vmstat.c index b75b1a64b54c..7815d21345a4 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -2053,7 +2053,6 @@ static int vmstat_cpu_online(unsigned int cpu) if (!node_state(cpu_to_node(cpu), N_CPU)) { node_set_state(cpu_to_node(cpu), N_CPU); - set_migration_target_nodes(); } return 0; @@ -2078,7 +2077,6 @@ static int vmstat_cpu_dead(unsigned int cpu) return 0; node_clear_state(node, N_CPU); - set_migration_target_nodes(); return 0; } @@ -2111,9 +2109,6 @@ void __init init_mm_internals(void) start_shepherd_timer(); #endif -#if defined(CONFIG_MIGRATION) && defined(CONFIG_HOTPLUG_CPU) - migrate_on_reclaim_init(); -#endif #ifdef CONFIG_PROC_FS proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op); proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);

From patchwork Fri May 27 12:25:25 2022
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12863323
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes, Aneesh Kumar K.V
Subject: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
Date: Fri, 27 May 2022 17:55:25 +0530
Message-Id: <20220527122528.129445-5-aneesh.kumar@linux.ibm.com>
In-Reply-To: <20220527122528.129445-1-aneesh.kumar@linux.ibm.com>
References: <20220527122528.129445-1-aneesh.kumar@linux.ibm.com>

From: Jagdish Gediya

By default, all nodes are assigned to DEFAULT_MEMORY_TIER, which is memory tier 1. That tier is designated for nodes with DRAM, so it is not the right tier for dax devices. Set the dax kmem device node's tier to MEMORY_TIER_PMEM.

In the future, support should be added to distinguish the dax devices which should not be in MEMORY_TIER_PMEM, and the right memory tier should be set for them.
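For illustration only (the node number is an assumption): once dax kmem onlines a PMEM-backed node, say node 3, the per-node attribute added earlier in this series would be expected to report the PMEM tier (MEMORY_TIER_PMEM == 2):

  $ cat /sys/devices/system/node/node3/memtier
  2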
Signed-off-by: Jagdish Gediya
Signed-off-by: Aneesh Kumar K.V
---
 drivers/dax/kmem.c | 4 ++++
 mm/migrate.c       | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index a37622060fff..991782aa2448 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -11,6 +11,7 @@ #include #include #include +#include #include "dax-private.h" #include "bus.h" @@ -147,6 +148,9 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) dev_set_drvdata(dev, data); +#ifdef CONFIG_TIERED_MEMORY + node_set_memory_tier(numa_node, MEMORY_TIER_PMEM); +#endif return 0; err_request_mem: diff --git a/mm/migrate.c b/mm/migrate.c index d819a64db5b1..59d8558dd2ee 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2418,6 +2418,8 @@ int node_set_memory_tier(int node, int tier) return ret; } +EXPORT_SYMBOL_GPL(node_set_memory_tier); + /** * next_demotion_node() - Get the next node in the demotion path
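To see the result of this change on a live system, the sysfs files introduced earlier in this series can be read back directly. A minimal userspace sketch, assuming node1 is the dax kmem node and that MEMORY_TIER_PMEM corresponds to memtier2 (both are illustrative values, not part of this patch):

#include <stdio.h>

static void dump(const char *path)
{
	char line[256];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return;
	}
	while (fgets(line, sizeof(line), f))
		printf("%s: %s", path, line);
	fclose(f);
}

int main(void)
{
	/* Tier index the dax kmem node ended up in (node1 is illustrative). */
	dump("/sys/devices/system/node/node1/memtier");
	/* Nodes currently in the PMEM tier; memtier2 assumes MEMORY_TIER_PMEM == 2. */
	dump("/sys/devices/system/memtier/memtier2/nodelist");
	return 0;
}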
From patchwork Fri May 27 12:25:26 2022
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12863324
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K.V"
Subject: [RFC PATCH v4 5/7] mm/demotion: Add support to associate rank with memory tier
Date: Fri, 27 May 2022 17:55:26 +0530
Message-Id: <20220527122528.129445-6-aneesh.kumar@linux.ibm.com>
In-Reply-To: <20220527122528.129445-1-aneesh.kumar@linux.ibm.com>
References: <20220527122528.129445-1-aneesh.kumar@linux.ibm.com>

The rank approach allows us to keep memory tier device IDs stable even if there is a need to change the tier ordering among different memory tiers. For example:
DRAM nodes with CPUs will always be on memtier1, no matter how many tiers are higher or lower than these nodes. A new memory tier can be inserted into the tier hierarchy for a new set of nodes without affecting the node assignment of any existing memtier, provided that there is enough gap in the rank values for the new memtier.

The absolute value of a memtier's "rank" doesn't necessarily carry any meaning; its value relative to other memtiers decides the level of this memtier in the tier hierarchy. For now, this patch supports hardcoded rank values of 100, 200, and 300 for memory tiers 0, 1, and 2 respectively.

Below is the sysfs interface to read the rank value of a memory tier:

/sys/devices/system/memtier/memtierN/rank

This interface is read-only for now. Write support can be added when there is a need for more than three memory tiers with a flexible ordering requirement among them; rank can be used for that, since it is now the rank, and not the memory tier device ID, that decides the memory tiering order.

Signed-off-by: Aneesh Kumar K.V
---
 drivers/base/node.c     |   5 +-
 drivers/dax/kmem.c      |   2 +-
 include/linux/migrate.h |  17 ++--
 mm/migrate.c            | 218 ++++++++++++++++++++++++----------------
 4 files changed, 144 insertions(+), 98 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c index cf4a58446d8c..892f7c23c94e 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -567,8 +567,11 @@ static ssize_t memtier_show(struct device *dev, char *buf) { int node = dev->id; + int tier_index = node_get_memory_tier_id(node); - return sysfs_emit(buf, "%d\n", node_get_memory_tier(node)); + if (tier_index != -1) + return sysfs_emit(buf, "%d\n", tier_index); + return 0; } static ssize_t memtier_store(struct device *dev, diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 991782aa2448..79953426ddaf 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -149,7 +149,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) dev_set_drvdata(dev, data); #ifdef CONFIG_TIERED_MEMORY - node_set_memory_tier(numa_node, MEMORY_TIER_PMEM); + node_set_memory_tier_rank(numa_node, MEMORY_RANK_PMEM); #endif return 0; diff --git a/include/linux/migrate.h b/include/linux/migrate.h index cbef71a499c1..fd09fd009a69 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -167,18 +167,19 @@ void migrate_vma_finalize(struct migrate_vma *migrate); #ifdef CONFIG_TIERED_MEMORY extern bool numa_demotion_enabled; -#define DEFAULT_MEMORY_TIER 1 - enum memory_tier_type { - MEMORY_TIER_HBM_GPU, - MEMORY_TIER_DRAM, - MEMORY_TIER_PMEM, - MAX_MEMORY_TIERS + MEMORY_RANK_HBM_GPU, + MEMORY_RANK_DRAM, + DEFAULT_MEMORY_RANK = MEMORY_RANK_DRAM, + MEMORY_RANK_PMEM }; +#define DEFAULT_MEMORY_TIER 1 +#define MAX_MEMORY_TIERS 3 + int next_demotion_node(int node); -int node_get_memory_tier(int node); -int node_set_memory_tier(int node, int tier); +int node_get_memory_tier_id(int node); +int node_set_memory_tier_rank(int node, int tier); int node_reset_memory_tier(int node, int tier); #else #define numa_demotion_enabled false diff --git a/mm/migrate.c b/mm/migrate.c index 59d8558dd2ee..f013d14f77ed 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2121,8 +2121,10 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, #ifdef CONFIG_TIERED_MEMORY struct memory_tier { + struct list_head list; struct device dev; nodemask_t nodelist; + int rank; }; struct demotion_nodes { @@ -2139,7 +2141,7 @@ static struct bus_type memory_tier_subsys = { static void establish_migration_targets(void);
DEFINE_MUTEX(memory_tier_lock); -static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS]; +static LIST_HEAD(memory_tiers); /* * node_demotion[] examples: @@ -2206,16 +2208,25 @@ static struct demotion_nodes *node_demotion __read_mostly; static ssize_t nodelist_show(struct device *dev, struct device_attribute *attr, char *buf) { - int tier = dev->id; + struct memory_tier *memtier = to_memory_tier(dev); return sysfs_emit(buf, "%*pbl\n", - nodemask_pr_args(&memory_tiers[tier]->nodelist)); - + nodemask_pr_args(&memtier->nodelist)); } static DEVICE_ATTR_RO(nodelist); +static ssize_t rank_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct memory_tier *memtier = to_memory_tier(dev); + + return sysfs_emit(buf, "%d\n", memtier->rank); +} +static DEVICE_ATTR_RO(rank); + static struct attribute *memory_tier_dev_attrs[] = { &dev_attr_nodelist.attr, + &dev_attr_rank.attr, NULL }; @@ -2235,53 +2246,79 @@ static void memory_tier_device_release(struct device *dev) kfree(tier); } -static int register_memory_tier(int tier) +static void insert_memory_tier(struct memory_tier *memtier) +{ + struct list_head *ent; + struct memory_tier *tmp_memtier; + + list_for_each(ent, &memory_tiers) { + tmp_memtier = list_entry(ent, struct memory_tier, list); + if (tmp_memtier->rank > memtier->rank) { + list_add_tail(&memtier->list, ent); + return; + } + } + list_add_tail(&memtier->list, &memory_tiers); +} + +static struct memory_tier *register_memory_tier(unsigned int tier) { int error; + struct memory_tier *memtier; - memory_tiers[tier] = kzalloc(sizeof(struct memory_tier), GFP_KERNEL); - if (!memory_tiers[tier]) - return -ENOMEM; + if (tier >= MAX_MEMORY_TIERS) + return NULL; - memory_tiers[tier]->dev.id = tier; - memory_tiers[tier]->dev.bus = &memory_tier_subsys; - memory_tiers[tier]->dev.release = memory_tier_device_release; - memory_tiers[tier]->dev.groups = memory_tier_dev_groups; - error = device_register(&memory_tiers[tier]->dev); + memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL); + if (!memtier) + return NULL; + memtier->dev.id = tier; + /* + * For now we only supported hardcoded rank value which + * 100, 200, 300 with no special meaning. 
+ */ + memtier->rank = 100 + 100 * tier; + memtier->dev.bus = &memory_tier_subsys; + memtier->dev.release = memory_tier_device_release; + memtier->dev.groups = memory_tier_dev_groups; + + insert_memory_tier(memtier); + + error = device_register(&memtier->dev); if (error) { - put_device(&memory_tiers[tier]->dev); - memory_tiers[tier] = NULL; + list_del(&memtier->list); + put_device(&memtier->dev); + return NULL; } - - return error; + return memtier; } -static void unregister_memory_tier(int tier) +static void unregister_memory_tier(struct memory_tier *memtier) { - device_unregister(&memory_tiers[tier]->dev); - memory_tiers[tier] = NULL; + list_del(&memtier->list); + device_unregister(&memtier->dev); } static ssize_t -max_tiers_show(struct device *dev, struct device_attribute *attr, char *buf) +max_tier_show(struct device *dev, struct device_attribute *attr, char *buf) { return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS); } -static DEVICE_ATTR_RO(max_tiers); +static DEVICE_ATTR_RO(max_tier); static ssize_t -default_tier_show(struct device *dev, struct device_attribute *attr, char *buf) +default_rank_show(struct device *dev, struct device_attribute *attr, char *buf) { - return sysfs_emit(buf, "%d\n", DEFAULT_MEMORY_TIER); + return sysfs_emit(buf, "%d\n", 100 + 100 * DEFAULT_MEMORY_TIER); } -static DEVICE_ATTR_RO(default_tier); +static DEVICE_ATTR_RO(default_rank); static struct attribute *memoty_tier_attrs[] = { - &dev_attr_max_tiers.attr, - &dev_attr_default_tier.attr, + &dev_attr_max_tier.attr, + &dev_attr_default_rank.attr, NULL }; @@ -2294,52 +2331,61 @@ static const struct attribute_group *memory_tier_attr_groups[] = { NULL, }; -static int __node_get_memory_tier(int node) +static struct memory_tier *__node_get_memory_tier(int node) { - int tier; + struct memory_tier *memtier; - for (tier = 0; tier < MAX_MEMORY_TIERS; tier++) { - if (memory_tiers[tier] && node_isset(node, memory_tiers[tier]->nodelist)) - return tier; + list_for_each_entry(memtier, &memory_tiers, list) { + if (node_isset(node, memtier->nodelist)) + return memtier; } + return NULL; +} - return -1; +static struct memory_tier *__get_memory_tier_from_id(int id) +{ + struct memory_tier *memtier; + + list_for_each_entry(memtier, &memory_tiers, list) { + if (memtier->dev.id == id) + return memtier; + } + return NULL; } + static void node_remove_from_memory_tier(int node) { - int tier; + struct memory_tier *memtier; mutex_lock(&memory_tier_lock); - tier = __node_get_memory_tier(node); - + memtier = __node_get_memory_tier(node); /* * Remove node from tier, if tier becomes * empty then unregister it to make it invisible * in sysfs. */ - node_clear(node, memory_tiers[tier]->nodelist); - if (nodes_empty(memory_tiers[tier]->nodelist)) - unregister_memory_tier(tier); + node_clear(node, memtier->nodelist); + if (nodes_empty(memtier->nodelist)) + unregister_memory_tier(memtier); establish_migration_targets(); - mutex_unlock(&memory_tier_lock); } -int node_get_memory_tier(int node) +int node_get_memory_tier_id(int node) { - int tier; - + int tier = -1; + struct memory_tier *memtier; /* * Make sure memory tier is not unregistered * while it is being read. 
*/ mutex_lock(&memory_tier_lock); - - tier = __node_get_memory_tier(node); - + memtier = __node_get_memory_tier(node); + if (memtier) + tier = memtier->dev.id; mutex_unlock(&memory_tier_lock); return tier; @@ -2348,46 +2394,43 @@ int node_get_memory_tier(int node) int __node_set_memory_tier(int node, int tier) { int ret = 0; - /* - * As register_memory_tier() for new tier can fail, - * try it before modifying existing tier. register - * tier makes tier visible in sysfs. - */ - if (!memory_tiers[tier]) { - ret = register_memory_tier(tier); - if (ret) { + struct memory_tier *memtier; + + memtier = __get_memory_tier_from_id(tier); + if (!memtier) { + memtier = register_memory_tier(tier); + if (!memtier) { + ret = -EINVAL; goto out; } } - - node_set(node, memory_tiers[tier]->nodelist); + node_set(node, memtier->nodelist); establish_migration_targets(); - out: return ret; } int node_reset_memory_tier(int node, int tier) { - int current_tier, ret = 0; + struct memory_tier *current_tier; + int ret = 0; mutex_lock(&memory_tier_lock); current_tier = __node_get_memory_tier(node); - if (current_tier == tier) + if (!current_tier || current_tier->dev.id == tier) goto out; - if (current_tier != -1 ) - node_clear(node, memory_tiers[current_tier]->nodelist); + node_clear(node, current_tier->nodelist); ret = __node_set_memory_tier(node, tier); if (!ret) { - if (nodes_empty(memory_tiers[current_tier]->nodelist)) + if (nodes_empty(current_tier->nodelist)) unregister_memory_tier(current_tier); } else { /* reset it back to older tier */ - ret = __node_set_memory_tier(node, current_tier); + node_set(node, current_tier->nodelist); } out: mutex_unlock(&memory_tier_lock); @@ -2395,15 +2438,13 @@ int node_reset_memory_tier(int node, int tier) return ret; } -int node_set_memory_tier(int node, int tier) +int node_set_memory_tier_rank(int node, int rank) { - int current_tier, ret = 0; - - if (tier >= MAX_MEMORY_TIERS) - return -EINVAL; + struct memory_tier *memtier; + int ret = 0; mutex_lock(&memory_tier_lock); - current_tier = __node_get_memory_tier(node); + memtier = __node_get_memory_tier(node); /* * if node is already part of the tier proceed with the * current tier value, because we might want to establish @@ -2411,15 +2452,17 @@ int node_set_memory_tier(int node, int tier) * before it was made part of N_MEMORY, hence estabilish_migration_targets * will have skipped this node. */ - if (current_tier != -1) - tier = current_tier; - ret = __node_set_memory_tier(node, tier); + if (memtier) + establish_migration_targets(); + else { + /* For now rank value and tier value is same. */ + ret = __node_set_memory_tier(node, rank); + } mutex_unlock(&memory_tier_lock); return ret; } -EXPORT_SYMBOL_GPL(node_set_memory_tier); - +EXPORT_SYMBOL_GPL(node_set_memory_tier_rank); /** * next_demotion_node() - Get the next node in the demotion path @@ -2504,6 +2547,8 @@ static void disable_all_migrate_targets(void) */ static void establish_migration_targets(void) { + struct list_head *ent; + struct memory_tier *memtier; struct demotion_nodes *nd; int tier, target = NUMA_NO_NODE, node; int distance, best_distance; @@ -2518,19 +2563,15 @@ static void establish_migration_targets(void) best_distance = -1; nd = &node_demotion[node]; - tier = __node_get_memory_tier(node); + memtier = __node_get_memory_tier(node); + if (!memtier || list_is_last(&memtier->list, &memory_tiers)) + continue; /* - * Find next tier to demote. + * Get the next memtier to find the demotion node list. 
*/ - while (++tier < MAX_MEMORY_TIERS) { - if (memory_tiers[tier]) - break; - } + memtier = list_next_entry(memtier, list); - if (tier >= MAX_MEMORY_TIERS) - continue; - - nodes_andnot(used, node_states[N_MEMORY], memory_tiers[tier]->nodelist); + nodes_andnot(used, node_states[N_MEMORY], memtier->nodelist); /* * Find all the nodes in the memory tier node list of same best distance. @@ -2588,7 +2629,7 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, * registered, we will continue to use that for the new memory * we are adding here. */ - node_set_memory_tier(arg->status_change_nid, DEFAULT_MEMORY_TIER); + node_set_memory_tier_rank(arg->status_change_nid, DEFAULT_MEMORY_RANK); break; } @@ -2668,6 +2709,7 @@ subsys_initcall(numa_init_sysfs); static int __init memory_tier_init(void) { int ret; + struct memory_tier *memtier; ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups); if (ret) @@ -2677,14 +2719,14 @@ static int __init memory_tier_init(void) * Register only default memory tier to hide all empty * memory tier from sysfs. */ - ret = register_memory_tier(DEFAULT_MEMORY_TIER); - if (ret) + memtier = register_memory_tier(DEFAULT_MEMORY_TIER); + if (!memtier) panic("%s() failed to register memory tier: %d\n", __func__, ret); /* * CPU only nodes are not part of memoty tiers. */ - memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY]; + memtier->nodelist = node_states[N_MEMORY]; migrate_on_reclaim_init(); return 0;
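The sorted-by-rank list built by insert_memory_tier() above is what decouples the demotion order from the memtier device IDs. Below is a small standalone sketch of the same idea in plain C; the tier IDs and rank values are hypothetical (and the MAX_MEMORY_TIERS limit is ignored) so that a tier added later with an in-between rank can be seen slotting between existing tiers without renumbering them:

#include <stdio.h>
#include <stdlib.h>

struct tier {
	int id;			/* stable device id: memtier<id> */
	int rank;		/* relative position; demotion walks toward higher rank */
	struct tier *next;
};

static struct tier *tiers;	/* kept sorted by ascending rank */

static void insert_tier(int id, int rank)
{
	struct tier **pos = &tiers;
	struct tier *t = malloc(sizeof(*t));

	if (!t)
		exit(1);
	t->id = id;
	t->rank = rank;
	/* Insert before the first tier whose rank is greater, like insert_memory_tier(). */
	while (*pos && (*pos)->rank <= rank)
		pos = &(*pos)->next;
	t->next = *pos;
	*pos = t;
}

int main(void)
{
	insert_tier(0, 100);	/* e.g. HBM/GPU */
	insert_tier(1, 200);	/* e.g. DRAM, the default tier */
	insert_tier(2, 300);	/* e.g. PMEM */
	insert_tier(3, 250);	/* hypothetical later tier, slots between DRAM and PMEM */

	for (struct tier *t = tiers; t; t = t->next)
		printf("memtier%d rank=%d\n", t->id, t->rank);
	return 0;
}

The device IDs stay 0, 1, 2, 3, while the printed order follows rank, which matches the commit message's claim that a new tier can be slotted in as long as there is a gap in the rank values.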
From patchwork Fri May 27 12:25:27 2022
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12863325
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K.V"
Subject: [RFC PATCH v4 6/7] mm/demotion: Add support for removing node from demotion memory tiers
Date: Fri, 27 May 2022 17:55:27 +0530
Message-Id: <20220527122528.129445-7-aneesh.kumar@linux.ibm.com>
In-Reply-To: <20220527122528.129445-1-aneesh.kumar@linux.ibm.com>
References: <20220527122528.129445-1-aneesh.kumar@linux.ibm.com>
This patch adds the special string "none" as a supported memtier value that can be used to remove a specific node from being used as a demotion target. For example:

:/sys/devices/system/node/node1# cat memtier
1
:/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
1-3
:/sys/devices/system/node/node1# echo none > memtier
:/sys/devices/system/node/node1#
:/sys/devices/system/node/node1# cat memtier
:/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
2-3
:/sys/devices/system/node/node1#

Signed-off-by: Aneesh Kumar K.V
---
 drivers/base/node.c     |  7 ++++++-
 include/linux/migrate.h |  1 +
 mm/migrate.c            | 15 +++++++++++++--
 3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c index 892f7c23c94e..5311cf1db500 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -578,10 +578,15 @@ static ssize_t memtier_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { + int ret; unsigned long tier; int node = dev->id; - int ret = kstrtoul(buf, 10, &tier); + if (!strncmp(buf, "none", strlen("none"))) { + node_remove_from_memory_tier(node); + return count; + } + ret = kstrtoul(buf, 10, &tier); if (ret) return ret; diff --git a/include/linux/migrate.h b/include/linux/migrate.h index fd09fd009a69..77c581f47953 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -178,6 +178,7 @@ enum memory_tier_type { #define MAX_MEMORY_TIERS 3 int next_demotion_node(int node); +void node_remove_from_memory_tier(int node); int node_get_memory_tier_id(int node); int node_set_memory_tier_rank(int node, int tier); int node_reset_memory_tier(int node, int tier); diff --git a/mm/migrate.c b/mm/migrate.c index f013d14f77ed..114c7428b9f3 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2354,7 +2354,7 @@ static struct memory_tier *__get_memory_tier_from_id(int id) } -static void node_remove_from_memory_tier(int node) +void node_remove_from_memory_tier(int node) { struct memory_tier *memtier; @@ -2418,7 +2418,18 @@ int node_reset_memory_tier(int node, int tier) mutex_lock(&memory_tier_lock); current_tier = __node_get_memory_tier(node); - if (!current_tier || current_tier->dev.id == tier) + if (!current_tier) { + /* + * If a N_MEMORY node doesn't have a tier index, then + * we removed it from demotion earlier and we are trying + * add it back. Just add the node to requested tier.
+ */ + if (node_state(node, N_MEMORY)) + ret = __node_set_memory_tier(node, tier); + goto out; + } + + if (current_tier->dev.id == tier) goto out; node_clear(node, current_tier->nodelist);
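Any tool that can write to a node's memtier file can use the new "none" keyword. A minimal sketch in C, assuming node1 as in the example above and assuming tier 1 is the tier the node should later rejoin (both values are illustrative):

#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs(val, f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f);
}

int main(void)
{
	const char *memtier = "/sys/devices/system/node/node1/memtier";

	/* Stop using node1 as a demotion target. */
	if (write_str(memtier, "none"))
		perror("write none");

	/* Later, put it back into tier 1; per this patch, a numeric write re-adds the node. */
	if (write_str(memtier, "1"))
		perror("write 1");

	return 0;
}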
From patchwork Fri May 27 12:25:28 2022
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12863326
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K.V"
Subject: [RFC PATCH v4 7/7] mm/demotion: Demote pages according to allocation fallback order
Date: Fri, 27 May 2022 17:55:28 +0530
Message-Id: <20220527122528.129445-8-aneesh.kumar@linux.ibm.com>
In-Reply-To: <20220527122528.129445-1-aneesh.kumar@linux.ibm.com>
References: <20220527122528.129445-1-aneesh.kumar@linux.ibm.com>

From: Jagdish Gediya

Currently, a higher tier node can only be demoted to selected nodes on the next lower tier as defined by the demotion path, not to any other node from any lower tier. This strict, hard-coded demotion order does not work in all use cases (e.g. some use cases may want to allow cross-socket demotion to another node in the same demotion tier as a fallback when the preferred demotion node is out of space). This demotion order is also inconsistent with the page allocation fallback order when all the nodes in a higher tier are out of space: the page allocation can fall back to any node from any lower tier, whereas the demotion order doesn't allow that currently.
This patch adds support to get all the allowed demotion targets mask for node, also demote_page_list() function is modified to utilize this allowed node mask by filling it in migration_target_control structure before passing it to migrate_pages(). Signed-off-by: Jagdish Gediya Signed-off-by: Aneesh Kumar K.V --- include/linux/migrate.h | 5 ++++ mm/migrate.c | 52 +++++++++++++++++++++++++++++++++++++---- mm/vmscan.c | 38 ++++++++++++++---------------- 3 files changed, 71 insertions(+), 24 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 77c581f47953..1f3cbd5185ca 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -182,6 +182,7 @@ void node_remove_from_memory_tier(int node); int node_get_memory_tier_id(int node); int node_set_memory_tier_rank(int node, int tier); int node_reset_memory_tier(int node, int tier); +void node_get_allowed_targets(int node, nodemask_t *targets); #else #define numa_demotion_enabled false static inline int next_demotion_node(int node) @@ -189,6 +190,10 @@ static inline int next_demotion_node(int node) return NUMA_NO_NODE; } +static inline void node_get_allowed_targets(int node, nodemask_t *targets) +{ + *targets = NODE_MASK_NONE; +} #endif /* CONFIG_TIERED_MEMORY */ #endif /* _LINUX_MIGRATE_H */ diff --git a/mm/migrate.c b/mm/migrate.c index 114c7428b9f3..84fac477538c 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2129,6 +2129,7 @@ struct memory_tier { struct demotion_nodes { nodemask_t preferred; + nodemask_t allowed; }; #define to_memory_tier(device) container_of(device, struct memory_tier, dev) @@ -2475,6 +2476,25 @@ int node_set_memory_tier_rank(int node, int rank) } EXPORT_SYMBOL_GPL(node_set_memory_tier_rank); +void node_get_allowed_targets(int node, nodemask_t *targets) +{ + /* + * node_demotion[] is updated without excluding this + * function from running. + * + * If any node is moving to lower tiers then modifications + * in node_demotion[] are still valid for this node, if any + * node is moving to higher tier then moving node may be + * used once for demotion which should be ok so rcu should + * be enough here. + */ + rcu_read_lock(); + + *targets = node_demotion[node].allowed; + + rcu_read_unlock(); +} + /** * next_demotion_node() - Get the next node in the demotion path * @node: The starting node to lookup the next node @@ -2534,8 +2554,10 @@ static void __disable_all_migrate_targets(void) { int node; - for_each_node_mask(node, node_states[N_MEMORY]) + for_each_node_mask(node, node_states[N_MEMORY]) { node_demotion[node].preferred = NODE_MASK_NONE; + node_demotion[node].allowed = NODE_MASK_NONE; + } } static void disable_all_migrate_targets(void) @@ -2558,12 +2580,11 @@ static void disable_all_migrate_targets(void) */ static void establish_migration_targets(void) { - struct list_head *ent; struct memory_tier *memtier; struct demotion_nodes *nd; - int tier, target = NUMA_NO_NODE, node; + int target = NUMA_NO_NODE, node; int distance, best_distance; - nodemask_t used; + nodemask_t used, allowed = NODE_MASK_NONE; if (!node_demotion) return; @@ -2603,6 +2624,29 @@ static void establish_migration_targets(void) } } while (1); } + /* + * Now build the allowed mask for each node collecting node mask from + * all memory tier below it. This allows us to fallback demotion page + * allocation to a set of nodes that is closer the above selected + * perferred node. + */ + list_for_each_entry(memtier, &memory_tiers, list) + nodes_or(allowed, allowed, memtier->nodelist); + /* + * Removes nodes not yet in N_MEMORY. 
+ */ + nodes_and(allowed, node_states[N_MEMORY], allowed); + + list_for_each_entry(memtier, &memory_tiers, list) { + /* + * Keep removing current tier from allowed nodes, + * This will remove all nodes in current and above + * memory tier from the allowed mask. + */ + nodes_andnot(allowed, allowed, memtier->nodelist); + for_each_node_mask(node, memtier->nodelist) + node_demotion[node].allowed = allowed; + } } /* diff --git a/mm/vmscan.c b/mm/vmscan.c index 1678802e03e7..feb994589481 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1454,23 +1454,6 @@ static void folio_check_dirty_writeback(struct folio *folio, mapping->a_ops->is_dirty_writeback(&folio->page, dirty, writeback); } -static struct page *alloc_demote_page(struct page *page, unsigned long node) -{ - struct migration_target_control mtc = { - /* - * Allocate from 'node', or fail quickly and quietly. - * When this happens, 'page' will likely just be discarded - * instead of migrated. - */ - .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | - __GFP_THISNODE | __GFP_NOWARN | - __GFP_NOMEMALLOC | GFP_NOWAIT, - .nid = node - }; - - return alloc_migration_target(page, (unsigned long)&mtc); -} - /* * Take pages on @demote_list and attempt to demote them to * another node. Pages which are not demoted are left on @@ -1481,6 +1464,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages, { int target_nid = next_demotion_node(pgdat->node_id); unsigned int nr_succeeded; + nodemask_t allowed_mask; + + struct migration_target_control mtc = { + /* + * Allocate from 'node', or fail quickly and quietly. + * When this happens, 'page' will likely just be discarded + * instead of migrated. + */ + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | + __GFP_NOMEMALLOC | GFP_NOWAIT, + .nid = target_nid, + .nmask = &allowed_mask + }; if (list_empty(demote_pages)) return 0; @@ -1488,10 +1484,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages, if (target_nid == NUMA_NO_NODE) return 0; + node_get_allowed_targets(pgdat->node_id, &allowed_mask); + /* Demotion ignores all cpuset and mempolicy settings */ - migrate_pages(demote_pages, alloc_demote_page, NULL, - target_nid, MIGRATE_ASYNC, MR_DEMOTION, - &nr_succeeded); + migrate_pages(demote_pages, alloc_migration_target, NULL, + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, + &nr_succeeded); if (current_is_kswapd()) __count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
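The allowed-mask construction in establish_migration_targets() above boils down to a simple set computation: take the union of every tier's nodelist, restrict it to N_MEMORY, then peel off one tier at a time so that each node is left with only the nodes of the tiers below it as allowed demotion targets. A standalone sketch with small bitmasks (the node numbers, tier contents and N_MEMORY mask are made-up values for illustration, not the kernel's nodemask_t API):

#include <stdio.h>

#define NTIERS 3

int main(void)
{
	/* Bit i represents node i. Tier 0 is the top tier, as in the rank ordering above. */
	unsigned long tier_nodes[NTIERS] = { 0x1, 0x6, 0x18 };	/* {0}, {1,2}, {3,4} */
	unsigned long n_memory = 0x1f;				/* nodes 0-4 have memory */
	unsigned long allowed = 0;

	/* Union of every tier's nodelist, restricted to N_MEMORY. */
	for (int i = 0; i < NTIERS; i++)
		allowed |= tier_nodes[i];
	allowed &= n_memory;

	/*
	 * Remove the current tier before recording it, so each node's
	 * allowed mask covers exactly the nodes of all lower tiers.
	 */
	for (int i = 0; i < NTIERS; i++) {
		allowed &= ~tier_nodes[i];
		printf("nodes in tier %d may demote to mask 0x%lx\n", i, allowed);
	}
	return 0;
}

With these made-up tiers, nodes in tier 0 may demote to nodes 1-4, nodes in tier 1 may demote to nodes 3-4, and the bottom tier has no demotion targets, which mirrors how demote_page_list() then passes the mask to migrate_pages() via migration_target_control.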