From patchwork Fri Jun 10 13:52:26 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12877631
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
 Tim C Chen, Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
 Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
 Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes,
 "Aneesh Kumar K . V"
Subject: [PATCH v6 10/13] mm/demotion: Demote pages according to allocation
 fallback order
Date: Fri, 10 Jun 2022 19:22:26 +0530
Message-Id: <20220610135229.182859-11-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.36.1
In-Reply-To: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com>
References: <20220610135229.182859-1-aneesh.kumar@linux.ibm.com>
Sender: owner-linux-mm@kvack.org
Precedence: bulk

From: Jagdish Gediya

Currently, a higher tier node can only be demoted to selected nodes on
the next lower tier as defined by the demotion path. This strict,
hard-coded demotion order does not work in all use cases (e.g. some use
cases may want to allow cross-socket demotion to another node in the same
demotion tier as a fallback when the preferred demotion node is out of
space). This demotion order is also inconsistent with the page allocation
fallback order when all the nodes in a higher tier are out of space: The
page allocation can fall back to any node from any lower tier, whereas
the demotion order doesn't allow that currently.

This patch adds support for getting all the allowed demotion targets for
a memory tier. The demote_page_list() function now uses this allowed node
mask as the fallback allocation mask.

Signed-off-by: Jagdish Gediya
Signed-off-by: Aneesh Kumar K.V

move allowed mask to memory tier
---
 include/linux/memory-tiers.h |  9 ++++-
 mm/memory-tiers.c            | 75 +++++++++++++++++++++++++++++++++---
 mm/vmscan.c                  | 58 ++++++++++++++++++++-------
 3 files changed, 121 insertions(+), 21 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 53f3e4c7cba8..47841379553c 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -3,11 +3,12 @@
 #define _LINUX_MEMORY_TIERS_H
 
 #include
+#include
+#include
 
 #ifdef CONFIG_TIERED_MEMORY
 
 #include
-#include
 
 #define MEMORY_TIER_HBM_GPU	0
 #define MEMORY_TIER_DRAM	1
@@ -25,6 +26,7 @@ struct memory_tier {
 	struct list_head list;
 	struct device dev;
 	nodemask_t nodelist;
+	nodemask_t lower_tier_mask;
 	int rank;
 };
 
@@ -36,6 +38,7 @@ int node_get_memory_tier_id(int node);
 int node_reset_memory_tier(int node, int tier);
 struct memory_tier *node_get_memory_tier(int node);
 void node_put_memory_tier(struct memory_tier *memtier);
+void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
 
 #else
 
@@ -45,6 +48,10 @@ static inline int next_demotion_node(int node)
 	return NUMA_NO_NODE;
 }
 
+static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	*targets = NODE_MASK_NONE;
+}
 #endif	/* CONFIG_TIERED_MEMORY */
 
 #endif
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 429aa864edb0..b2ed16dcfb03 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -425,6 +425,24 @@ void node_put_memory_tier(struct memory_tier *memtier)
 	put_device(&memtier->dev);
 }
 
+void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	struct memory_tier *memtier;
+
+	/*
+	 * pg_data_t.memtier updates include a synchronize_rcu()
+	 * which ensures that we either find NULL or a valid memtier
+	 * in NODE_DATA. Protect the access via rcu_read_lock().
+	 */
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (memtier)
+		*targets = memtier->lower_tier_mask;
+	else
+		*targets = NODE_MASK_NONE;
+	rcu_read_unlock();
+}
+
 /**
  * next_demotion_node() - Get the next node in the demotion path
  * @node: The starting node to lookup the next node
@@ -473,10 +491,18 @@ int next_demotion_node(int node)
 /* Disable reclaim-based migration. */
 static void __disable_all_migrate_targets(void)
 {
+	pg_data_t *pgdat;
 	int node;
 
-	for_each_node_state(node, N_MEMORY)
+	for_each_node_state(node, N_MEMORY) {
 		node_demotion[node].preferred = NODE_MASK_NONE;
+		/*
+		 * We are holding memory_tier_lock, so it is safe
+		 * to access pgdat->memtier.
+		 */
+		pgdat = NODE_DATA(node);
+		pgdat->memtier->lower_tier_mask = NODE_MASK_NONE;
+	}
 }
 
 static void disable_all_migrate_targets(void)
@@ -503,10 +529,26 @@ static void establish_migration_targets(void)
 	struct demotion_nodes *nd;
 	int target = NUMA_NO_NODE, node;
 	int distance, best_distance;
-	nodemask_t used;
-
-	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))
-		return;
+	nodemask_t used, lower_tier = NODE_MASK_NONE;
+
+	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION)) {
+		pg_data_t *pgdat;
+
+		for_each_node_state(node, N_MEMORY) {
+			/*
+			 * We are holding memory_tier_lock, so it is safe
+			 * to access pgdat->memtier.
+			 */
+			pgdat = NODE_DATA(node);
+			pgdat->memtier->lower_tier_mask = NODE_MASK_NONE;
+		}
+		/*
+		 * Wait for the read side to finish working with old values
+		 * or see the updated NODE_MASK_NONE.
+		 */
+		synchronize_rcu();
+		goto build_lower_tier_mask;
+	}
 
 	disable_all_migrate_targets();
 
@@ -549,6 +591,29 @@ static void establish_migration_targets(void)
 			}
 		} while (1);
 	}
+build_lower_tier_mask:
+	/*
+	 * Now build the lower_tier mask for each node, collecting the node mask
+	 * from all memory tiers below it. This allows us to fall back demotion
+	 * page allocation to a set of nodes that is closer to the above
+	 * selected preferred node.
+	 */
+	list_for_each_entry(memtier, &memory_tiers, list)
+		nodes_or(lower_tier, lower_tier, memtier->nodelist);
+	/*
+	 * Remove nodes not yet in N_MEMORY.
+	 */
+	nodes_and(lower_tier, node_states[N_MEMORY], lower_tier);
+
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		/*
+		 * Keep removing the current tier from the lower_tier nodes.
+		 * This will remove all nodes in the current and higher
+		 * memory tiers from the lower_tier mask.
+		 */
+		nodes_andnot(lower_tier, lower_tier, memtier->nodelist);
+		memtier->lower_tier_mask = lower_tier;
+	}
 }
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3a8f78277f99..2b213248effa 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1460,19 +1460,32 @@ static void folio_check_dirty_writeback(struct folio *folio,
 		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }
 
-static struct page *alloc_demote_page(struct page *page, unsigned long node)
+static struct page *alloc_demote_page(struct page *page, unsigned long private)
 {
-	struct migration_target_control mtc = {
-		/*
-		 * Allocate from 'node', or fail quickly and quietly.
-		 * When this happens, 'page' will likely just be discarded
-		 * instead of migrated.
-		 */
-		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
-			    __GFP_THISNODE  | __GFP_NOWARN |
-			    __GFP_NOMEMALLOC | GFP_NOWAIT,
-		.nid = node
-	};
+	struct page *target_page;
+	nodemask_t *allowed_mask;
+	struct migration_target_control *mtc;
+
+	mtc = (struct migration_target_control *)private;
+
+	allowed_mask = mtc->nmask;
+	/*
+	 * Make sure we allocate from the target node first, also trying to
+	 * reclaim pages from the target node via kswapd if we are low on
+	 * free memory on the target node. If we don't do this and we have low
+	 * free memory on the target memtier, we would start allocating pages
+	 * from higher memory tiers without even forcing a demotion of cold
+	 * pages from the target memtier. This can result in the kernel placing
+	 * hot pages in higher memory tiers.
+	 */
+	mtc->nmask = NULL;
+	mtc->gfp_mask |= __GFP_THISNODE;
+	target_page = alloc_migration_target(page, (unsigned long)mtc);
+	if (target_page)
+		return target_page;
+
+	mtc->gfp_mask &= ~__GFP_THISNODE;
+	mtc->nmask = allowed_mask;
 
-	return alloc_migration_target(page, (unsigned long)&mtc);
+	return alloc_migration_target(page, (unsigned long)mtc);
 }
 
@@ -1487,6 +1500,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
+	nodemask_t allowed_mask;
+
+	struct migration_target_control mtc = {
+		/*
+		 * Allocate from 'node', or fail quickly and quietly.
+		 * When this happens, 'page' will likely just be discarded
+		 * instead of migrated.
+		 */
+		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
+			__GFP_NOMEMALLOC | GFP_NOWAIT,
+		.nid = target_nid,
+		.nmask = &allowed_mask
+	};
 
 	if (list_empty(demote_pages))
 		return 0;
@@ -1494,10 +1520,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 	if (target_nid == NUMA_NO_NODE)
 		return 0;
 
+	node_get_allowed_targets(pgdat, &allowed_mask);
+
 	/* Demotion ignores all cpuset and mempolicy settings */
 	migrate_pages(demote_pages, alloc_demote_page, NULL,
-			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
-			    &nr_succeeded);
+		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
+		      &nr_succeeded);
 
 	if (current_is_kswapd())
 		__count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);