From patchwork Wed Oct 16 22:11:49 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Hansen X-Patchwork-Id: 11194511 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0035376 for ; Wed, 16 Oct 2019 22:14:17 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id C164B207FF for ; Wed, 16 Oct 2019 22:14:16 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org C164B207FF Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 1248C8E0006; Wed, 16 Oct 2019 18:14:14 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 0FDFA8E0001; Wed, 16 Oct 2019 18:14:14 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id F07FB8E0006; Wed, 16 Oct 2019 18:14:13 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0166.hostedemail.com [216.40.44.166]) by kanga.kvack.org (Postfix) with ESMTP id CF1A68E0001 for ; Wed, 16 Oct 2019 18:14:13 -0400 (EDT) Received: from smtpin02.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with SMTP id 7335D6131 for ; Wed, 16 Oct 2019 22:14:13 +0000 (UTC) X-FDA: 76051051986.02.legs82_7ace5705bd42 X-Spam-Summary: 
1,0,0,,d41d8cd98f00b204,dave.hansen@linux.intel.com,:linux-kernel@vger.kernel.org::dan.j.williams@intel.com:dave.hansen@linux.intel.com:keith.busch@intel.com,RULES_HIT:30051:30054:30064:30080,0,RBL:192.55.52.151:@linux.intel.com:.lbl8.mailshell.net-62.18.0.100 64.95.201.95,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:27,LUA_SUMMARY:none X-HE-Tag: legs82_7ace5705bd42 X-Filterd-Recvd-Size: 6094 Received: from mga17.intel.com (mga17.intel.com [192.55.52.151]) by imf31.hostedemail.com (Postfix) with ESMTP for ; Wed, 16 Oct 2019 22:14:12 +0000 (UTC) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga007.fm.intel.com ([10.253.24.52]) by fmsmga107.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 16 Oct 2019 15:14:11 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.67,305,1566889200"; d="scan'208";a="195725844" Received: from viggo.jf.intel.com (HELO localhost.localdomain) ([10.54.77.144]) by fmsmga007.fm.intel.com with ESMTP; 16 Oct 2019 15:14:10 -0700 Subject: [PATCH 1/4] node: Define and export memory migration path To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org,dan.j.williams@intel.com,Dave Hansen ,keith.busch@intel.com From: Dave Hansen Date: Wed, 16 Oct 2019 15:11:49 -0700 References: <20191016221148.F9CCD155@viggo.jf.intel.com> In-Reply-To: <20191016221148.F9CCD155@viggo.jf.intel.com> Message-Id: <20191016221149.74AE222C@viggo.jf.intel.com> X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Keith Busch Prepare for the kernel to auto-migrate pages to other memory nodes with a user-defined node migration table. This allows creating a single migration target for each NUMA node, which enables the kernel to do NUMA page migrations instead of simply reclaiming colder pages. 
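As a rough illustration of such a table, the userspace C sketch below keeps one target per node and refuses any update that would make a migration path loop back on itself. It is not the kernel code: MAX_NODES, migration_target[] and set_migration_target() are names invented for this example, and locking and node-online checks are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 8		/* hypothetical node count for this sketch */
#define TERMINAL_NODE (-1)

/* One migration target per node; every node starts out terminal. */
static int migration_target[MAX_NODES] = {
	TERMINAL_NODE, TERMINAL_NODE, TERMINAL_NODE, TERMINAL_NODE,
	TERMINAL_NODE, TERMINAL_NODE, TERMINAL_NODE, TERMINAL_NODE,
};

/*
 * Point 'nid' at 'next' unless that would create a cycle.
 * A negative 'next' makes 'nid' terminal again.
 * Returns true if the table was updated, false if the update was rejected.
 */
static bool set_migration_target(int nid, int next)
{
	bool visited[MAX_NODES] = { false };
	int i;

	assert(nid >= 0 && nid < MAX_NODES);

	if (next < 0) {
		migration_target[nid] = TERMINAL_NODE;
		return true;
	}
	if (next >= MAX_NODES)
		return false;

	/*
	 * Walk the existing path starting at 'next', including its last
	 * node. Reaching 'nid' (or any node twice) means a cycle.
	 */
	visited[nid] = true;
	for (i = next; i != TERMINAL_NODE; i = migration_target[i]) {
		if (visited[i])
			return false;
		visited[i] = true;
	}

	migration_target[nid] = next;
	return true;
}
```

With 0->1 and 1->2 already configured, a request for 2->0 is rejected, and so is a node that names itself as its own target.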
A node with no target is a "terminal node", so reclaim acts normally there. The migration target does not fundamentally _need_ to be a single node, but this implementation starts there to limit complexity. If you consider the migration path as a graph, cycles (loops) in the graph are disallowed. This avoids wasting resources by constantly migrating (A->B, B->A, A->B ...). The expectation is that cycles will never be allowed, and this rule is enforced if the user tries to make such a cycle. Signed-off-by: Keith Busch Signed-off-by: Dave Hansen --- b/drivers/base/node.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++ b/include/linux/node.h | 6 ++++ 2 files changed, 79 insertions(+) diff -puN drivers/base/node.c~0003-node-Define-and-export-memory-migration-path drivers/base/node.c --- a/drivers/base/node.c~0003-node-Define-and-export-memory-migration-path 2019-10-16 15:06:55.895952599 -0700 +++ b/drivers/base/node.c 2019-10-16 15:06:55.902952599 -0700 @@ -101,6 +101,10 @@ static const struct attribute_group *nod NULL, }; +#define TERMINAL_NODE -1 +static int node_migration[MAX_NUMNODES] = {[0 ... 
MAX_NUMNODES - 1] = TERMINAL_NODE}; +static DEFINE_SPINLOCK(node_migration_lock); + static void node_remove_accesses(struct node *node) { struct node_access_nodes *c, *cnext; @@ -530,6 +534,74 @@ static ssize_t node_read_distance(struct } static DEVICE_ATTR(distance, S_IRUGO, node_read_distance, NULL); +static ssize_t migration_path_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + return sprintf(buf, "%d\n", node_migration[dev->id]); +} + +static ssize_t migration_path_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + int i, err, nid = dev->id; + nodemask_t visited = NODE_MASK_NONE; + long next; + + err = kstrtol(buf, 0, &next); + if (err) + return -EINVAL; + + if (next < 0) { + spin_lock(&node_migration_lock); + WRITE_ONCE(node_migration[nid], TERMINAL_NODE); + spin_unlock(&node_migration_lock); + return count; + } + if (next >= MAX_NUMNODES || !node_online(next)) + return -EINVAL; + + /* + * Follow the entire migration path from 'nid' through the point where + * we hit a TERMINAL_NODE. + * + * Don't allow migration cycles in the path. + */ + node_set(nid, visited); + spin_lock(&node_migration_lock); + for (i = next; i != TERMINAL_NODE; + i = node_migration[i]) { + /* Fail if we have visited this node already */ + if (node_test_and_set(i, visited)) { + spin_unlock(&node_migration_lock); + return -EINVAL; + } + } + WRITE_ONCE(node_migration[nid], next); + spin_unlock(&node_migration_lock); + + return count; +} +static DEVICE_ATTR_RW(migration_path); + +/** + * next_migration_node() - Get the next node in the migration path + * @current_node: The starting node to look up the next node + * + * @returns: node id for next memory node in the migration path hierarchy from + * @current_node; -1 if @current_node is terminal or its migration + * node is not online. 
+ */ +int next_migration_node(int current_node) +{ + int nid = READ_ONCE(node_migration[current_node]); + + if (nid >= 0 && node_online(nid)) + return nid; + return TERMINAL_NODE; +} + static struct attribute *node_dev_attrs[] = { &dev_attr_cpumap.attr, &dev_attr_cpulist.attr, @@ -537,6 +609,7 @@ static struct attribute *node_dev_attrs[ &dev_attr_numastat.attr, &dev_attr_distance.attr, &dev_attr_vmstat.attr, + &dev_attr_migration_path.attr, NULL }; ATTRIBUTE_GROUPS(node_dev); diff -puN include/linux/node.h~0003-node-Define-and-export-memory-migration-path include/linux/node.h --- a/include/linux/node.h~0003-node-Define-and-export-memory-migration-path 2019-10-16 15:06:55.898952599 -0700 +++ b/include/linux/node.h 2019-10-16 15:06:55.902952599 -0700 @@ -134,6 +134,7 @@ static inline int register_one_node(int return error; } +extern int next_migration_node(int current_node); extern void unregister_one_node(int nid); extern int register_cpu_under_node(unsigned int cpu, unsigned int nid); extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid); @@ -186,6 +187,11 @@ static inline void register_hugetlbfs_wi node_registration_func_t unreg) { } + +static inline int next_migration_node(int current_node) +{ + return -1; +} #endif #define to_node(device) container_of(device, struct node, dev) From patchwork Wed Oct 16 22:11:51 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Hansen X-Patchwork-Id: 11194513 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 9AA0176 for ; Wed, 16 Oct 2019 22:14:19 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 6769421D80 for ; Wed, 16 Oct 2019 22:14:19 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 6769421D80 Authentication-Results: mail.kernel.org; 
dmarc=fail (p=none dis=none) header.from=linux.intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 1C2958E0007; Wed, 16 Oct 2019 18:14:15 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 14A7E8E0001; Wed, 16 Oct 2019 18:14:15 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 038D18E0007; Wed, 16 Oct 2019 18:14:14 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0254.hostedemail.com [216.40.44.254]) by kanga.kvack.org (Postfix) with ESMTP id D376B8E0001 for ; Wed, 16 Oct 2019 18:14:14 -0400 (EDT) Received: from smtpin15.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with SMTP id 6F0DB181AC212 for ; Wed, 16 Oct 2019 22:14:14 +0000 (UTC) X-FDA: 76051052028.15.sort75_7d7549512c46 X-Spam-Summary: 1,0,0,,d41d8cd98f00b204,dave.hansen@linux.intel.com,:linux-kernel@vger.kernel.org::dan.j.williams@intel.com:dave.hansen@linux.intel.com:keith.busch@intel.com,RULES_HIT:30012:30045:30054:30064:30070:30090,0,RBL:192.55.52.151:@linux.intel.com:.lbl8.mailshell.net-62.18.0.100 64.95.201.95,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:25,LUA_SUMMARY:none X-HE-Tag: sort75_7d7549512c46 X-Filterd-Recvd-Size: 9058 Received: from mga17.intel.com (mga17.intel.com [192.55.52.151]) by imf31.hostedemail.com (Postfix) with ESMTP for ; Wed, 16 Oct 2019 22:14:13 +0000 (UTC) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga004.jf.intel.com ([10.7.209.38]) by fmsmga107.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 16 Oct 2019 15:14:12 -0700 X-ExtLoop1: 1 X-IronPort-AV: 
E=Sophos;i="5.67,305,1566889200"; d="scan'208";a="347561461" Received: from viggo.jf.intel.com (HELO localhost.localdomain) ([10.54.77.144]) by orsmga004.jf.intel.com with ESMTP; 16 Oct 2019 15:14:12 -0700 Subject: [PATCH 2/4] mm/migrate: Defer allocating new page until needed To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org,dan.j.williams@intel.com,Dave Hansen ,keith.busch@intel.com From: Dave Hansen Date: Wed, 16 Oct 2019 15:11:51 -0700 References: <20191016221148.F9CCD155@viggo.jf.intel.com> In-Reply-To: <20191016221148.F9CCD155@viggo.jf.intel.com> Message-Id: <20191016221151.854D5735@viggo.jf.intel.com> X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Keith Busch Page migration had been allocating the new page before it was actually needed. Subsequent operations may still fail, and each of those failure paths then had to clean up a newly allocated page that was never used. Defer allocating the page until we are actually ready to make use of it, after locking the original page. This simplifies error handling and should not cause any functional change. It only refactors page migration so the main part can more easily be reused by other code. 
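The ordering argument above can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct item, get_new_item(), put_new_item() and move_item() are hypothetical stand-ins, not the mm/migrate.c functions. The point is that the destination is allocated after the cheap checks, so the early exits have nothing to free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a page; only the state the sketch needs. */
struct item {
	bool locked;
	bool under_writeback;
};

static int live_allocations;	/* lets a caller check that nothing leaked */

static void *get_new_item(void)
{
	void *p = malloc(sizeof(struct item));

	if (p)
		live_allocations++;
	return p;
}

static void put_new_item(void *p)
{
	assert(p != NULL);
	free(p);
	live_allocations--;
}

/*
 * Prepare to move 'src' to a freshly allocated destination.
 * Returns 0 and stores the destination in *dst on success, or a negative
 * value on failure, in which case *dst is left untouched.
 */
static int move_item(struct item *src, void **dst)
{
	void *newitem;
	int rc;

	if (src->locked)
		return -1;		/* could not lock: nothing allocated yet */
	src->locked = true;

	if (src->under_writeback) {
		rc = -2;		/* still nothing allocated */
		goto out_unlock;
	}

	newitem = get_new_item();	/* deferred until actually needed */
	if (!newitem) {
		rc = -3;
		goto out_unlock;
	}
	*dst = newitem;
	rc = 0;

out_unlock:
	src->locked = false;
	return rc;
}
```

Only the success path hands an allocation back to the caller, so only the caller of a successful move has to release it with put_new_item().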
Signed-off-by: Keith Busch Signed-off-by: Dave Hansen --- b/mm/migrate.c | 154 ++++++++++++++++++++++++++++----------------------------- 1 file changed, 76 insertions(+), 78 deletions(-) diff -puN mm/migrate.c~0004-mm-migrate-Defer-allocating-new-page-until-needed mm/migrate.c --- a/mm/migrate.c~0004-mm-migrate-Defer-allocating-new-page-until-needed 2019-10-16 15:06:57.032952596 -0700 +++ b/mm/migrate.c 2019-10-16 15:06:57.037952596 -0700 @@ -1005,56 +1005,17 @@ out: return rc; } -static int __unmap_and_move(struct page *page, struct page *newpage, - int force, enum migrate_mode mode) +static int __unmap_and_move(new_page_t get_new_page, + free_page_t put_new_page, + unsigned long private, struct page *page, + enum migrate_mode mode, + enum migrate_reason reason) { int rc = -EAGAIN; int page_was_mapped = 0; struct anon_vma *anon_vma = NULL; bool is_lru = !__PageMovable(page); - - if (!trylock_page(page)) { - if (!force || mode == MIGRATE_ASYNC) - goto out; - - /* - * It's not safe for direct compaction to call lock_page. - * For example, during page readahead pages are added locked - * to the LRU. Later, when the IO completes the pages are - * marked uptodate and unlocked. However, the queueing - * could be merging multiple pages for one bio (e.g. - * mpage_readpages). If an allocation happens for the - * second or third page, the process can end up locking - * the same page twice and deadlocking. Rather than - * trying to be clever about what pages can be locked, - * avoid the use of lock_page for direct compaction - * altogether. - */ - if (current->flags & PF_MEMALLOC) - goto out; - - lock_page(page); - } - - if (PageWriteback(page)) { - /* - * Only in the case of a full synchronous migration is it - * necessary to wait for PageWriteback. 
In the async case, - * the retry loop is too short and in the sync-light case, - * the overhead of stalling is too much - */ - switch (mode) { - case MIGRATE_SYNC: - case MIGRATE_SYNC_NO_COPY: - break; - default: - rc = -EBUSY; - goto out_unlock; - } - if (!force) - goto out_unlock; - wait_on_page_writeback(page); - } + struct page *newpage; /* * By try_to_unmap(), page->mapcount goes down to 0 here. In this case, @@ -1073,6 +1034,12 @@ static int __unmap_and_move(struct page if (PageAnon(page) && !PageKsm(page)) anon_vma = page_get_anon_vma(page); + newpage = get_new_page(page, private); + if (!newpage) { + rc = -ENOMEM; + goto out; + } + /* * Block others from accessing the new page when we get around to * establishing additional references. We are usually the only one @@ -1082,11 +1049,11 @@ static int __unmap_and_move(struct page * This is much like races on refcount of oldpage: just don't BUG(). */ if (unlikely(!trylock_page(newpage))) - goto out_unlock; + goto out_put; if (unlikely(!is_lru)) { rc = move_to_new_page(newpage, page, mode); - goto out_unlock_both; + goto out_unlock; } /* @@ -1105,7 +1072,7 @@ static int __unmap_and_move(struct page VM_BUG_ON_PAGE(PageAnon(page), page); if (page_has_private(page)) { try_to_free_buffers(page); - goto out_unlock_both; + goto out_unlock; } } else if (page_mapped(page)) { /* Establish migration ptes */ @@ -1122,15 +1089,9 @@ static int __unmap_and_move(struct page if (page_was_mapped) remove_migration_ptes(page, rc == MIGRATEPAGE_SUCCESS ? newpage : page, false); - -out_unlock_both: - unlock_page(newpage); out_unlock: - /* Drop an anon_vma reference if we took one */ - if (anon_vma) - put_anon_vma(anon_vma); - unlock_page(page); -out: + unlock_page(newpage); +out_put: /* * If migration is successful, decrease refcount of the newpage * which will not free the page because new page owner increased @@ -1141,12 +1102,20 @@ out: * state. 
*/ if (rc == MIGRATEPAGE_SUCCESS) { + set_page_owner_migrate_reason(newpage, reason); if (unlikely(!is_lru)) put_page(newpage); else putback_lru_page(newpage); + } else if (put_new_page) { + put_new_page(newpage, private); + } else { + put_page(newpage); } - +out: + /* Drop an anon_vma reference if we took one */ + if (anon_vma) + put_anon_vma(anon_vma); return rc; } @@ -1171,16 +1140,11 @@ static ICE_noinline int unmap_and_move(n int force, enum migrate_mode mode, enum migrate_reason reason) { - int rc = MIGRATEPAGE_SUCCESS; - struct page *newpage; + int rc = -EAGAIN; if (!thp_migration_supported() && PageTransHuge(page)) return -ENOMEM; - newpage = get_new_page(page, private); - if (!newpage) - return -ENOMEM; - if (page_count(page) == 1) { /* page was freed from under us. So we are done. */ ClearPageActive(page); @@ -1191,17 +1155,57 @@ static ICE_noinline int unmap_and_move(n __ClearPageIsolated(page); unlock_page(page); } - if (put_new_page) - put_new_page(newpage, private); - else - put_page(newpage); + rc = MIGRATEPAGE_SUCCESS; goto out; } - rc = __unmap_and_move(page, newpage, force, mode); - if (rc == MIGRATEPAGE_SUCCESS) - set_page_owner_migrate_reason(newpage, reason); + if (!trylock_page(page)) { + if (!force || mode == MIGRATE_ASYNC) + return rc; + + /* + * It's not safe for direct compaction to call lock_page. + * For example, during page readahead pages are added locked + * to the LRU. Later, when the IO completes the pages are + * marked uptodate and unlocked. However, the queueing + * could be merging multiple pages for one bio (e.g. + * mpage_readpages). If an allocation happens for the + * second or third page, the process can end up locking + * the same page twice and deadlocking. Rather than + * trying to be clever about what pages can be locked, + * avoid the use of lock_page for direct compaction + * altogether. 
+ */ + if (current->flags & PF_MEMALLOC) + return rc; + lock_page(page); + } + + if (PageWriteback(page)) { + /* + * Only in the case of a full synchronous migration is it + * necessary to wait for PageWriteback. In the async case, + * the retry loop is too short and in the sync-light case, + * the overhead of stalling is too much + */ + switch (mode) { + case MIGRATE_SYNC: + case MIGRATE_SYNC_NO_COPY: + break; + default: + rc = -EBUSY; + goto out_unlock; + } + if (!force) + goto out_unlock; + wait_on_page_writeback(page); + } + rc = __unmap_and_move(get_new_page, put_new_page, private, + page, mode, reason); + +out_unlock: + unlock_page(page); out: if (rc != -EAGAIN) { /* @@ -1242,9 +1246,8 @@ out: if (rc != -EAGAIN) { if (likely(!__PageMovable(page))) { putback_lru_page(page); - goto put_new; + goto done; } - lock_page(page); if (PageMovable(page)) putback_movable_page(page); @@ -1253,13 +1256,8 @@ out: unlock_page(page); put_page(page); } -put_new: - if (put_new_page) - put_new_page(newpage, private); - else - put_page(newpage); } - +done: return rc; } From patchwork Wed Oct 16 22:11:52 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Hansen X-Patchwork-Id: 11194517 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id CF4811668 for ; Wed, 16 Oct 2019 22:14:24 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 99699207FF for ; Wed, 16 Oct 2019 22:14:24 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 99699207FF Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 2D2EF8E0009; Wed, 16 Oct 2019 18:14:23 -0400 (EDT) 
Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 2881E8E0001; Wed, 16 Oct 2019 18:14:23 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 19E988E0009; Wed, 16 Oct 2019 18:14:23 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0226.hostedemail.com [216.40.44.226]) by kanga.kvack.org (Postfix) with ESMTP id E7CDF8E0001 for ; Wed, 16 Oct 2019 18:14:22 -0400 (EDT) Received: from smtpin09.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with SMTP id 63EE3612B for ; Wed, 16 Oct 2019 22:14:22 +0000 (UTC) X-FDA: 76051052364.09.unit63_900354f0e441 X-Spam-Summary: 1,0,0,,d41d8cd98f00b204,dave.hansen@linux.intel.com,:linux-kernel@vger.kernel.org::dan.j.williams@intel.com:dave.hansen@linux.intel.com:keith.busch@intel.com,RULES_HIT:30051:30054:30062:30064:30070:30083:30090:30091,0,RBL:134.134.136.20:@linux.intel.com:.lbl8.mailshell.net-62.18.0.100 64.95.201.95,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:25,LUA_SUMMARY:none X-HE-Tag: unit63_900354f0e441 X-Filterd-Recvd-Size: 8341 Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by imf13.hostedemail.com (Postfix) with ESMTP for ; Wed, 16 Oct 2019 22:14:21 +0000 (UTC) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga006.jf.intel.com ([10.7.209.51]) by orsmga101.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 16 Oct 2019 15:14:14 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.67,305,1566889200"; d="scan'208";a="200197936" Received: from viggo.jf.intel.com (HELO localhost.localdomain) ([10.54.77.144]) by orsmga006.jf.intel.com with ESMTP; 16 Oct 2019 15:14:13 -0700 Subject: [PATCH 3/4] 
mm/vmscan: Attempt to migrate page in lieu of discard To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org,dan.j.williams@intel.com,Dave Hansen ,keith.busch@intel.com From: Dave Hansen Date: Wed, 16 Oct 2019 15:11:52 -0700 References: <20191016221148.F9CCD155@viggo.jf.intel.com> In-Reply-To: <20191016221148.F9CCD155@viggo.jf.intel.com> Message-Id: <20191016221152.BF2171A3@viggo.jf.intel.com> X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Keith Busch If a memory node has a preferred migration path to demote cold pages, attempt to move those inactive pages to that migration node before reclaiming. This will better utilize available memory, provide a faster tier than swapping or discarding, and allow such pages to be reused immediately without IO to retrieve the data. Much like swap, this is an opt-in feature that requires the user to define where to send pages when reclaiming them. When handling anonymous pages, this will be considered before swap if enabled. Should the demotion fail for any reason, the page reclaim will proceed as if the demotion feature were not enabled. Some places we would like to see this used: 1. Persistent memory being used as a slower, cheaper DRAM replacement 2. Remote memory-only "expansion" NUMA nodes 3. Resolving memory imbalances where one NUMA node is seeing more allocation activity than another. This helps keep more recent allocations closer to the CPUs on the node doing the allocating. 
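The "demote first, otherwise reclaim as before" ordering described above can be sketched in a few lines of userspace C. The names reclaim_one(), demote_fn, demote_ok() and demote_fail() are hypothetical and exist only for this illustration; the real decision is made inside shrink_page_list().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TERMINAL_NODE (-1)

enum reclaim_outcome {
	RECLAIM_DEMOTED,	/* page moved to the next node in the path */
	RECLAIM_NORMAL,		/* fell back to ordinary reclaim (swap or discard) */
};

/* Hypothetical callback: returns true if the page was moved to 'target'. */
typedef bool (*demote_fn)(int target);

/* Two trivial callbacks, one that always succeeds and one that always fails. */
static bool demote_ok(int target)
{
	return target >= 0;
}

static bool demote_fail(int target)
{
	(void)target;
	return false;
}

static enum reclaim_outcome reclaim_one(int target, demote_fn try_demote)
{
	assert(try_demote != NULL);

	/* Demotion is opt-in: a terminal node never attempts it. */
	if (target != TERMINAL_NODE && try_demote(target))
		return RECLAIM_DEMOTED;

	/* No target, or demotion failed for any reason: behave as before. */
	return RECLAIM_NORMAL;
}
```

A failed demotion is not treated as an error here; the caller simply continues down the path it would have taken without the feature.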
Signed-off-by: Keith Busch Co-developed-by: Dave Hansen Signed-off-by: Dave Hansen --- b/include/linux/migrate.h | 6 ++++ b/include/trace/events/migrate.h | 3 +- b/mm/debug.c | 1 b/mm/migrate.c | 51 +++++++++++++++++++++++++++++++++++++++ b/mm/vmscan.c | 27 ++++++++++++++++++++ 5 files changed, 87 insertions(+), 1 deletion(-) diff -puN include/linux/migrate.h~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h --- a/include/linux/migrate.h~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2019-10-16 15:06:58.090952593 -0700 +++ b/include/linux/migrate.h 2019-10-16 15:06:58.103952593 -0700 @@ -25,6 +25,7 @@ enum migrate_reason { MR_MEMPOLICY_MBIND, MR_NUMA_MISPLACED, MR_CONTIG_RANGE, + MR_DEMOTION, MR_TYPES }; @@ -79,6 +80,7 @@ extern int migrate_huge_page_move_mappin extern int migrate_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page, enum migrate_mode mode, int extra_count); +extern int migrate_demote_mapping(struct page *page); #else static inline void putback_movable_pages(struct list_head *l) {} @@ -105,6 +107,10 @@ static inline int migrate_huge_page_move return -ENOSYS; } +static inline int migrate_demote_mapping(struct page *page) +{ + return -ENOSYS; +} #endif /* CONFIG_MIGRATION */ #ifdef CONFIG_COMPACTION diff -puN include/trace/events/migrate.h~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h --- a/include/trace/events/migrate.h~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2019-10-16 15:06:58.092952593 -0700 +++ b/include/trace/events/migrate.h 2019-10-16 15:06:58.103952593 -0700 @@ -20,7 +20,8 @@ EM( MR_SYSCALL, "syscall_or_cpuset") \ EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \ EM( MR_NUMA_MISPLACED, "numa_misplaced") \ - EMe(MR_CONTIG_RANGE, "contig_range") + EM( MR_CONTIG_RANGE, "contig_range") \ + EMe(MR_DEMOTION, "demotion") /* * First define the enums in the above macros to be exported to userspace diff -puN 
mm/debug.c~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c --- a/mm/debug.c~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2019-10-16 15:06:58.094952593 -0700 +++ b/mm/debug.c 2019-10-16 15:06:58.103952593 -0700 @@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPE "mempolicy_mbind", "numa_misplaced", "cma", + "demotion", }; const struct trace_print_flags pageflag_names[] = { diff -puN mm/migrate.c~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c --- a/mm/migrate.c~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2019-10-16 15:06:58.097952593 -0700 +++ b/mm/migrate.c 2019-10-16 15:06:58.104952593 -0700 @@ -1119,6 +1119,57 @@ out: return rc; } +static struct page *alloc_demote_node_page(struct page *page, unsigned long node) +{ + /* + * The flags are set to allocate only on the desired node in the + * migration path, and to fail fast if not immediately available. We + * are already doing memory reclaim, we don't want heroic efforts to + * get a page. + */ + gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY | + __GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_MOVABLE; + struct page *newpage; + + if (PageTransHuge(page)) { + mask |= __GFP_COMP; + newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER); + if (newpage) + prep_transhuge_page(newpage); + } else + newpage = alloc_pages_node(node, mask, 0); + + return newpage; +} + +/** + * migrate_demote_mapping() - Migrate this page and its mappings to its + * demotion node. + * @page: A locked, isolated, non-huge page that should migrate to its current + * node's demotion target, if available. Since this is intended to be + * called during memory reclaim, all flag options are set to fail fast. + * + * @returns: MIGRATEPAGE_SUCCESS if successful, -errno otherwise. 
+ */ +int migrate_demote_mapping(struct page *page) +{ + int next_nid = next_migration_node(page_to_nid(page)); + + VM_BUG_ON_PAGE(!PageLocked(page), page); + VM_BUG_ON_PAGE(PageHuge(page), page); + VM_BUG_ON_PAGE(PageLRU(page), page); + + if (next_nid < 0) + return -ENOSYS; + if (PageTransHuge(page) && !thp_migration_supported()) + return -ENOMEM; + + /* MIGRATE_ASYNC is the most light weight and never blocks.*/ + return __unmap_and_move(alloc_demote_node_page, NULL, next_nid, + page, MIGRATE_ASYNC, MR_DEMOTION); +} + + /* * gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move(). Work * around it. diff -puN mm/vmscan.c~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/vmscan.c --- a/mm/vmscan.c~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2019-10-16 15:06:58.099952593 -0700 +++ b/mm/vmscan.c 2019-10-16 15:06:58.105952593 -0700 @@ -1262,6 +1262,33 @@ static unsigned long shrink_page_list(st ; /* try to reclaim the page below */ } + if (!PageHuge(page)) { + int rc = migrate_demote_mapping(page); + + /* + * -ENOMEM on a THP may indicate either migration is + * unsupported or there was not enough contiguous + * space. Split the THP into base pages and retry the + * head immediately. The tail pages will be considered + * individually within the current loop's page list. + */ + if (rc == -ENOMEM && PageTransHuge(page) && + !split_huge_page_to_list(page, page_list)) + rc = migrate_demote_mapping(page); + + if (rc == MIGRATEPAGE_SUCCESS) { + unlock_page(page); + if (likely(put_page_testzero(page))) + goto free_it; + /* + * Speculative reference will free this page, + * so leave it off the LRU. + */ + nr_reclaimed++; + continue; + } + } + /* * Anonymous process memory has backing store? * Try to allocate it some swap space here. 
From patchwork Wed Oct 16 22:11:54 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Hansen X-Patchwork-Id: 11194515 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0471F76 for ; Wed, 16 Oct 2019 22:14:22 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id CFE6D207FF for ; Wed, 16 Oct 2019 22:14:21 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org CFE6D207FF Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 51BA78E0008; Wed, 16 Oct 2019 18:14:18 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 4CF508E0001; Wed, 16 Oct 2019 18:14:18 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 397518E0008; Wed, 16 Oct 2019 18:14:18 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0035.hostedemail.com [216.40.44.35]) by kanga.kvack.org (Postfix) with ESMTP id 10BA88E0001 for ; Wed, 16 Oct 2019 18:14:18 -0400 (EDT) Received: from smtpin04.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with SMTP id B8B35824556B for ; Wed, 16 Oct 2019 22:14:17 +0000 (UTC) X-FDA: 76051052154.04.meal71_850487583c60 X-Spam-Summary: 
1,0,0,,d41d8cd98f00b204,dave.hansen@linux.intel.com,:linux-kernel@vger.kernel.org::dan.j.williams@intel.com:dave.hansen@linux.intel.com:keith.busch@intel.com,RULES_HIT:30004:30054:30064,0,RBL:192.55.52.93:@linux.intel.com:.lbl8.mailshell.net-62.18.0.100 64.95.201.95,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:25,LUA_SUMMARY:none X-HE-Tag: meal71_850487583c60 X-Filterd-Recvd-Size: 4568 Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by imf42.hostedemail.com (Postfix) with ESMTP for ; Wed, 16 Oct 2019 22:14:16 +0000 (UTC) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga002.fm.intel.com ([10.253.24.26]) by fmsmga102.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 16 Oct 2019 15:14:15 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.67,305,1566889200"; d="scan'208";a="225945438" Received: from viggo.jf.intel.com (HELO localhost.localdomain) ([10.54.77.144]) by fmsmga002.fm.intel.com with ESMTP; 16 Oct 2019 15:14:15 -0700 Subject: [PATCH 4/4] mm/vmscan: Consider anonymous pages without swap To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org,dan.j.williams@intel.com,Dave Hansen ,keith.busch@intel.com From: Dave Hansen Date: Wed, 16 Oct 2019 15:11:54 -0700 References: <20191016221148.F9CCD155@viggo.jf.intel.com> In-Reply-To: <20191016221148.F9CCD155@viggo.jf.intel.com> Message-Id: <20191016221154.CDD7064D@viggo.jf.intel.com> X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Keith Busch Age and reclaim anonymous pages if a migration path is available. 
In that case the node has recourse other than swap for inactive anonymous pages. Signed-off-by: Keith Busch Co-developed-by: Dave Hansen Signed-off-by: Dave Hansen --- b/include/linux/swap.h | 20 ++++++++++++++++++++ b/mm/vmscan.c | 10 +++++----- 2 files changed, 25 insertions(+), 5 deletions(-) diff -puN include/linux/swap.h~0006-mm-vmscan-Consider-anonymous-pages-without-swap include/linux/swap.h --- a/include/linux/swap.h~0006-mm-vmscan-Consider-anonymous-pages-without-swap 2019-10-16 15:06:59.474952590 -0700 +++ b/include/linux/swap.h 2019-10-16 15:06:59.481952590 -0700 @@ -680,5 +680,25 @@ static inline bool mem_cgroup_swap_full( } #endif +static inline bool reclaim_anon_pages(struct mem_cgroup *memcg, + int node_id) +{ + /* Always age anon pages when we have swap */ + if (memcg == NULL) { + if (get_nr_swap_pages() > 0) + return true; + } else { + if (mem_cgroup_get_nr_swap_pages(memcg) > 0) + return true; + } + + /* Also age anon pages if we can auto-migrate them */ + if (next_migration_node(node_id) >= 0) + return true; + + /* No way to reclaim anon pages */ + return false; +} + #endif /* __KERNEL__*/ #endif /* _LINUX_SWAP_H */ diff -puN mm/vmscan.c~0006-mm-vmscan-Consider-anonymous-pages-without-swap mm/vmscan.c --- a/mm/vmscan.c~0006-mm-vmscan-Consider-anonymous-pages-without-swap 2019-10-16 15:06:59.477952590 -0700 +++ b/mm/vmscan.c 2019-10-16 15:06:59.482952590 -0700 @@ -327,7 +327,7 @@ unsigned long zone_reclaimable_pages(str nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) + zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE); - if (get_nr_swap_pages() > 0) + if (reclaim_anon_pages(NULL, zone_to_nid(zone))) nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) + zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON); @@ -2166,7 +2166,7 @@ static bool inactive_list_is_low(struct * If we don't have swap space, anonymous page deactivation * is pointless. 
*/ - if (!file && !total_swap_pages) + if (!file && !reclaim_anon_pages(NULL, pgdat->node_id)) return false; inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx); @@ -2241,7 +2241,7 @@ static void get_scan_count(struct lruvec enum lru_list lru; /* If we have no swap space, do not bother scanning anon pages. */ - if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) { + if (!sc->may_swap || !reclaim_anon_pages(memcg, pgdat->node_id)) { scan_balance = SCAN_FILE; goto out; } @@ -2604,7 +2604,7 @@ static inline bool should_continue_recla */ pages_for_compaction = compact_gap(sc->order); inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE); - if (get_nr_swap_pages() > 0) + if (reclaim_anon_pages(NULL, pgdat->node_id)) inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON); if (sc->nr_reclaimed < pages_for_compaction && inactive_lru_pages > pages_for_compaction) @@ -3289,7 +3289,7 @@ static void age_active_anon(struct pglis { struct mem_cgroup *memcg; - if (!total_swap_pages) + if (!reclaim_anon_pages(NULL, pgdat->node_id)) return; memcg = mem_cgroup_iter(NULL, NULL, NULL);
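
The rule this last patch introduces can be modelled in userspace C as follows. This is a simplified sketch rather than the kernel helper: can_reclaim_anon() and reclaimable_pages() are hypothetical names, and the swap count and migration target are passed in as plain arguments instead of being read from kernel state.

```c
#include <assert.h>
#include <stdbool.h>

#define TERMINAL_NODE (-1)

/*
 * Anonymous pages are worth aging if they can go somewhere:
 * either to swap or to the next node in the migration path.
 */
static bool can_reclaim_anon(long free_swap_pages, int migration_target)
{
	assert(migration_target >= TERMINAL_NODE);

	if (free_swap_pages > 0)
		return true;
	if (migration_target != TERMINAL_NODE)
		return true;
	return false;
}

/*
 * Pages a reclaim pass may count on: file pages always, anonymous
 * pages only when the predicate above holds.
 */
static unsigned long reclaimable_pages(unsigned long file_pages,
				       unsigned long anon_pages,
				       long free_swap_pages,
				       int migration_target)
{
	unsigned long nr = file_pages;

	if (can_reclaim_anon(free_swap_pages, migration_target))
		nr += anon_pages;
	return nr;
}
```

Callers that previously tested for "no swap" negate the predicate, while callers that previously tested for "swap available" use it as is.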