From patchwork Mon Jun 29 23:45:05 2020
X-Patchwork-Submitter: Dave Hansen
X-Patchwork-Id: 11632865
Subject: [RFC][PATCH 1/8] mm/numa: node demotion data structure and lookup
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Dave Hansen, yang.shi@linux.alibaba.com,
 rientjes@google.com, ying.huang@intel.com, dan.j.williams@intel.com
From: Dave Hansen
Date: Mon, 29 Jun 2020 16:45:05 -0700
References: <20200629234503.749E5340@viggo.jf.intel.com>
In-Reply-To: <20200629234503.749E5340@viggo.jf.intel.com>
Message-Id: <20200629234505.6ABCBDF4@viggo.jf.intel.com>

From: Dave Hansen

Prepare for the kernel to auto-migrate pages to other memory nodes with a
user-defined node migration table.  This allows creating a single migration
target for each NUMA node, enabling the kernel to do NUMA page migrations
instead of simply reclaiming colder pages.  A node with no target is a
"terminal node", so reclaim acts normally there.

The migration target does not fundamentally _need_ to be a single node, but
this implementation starts there to limit complexity.

If you consider the migration path as a graph, cycles (loops) in the graph
are disallowed.  This avoids wasting resources by constantly migrating
(A->B, B->A, A->B, ...).  The expectation is that cycles will never be
allowed.

Signed-off-by: Dave Hansen
Cc: Yang Shi
Cc: David Rientjes
Cc: Huang Ying
Cc: Dan Williams
---

 b/mm/migrate.c |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff -puN mm/migrate.c~0006-node-Define-and-export-memory-migration-path mm/migrate.c
--- a/mm/migrate.c~0006-node-Define-and-export-memory-migration-path	2020-06-29 16:34:36.849312609 -0700
+++ b/mm/migrate.c	2020-06-29 16:34:36.853312609 -0700
@@ -1159,6 +1159,29 @@ out:
 	return rc;
 }
 
+static int node_demotion[MAX_NUMNODES] = {[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
+
+/**
+ * next_demotion_node() - Get the next node in the demotion path
+ * @node: The starting node to lookup the next node
+ *
+ * @returns: node id for next memory node in the demotion path hierarchy
+ * from @node; -1 if @node is terminal
+ */
+int next_demotion_node(int node)
+{
+	get_online_mems();
+	while (true) {
+		node = node_demotion[node];
+		if (node == NUMA_NO_NODE)
+			break;
+		if (node_online(node))
+			break;
+	}
+	put_online_mems();
+	return node;
+}
+
 /*
  * gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move(). Work
  * around it.
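[Editor's illustration] To make the lookup semantics concrete, here is a minimal,
self-contained userspace sketch of the same table walk.  It is not kernel code:
the table contents, the eight-node count, the "node 2 is offline" rule, and the
omission of the get_online_mems()/put_online_mems() hotplug protection are all
assumptions made for the example (build with GCC or Clang, since the range
initializer is a GNU extension).

#include <stdio.h>
#include <stdbool.h>

#define MAX_NUMNODES	8
#define NUMA_NO_NODE	(-1)

/* Example demotion table: 0 -> 2 -> 3, 1 -> 2; nodes 3..7 are terminal. */
static int node_demotion[MAX_NUMNODES] = {
	[0] = 2, [1] = 2, [2] = 3,
	[3 ... MAX_NUMNODES - 1] = NUMA_NO_NODE,
};

/* Pretend node 2 is offline so the walk has to skip over it. */
static bool node_online(int node)
{
	return node != 2;
}

/* Walk the table until an online target or a terminal node is found. */
static int next_demotion_node(int node)
{
	while (true) {
		node = node_demotion[node];
		if (node == NUMA_NO_NODE)
			break;		/* terminal node: no demotion target */
		if (node_online(node))
			break;		/* usable target found */
	}
	return node;
}

int main(void)
{
	for (int node = 0; node < 4; node++)
		printf("node %d demotes to %d\n", node, next_demotion_node(node));
	return 0;
}

A node whose entry is NUMA_NO_NODE comes back as -1, i.e. a terminal node that
falls back to normal reclaim, matching the description above.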
From patchwork Mon Jun 29 23:45:07 2020
X-Patchwork-Submitter: Dave Hansen
X-Patchwork-Id: 11632859
Subject: [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed
To:
linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org,Dave Hansen ,kbusch@kernel.org,yang.shi@linux.alibaba.com,rientjes@google.com,ying.huang@intel.com,dan.j.williams@intel.com From: Dave Hansen Date: Mon, 29 Jun 2020 16:45:07 -0700 References: <20200629234503.749E5340@viggo.jf.intel.com> In-Reply-To: <20200629234503.749E5340@viggo.jf.intel.com> Message-Id: <20200629234507.CA0FDE19@viggo.jf.intel.com> X-Rspamd-Queue-Id: 4446318229818 X-Spamd-Result: default: False [0.00 / 100.00] X-Rspamd-Server: rspam03 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Keith Busch Migrating pages had been allocating the new page before it was actually needed. Subsequent operations may still fail, which would have to handle cleaning up the newly allocated page when it was never used. Defer allocating the page until we are actually ready to make use of it, after locking the original page. This simplifies error handling, but should not have any functional change in behavior. This is just refactoring page migration so the main part can more easily be reused by other code. #Signed-off-by: Keith Busch Signed-off-by: Dave Hansen Cc: Keith Busch Cc: Yang Shi Cc: David Rientjes Cc: Huang Ying Cc: Dan Williams --- b/mm/migrate.c | 148 ++++++++++++++++++++++++++++----------------------------- 1 file changed, 75 insertions(+), 73 deletions(-) diff -puN mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed mm/migrate.c --- a/mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed 2020-06-29 16:34:37.896312607 -0700 +++ b/mm/migrate.c 2020-06-29 16:34:37.900312607 -0700 @@ -1014,56 +1014,17 @@ out: return rc; } -static int __unmap_and_move(struct page *page, struct page *newpage, - int force, enum migrate_mode mode) +static int __unmap_and_move(new_page_t get_new_page, + free_page_t put_new_page, + unsigned long private, struct page *page, + enum migrate_mode mode, + enum migrate_reason reason) { int rc = -EAGAIN; int page_was_mapped = 0; struct anon_vma *anon_vma = NULL; bool is_lru = !__PageMovable(page); - - if (!trylock_page(page)) { - if (!force || mode == MIGRATE_ASYNC) - goto out; - - /* - * It's not safe for direct compaction to call lock_page. - * For example, during page readahead pages are added locked - * to the LRU. Later, when the IO completes the pages are - * marked uptodate and unlocked. However, the queueing - * could be merging multiple pages for one bio (e.g. - * mpage_readpages). If an allocation happens for the - * second or third page, the process can end up locking - * the same page twice and deadlocking. Rather than - * trying to be clever about what pages can be locked, - * avoid the use of lock_page for direct compaction - * altogether. - */ - if (current->flags & PF_MEMALLOC) - goto out; - - lock_page(page); - } - - if (PageWriteback(page)) { - /* - * Only in the case of a full synchronous migration is it - * necessary to wait for PageWriteback. In the async case, - * the retry loop is too short and in the sync-light case, - * the overhead of stalling is too much - */ - switch (mode) { - case MIGRATE_SYNC: - case MIGRATE_SYNC_NO_COPY: - break; - default: - rc = -EBUSY; - goto out_unlock; - } - if (!force) - goto out_unlock; - wait_on_page_writeback(page); - } + struct page *newpage; /* * By try_to_unmap(), page->mapcount goes down to 0 here. 
In this case, @@ -1082,6 +1043,12 @@ static int __unmap_and_move(struct page if (PageAnon(page) && !PageKsm(page)) anon_vma = page_get_anon_vma(page); + newpage = get_new_page(page, private); + if (!newpage) { + rc = -ENOMEM; + goto out; + } + /* * Block others from accessing the new page when we get around to * establishing additional references. We are usually the only one @@ -1091,11 +1058,11 @@ static int __unmap_and_move(struct page * This is much like races on refcount of oldpage: just don't BUG(). */ if (unlikely(!trylock_page(newpage))) - goto out_unlock; + goto out_put; if (unlikely(!is_lru)) { rc = move_to_new_page(newpage, page, mode); - goto out_unlock_both; + goto out_unlock; } /* @@ -1114,7 +1081,7 @@ static int __unmap_and_move(struct page VM_BUG_ON_PAGE(PageAnon(page), page); if (page_has_private(page)) { try_to_free_buffers(page); - goto out_unlock_both; + goto out_unlock; } } else if (page_mapped(page)) { /* Establish migration ptes */ @@ -1131,15 +1098,9 @@ static int __unmap_and_move(struct page if (page_was_mapped) remove_migration_ptes(page, rc == MIGRATEPAGE_SUCCESS ? newpage : page, false); - -out_unlock_both: - unlock_page(newpage); out_unlock: - /* Drop an anon_vma reference if we took one */ - if (anon_vma) - put_anon_vma(anon_vma); - unlock_page(page); -out: + unlock_page(newpage); +out_put: /* * If migration is successful, decrease refcount of the newpage * which will not free the page because new page owner increased @@ -1150,12 +1111,20 @@ out: * state. */ if (rc == MIGRATEPAGE_SUCCESS) { + set_page_owner_migrate_reason(newpage, reason); if (unlikely(!is_lru)) put_page(newpage); else putback_lru_page(newpage); + } else if (put_new_page) { + put_new_page(newpage, private); + } else { + put_page(newpage); } - +out: + /* Drop an anon_vma reference if we took one */ + if (anon_vma) + put_anon_vma(anon_vma); return rc; } @@ -1203,8 +1172,7 @@ static ICE_noinline int unmap_and_move(n int force, enum migrate_mode mode, enum migrate_reason reason) { - int rc = MIGRATEPAGE_SUCCESS; - struct page *newpage = NULL; + int rc = -EAGAIN; if (!thp_migration_supported() && PageTransHuge(page)) return -ENOMEM; @@ -1219,17 +1187,57 @@ static ICE_noinline int unmap_and_move(n __ClearPageIsolated(page); unlock_page(page); } + rc = MIGRATEPAGE_SUCCESS; goto out; } - newpage = get_new_page(page, private); - if (!newpage) - return -ENOMEM; + if (!trylock_page(page)) { + if (!force || mode == MIGRATE_ASYNC) + return rc; - rc = __unmap_and_move(page, newpage, force, mode); - if (rc == MIGRATEPAGE_SUCCESS) - set_page_owner_migrate_reason(newpage, reason); + /* + * It's not safe for direct compaction to call lock_page. + * For example, during page readahead pages are added locked + * to the LRU. Later, when the IO completes the pages are + * marked uptodate and unlocked. However, the queueing + * could be merging multiple pages for one bio (e.g. + * mpage_readpages). If an allocation happens for the + * second or third page, the process can end up locking + * the same page twice and deadlocking. Rather than + * trying to be clever about what pages can be locked, + * avoid the use of lock_page for direct compaction + * altogether. + */ + if (current->flags & PF_MEMALLOC) + return rc; + + lock_page(page); + } + + if (PageWriteback(page)) { + /* + * Only in the case of a full synchronous migration is it + * necessary to wait for PageWriteback. 
In the async case, + * the retry loop is too short and in the sync-light case, + * the overhead of stalling is too much + */ + switch (mode) { + case MIGRATE_SYNC: + case MIGRATE_SYNC_NO_COPY: + break; + default: + rc = -EBUSY; + goto out_unlock; + } + if (!force) + goto out_unlock; + wait_on_page_writeback(page); + } + rc = __unmap_and_move(get_new_page, put_new_page, private, + page, mode, reason); +out_unlock: + unlock_page(page); out: if (rc != -EAGAIN) { /* @@ -1269,9 +1277,8 @@ out: if (rc != -EAGAIN) { if (likely(!__PageMovable(page))) { putback_lru_page(page); - goto put_new; + goto done; } - lock_page(page); if (PageMovable(page)) putback_movable_page(page); @@ -1280,13 +1287,8 @@ out: unlock_page(page); put_page(page); } -put_new: - if (put_new_page) - put_new_page(newpage, private); - else - put_page(newpage); } - +done: return rc; } From patchwork Mon Jun 29 23:45:09 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Hansen X-Patchwork-Id: 11632857 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E1D1C6C1 for ; Mon, 29 Jun 2020 23:48:35 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id ABC9A20780 for ; Mon, 29 Jun 2020 23:48:35 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org ABC9A20780 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id D24B28D0019; Mon, 29 Jun 2020 19:48:34 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id CD4C96B0089; Mon, 29 Jun 2020 19:48:34 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id C119B8D0019; Mon, 29 Jun 2020 19:48:34 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0120.hostedemail.com [216.40.44.120]) by kanga.kvack.org (Postfix) with ESMTP id ABCDD6B0085 for ; Mon, 29 Jun 2020 19:48:34 -0400 (EDT) Received: from smtpin14.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id 74F21181AC9C6 for ; Mon, 29 Jun 2020 23:48:34 +0000 (UTC) X-FDA: 76983891348.14.badge58_2b1394126e73 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin14.hostedemail.com (Postfix) with ESMTP id 45ABD18229818 for ; Mon, 29 Jun 2020 23:48:34 +0000 (UTC) X-Spam-Summary: 1,0,0,,d41d8cd98f00b204,dave.hansen@linux.intel.com,,RULES_HIT:30012:30054:30062:30064:30070:30090:30091,0,RBL:192.55.52.151:@linux.intel.com:.lbl8.mailshell.net-64.95.201.95 62.18.0.100;04ygbexiw7a8j5hsqa7fomp388gtnocbbzqx4dztmjfezn1nyr91p81388dq543.pxei45uye9ya4xba1udknwgeu1fags6wk3pode4e7tqscedx9w181hc7sae3e7y.a-lbl8.mailshell.net-223.238.255.100,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:none,Custom_rules:0:0:0,LFtime:23,LUA_SUMMARY:none X-HE-Tag: badge58_2b1394126e73 X-Filterd-Recvd-Size: 9384 Received: from mga17.intel.com (mga17.intel.com [192.55.52.151]) by imf20.hostedemail.com (Postfix) with ESMTP for ; Mon, 29 Jun 2020 23:48:33 +0000 (UTC) IronPort-SDR: 
Subject: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Dave Hansen, kbusch@kernel.org, yang.shi@linux.alibaba.com,
 rientjes@google.com, ying.huang@intel.com, dan.j.williams@intel.com
From: Dave Hansen
Date: Mon, 29 Jun 2020 16:45:09 -0700
References: <20200629234503.749E5340@viggo.jf.intel.com>
In-Reply-To: <20200629234503.749E5340@viggo.jf.intel.com>
Message-Id: <20200629234509.8F89C4EF@viggo.jf.intel.com>

From: Dave Hansen

If a memory node has a preferred migration path to demote cold pages, attempt
to move those inactive pages to that migration node before reclaiming.  This
will better utilize available memory, provide a faster tier than swapping or
discarding, and allow such pages to be reused immediately without IO to
retrieve the data.

When handling anonymous pages, this will be considered before swap if enabled.
Should the demotion fail for any reason, page reclaim proceeds as if the
demotion feature were not enabled.

Some places we would like to see this used:

  1. Persistent memory being used as a slower, cheaper DRAM replacement
  2. Remote memory-only "expansion" NUMA nodes
  3. Resolving memory imbalances where one NUMA node is seeing more
     allocation activity than another.  This helps keep more recent
     allocations closer to the CPUs on the node doing the allocating.

Yang Shi's patches used an alternative approach where to-be-discarded pages
were collected on a separate discard list and then discarded as a batch with
migrate_pages().  This results in simpler code and has all the performance
advantages of batching, but has the disadvantage that pages which fail to
migrate never get swapped.
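[Editor's illustration] The control flow described above, try demotion first
and fall back to ordinary reclaim only if it fails, can be sketched in plain C.
This is an illustration of the idea, not the kernel implementation: the page
structure, the demotion and "swap or discard" helpers, their return codes, and
the node-0-demotes-to-node-1 topology are stand-ins invented for the example.

#include <stdio.h>

#define MIGRATEPAGE_SUCCESS	0
#define NUMA_NO_NODE		(-1)

struct fake_page { int nid; int demotable; };

/* Stand-in for next_demotion_node(): only node 0 has a target (node 1). */
static int next_demotion_node(int node)
{
	return node == 0 ? 1 : NUMA_NO_NODE;
}

/* Stand-in for migrate_demote_mapping(): succeed only if marked demotable. */
static int migrate_demote_mapping(struct fake_page *page)
{
	if (next_demotion_node(page->nid) == NUMA_NO_NODE)
		return -1;			/* terminal node: no target */
	return page->demotable ? MIGRATEPAGE_SUCCESS : -1;
}

static void reclaim_one(struct fake_page *page)
{
	if (migrate_demote_mapping(page) == MIGRATEPAGE_SUCCESS) {
		printf("page on node %d demoted to node %d\n",
		       page->nid, next_demotion_node(page->nid));
		return;
	}
	/* Demotion failed or unavailable: reclaim proceeds as before. */
	printf("page on node %d swapped or discarded\n", page->nid);
}

int main(void)
{
	struct fake_page a = { .nid = 0, .demotable = 1 };
	struct fake_page b = { .nid = 1, .demotable = 1 };	/* terminal node */

	reclaim_one(&a);
	reclaim_one(&b);
	return 0;
}

The key property, per the commit message, is that any demotion failure leaves
the normal reclaim path untouched.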
#Signed-off-by: Keith Busch Signed-off-by: Dave Hansen Cc: Keith Busch Cc: Yang Shi Cc: David Rientjes Cc: Huang Ying Cc: Dan Williams --- b/include/linux/migrate.h | 6 ++++ b/include/trace/events/migrate.h | 3 +- b/mm/debug.c | 1 b/mm/migrate.c | 52 +++++++++++++++++++++++++++++++++++++++ b/mm/vmscan.c | 25 ++++++++++++++++++ 5 files changed, 86 insertions(+), 1 deletion(-) diff -puN include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h --- a/include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.950312604 -0700 +++ b/include/linux/migrate.h 2020-06-29 16:34:38.963312604 -0700 @@ -25,6 +25,7 @@ enum migrate_reason { MR_MEMPOLICY_MBIND, MR_NUMA_MISPLACED, MR_CONTIG_RANGE, + MR_DEMOTION, MR_TYPES }; @@ -78,6 +79,7 @@ extern int migrate_huge_page_move_mappin struct page *newpage, struct page *page); extern int migrate_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page, int extra_count); +extern int migrate_demote_mapping(struct page *page); #else static inline void putback_movable_pages(struct list_head *l) {} @@ -104,6 +106,10 @@ static inline int migrate_huge_page_move return -ENOSYS; } +static inline int migrate_demote_mapping(struct page *page) +{ + return -ENOSYS; +} #endif /* CONFIG_MIGRATION */ #ifdef CONFIG_COMPACTION diff -puN include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h --- a/include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.952312604 -0700 +++ b/include/trace/events/migrate.h 2020-06-29 16:34:38.963312604 -0700 @@ -20,7 +20,8 @@ EM( MR_SYSCALL, "syscall_or_cpuset") \ EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \ EM( MR_NUMA_MISPLACED, "numa_misplaced") \ - EMe(MR_CONTIG_RANGE, "contig_range") + EM( MR_CONTIG_RANGE, "contig_range") \ + EMe(MR_DEMOTION, "demotion") /* * First define the enums in the above macros to be exported to userspace diff -puN mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c --- a/mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.954312604 -0700 +++ b/mm/debug.c 2020-06-29 16:34:38.963312604 -0700 @@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPE "mempolicy_mbind", "numa_misplaced", "cma", + "demotion", }; const struct trace_print_flags pageflag_names[] = { diff -puN mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c --- a/mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.956312604 -0700 +++ b/mm/migrate.c 2020-06-29 16:34:38.964312604 -0700 @@ -1151,6 +1151,58 @@ int next_demotion_node(int node) return node; } +static struct page *alloc_demote_node_page(struct page *page, unsigned long node) +{ + /* + * 'mask' targets allocation only to the desired node in the + * migration path, and fails fast if the allocation can not be + * immediately satisfied. Reclaim is already active and heroic + * allocation efforts are unwanted. 
+ */ + gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY | + __GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM | + __GFP_MOVABLE; + struct page *newpage; + + if (PageTransHuge(page)) { + mask |= __GFP_COMP; + newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER); + if (newpage) + prep_transhuge_page(newpage); + } else + newpage = alloc_pages_node(node, mask, 0); + + return newpage; +} + +/** + * migrate_demote_mapping() - Migrate this page and its mappings to its + * demotion node. + * @page: A locked, isolated, non-huge page that should migrate to its current + * node's demotion target, if available. Since this is intended to be + * called during memory reclaim, all flag options are set to fail fast. + * + * @returns: MIGRATEPAGE_SUCCESS if successful, -errno otherwise. + */ +int migrate_demote_mapping(struct page *page) +{ + int next_nid = next_demotion_node(page_to_nid(page)); + + VM_BUG_ON_PAGE(!PageLocked(page), page); + VM_BUG_ON_PAGE(PageHuge(page), page); + VM_BUG_ON_PAGE(PageLRU(page), page); + + if (next_nid == NUMA_NO_NODE) + return -ENOSYS; + if (PageTransHuge(page) && !thp_migration_supported()) + return -ENOMEM; + + /* MIGRATE_ASYNC is the most light weight and never blocks.*/ + return __unmap_and_move(alloc_demote_node_page, NULL, next_nid, + page, MIGRATE_ASYNC, MR_DEMOTION); +} + + /* * gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move(). Work * around it. diff -puN mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/vmscan.c --- a/mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.959312604 -0700 +++ b/mm/vmscan.c 2020-06-29 16:34:38.965312604 -0700 @@ -1077,6 +1077,7 @@ static unsigned long shrink_page_list(st LIST_HEAD(free_pages); unsigned nr_reclaimed = 0; unsigned pgactivate = 0; + int rc; memset(stat, 0, sizeof(*stat)); cond_resched(); @@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st ; /* try to reclaim the page below */ } + rc = migrate_demote_mapping(page); + /* + * -ENOMEM on a THP may indicate either migration is + * unsupported or there was not enough contiguous + * space. Split the THP into base pages and retry the + * head immediately. The tail pages will be considered + * individually within the current loop's page list. + */ + if (rc == -ENOMEM && PageTransHuge(page) && + !split_huge_page_to_list(page, page_list)) + rc = migrate_demote_mapping(page); + + if (rc == MIGRATEPAGE_SUCCESS) { + unlock_page(page); + if (likely(put_page_testzero(page))) + goto free_it; + /* + * Speculative reference will free this page, + * so leave it off the LRU. + */ + nr_reclaimed++; + continue; + } + /* * Anonymous process memory has backing store? * Try to allocate it some swap space here. 
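[Editor's illustration] The shrink_page_list() hunk above retries a failed THP
demotion after splitting the huge page.  In isolation, that retry logic looks
roughly like the userspace sketch below.  The helpers are stand-ins, not the
kernel's: split_huge_page() here merely clears a flag instead of producing tail
pages, and the "only base pages can demote" rule is an assumption made so the
example has something to retry.

#include <stdio.h>
#include <errno.h>

#define MIGRATEPAGE_SUCCESS	0

struct fake_page { int is_thp; };

/* Stand-in: pretend only base (non-THP) pages can be demoted. */
static int migrate_demote_mapping(struct fake_page *page)
{
	return page->is_thp ? -ENOMEM : MIGRATEPAGE_SUCCESS;
}

/* Stand-in for split_huge_page_to_list(); returns 0 on success. */
static int split_huge_page(struct fake_page *page)
{
	page->is_thp = 0;
	return 0;
}

static int try_demote(struct fake_page *page)
{
	int rc = migrate_demote_mapping(page);

	/*
	 * -ENOMEM on a THP may mean THP migration is unsupported or no
	 * contiguous target memory was available: split and retry the head.
	 */
	if (rc == -ENOMEM && page->is_thp && !split_huge_page(page))
		rc = migrate_demote_mapping(page);

	return rc;
}

int main(void)
{
	struct fake_page thp = { .is_thp = 1 };

	printf("demotion %s after split\n",
	       try_demote(&thp) == MIGRATEPAGE_SUCCESS ? "succeeded" : "failed");
	return 0;
}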
From patchwork Mon Jun 29 23:45:10 2020
X-Patchwork-Submitter: Dave Hansen
X-Patchwork-Id: 11632861
Subject: [RFC][PATCH 4/8] mm/vmscan: add page demotion counter
To: linux-kernel@vger.kernel.org
Cc:
linux-mm@kvack.org,Dave Hansen ,yang.shi@linux.alibaba.com,rientjes@google.com,ying.huang@intel.com,dan.j.williams@intel.com From: Dave Hansen Date: Mon, 29 Jun 2020 16:45:10 -0700 References: <20200629234503.749E5340@viggo.jf.intel.com> In-Reply-To: <20200629234503.749E5340@viggo.jf.intel.com> Message-Id: <20200629234510.1BF23254@viggo.jf.intel.com> X-Rspamd-Queue-Id: 942E8180C07AF X-Spamd-Result: default: False [0.00 / 100.00] X-Rspamd-Server: rspam04 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Yang Shi Account the number of demoted pages into reclaim_state->nr_demoted. Add pgdemote_kswapd and pgdemote_direct VM counters showed in /proc/vmstat. [ daveh: - __count_vm_events() a bit, and made them look at the THP size directly rather than getting data from migrate_pages() ] Signed-off-by: Yang Shi Signed-off-by: Dave Hansen Cc: David Rientjes Cc: Huang Ying Cc: Dan Williams --- b/include/linux/vm_event_item.h | 2 ++ b/mm/migrate.c | 13 ++++++++++++- b/mm/vmscan.c | 1 + b/mm/vmstat.c | 2 ++ 4 files changed, 17 insertions(+), 1 deletion(-) diff -puN include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter include/linux/vm_event_item.h --- a/include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter 2020-06-29 16:34:40.332312601 -0700 +++ b/include/linux/vm_event_item.h 2020-06-29 16:34:40.342312601 -0700 @@ -32,6 +32,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS PGREFILL, PGSTEAL_KSWAPD, PGSTEAL_DIRECT, + PGDEMOTE_KSWAPD, + PGDEMOTE_DIRECT, PGSCAN_KSWAPD, PGSCAN_DIRECT, PGSCAN_DIRECT_THROTTLE, diff -puN mm/migrate.c~mm-vmscan-add-page-demotion-counter mm/migrate.c --- a/mm/migrate.c~mm-vmscan-add-page-demotion-counter 2020-06-29 16:34:40.334312601 -0700 +++ b/mm/migrate.c 2020-06-29 16:34:40.343312601 -0700 @@ -1187,6 +1187,7 @@ static struct page *alloc_demote_node_pa int migrate_demote_mapping(struct page *page) { int next_nid = next_demotion_node(page_to_nid(page)); + int ret; VM_BUG_ON_PAGE(!PageLocked(page), page); VM_BUG_ON_PAGE(PageHuge(page), page); @@ -1198,8 +1199,18 @@ int migrate_demote_mapping(struct page * return -ENOMEM; /* MIGRATE_ASYNC is the most light weight and never blocks.*/ - return __unmap_and_move(alloc_demote_node_page, NULL, next_nid, + ret = __unmap_and_move(alloc_demote_node_page, NULL, next_nid, page, MIGRATE_ASYNC, MR_DEMOTION); + + if (ret == MIGRATEPAGE_SUCCESS) { + int nr_demoted = hpage_nr_pages(page); + if (current_is_kswapd()) + __count_vm_events(PGDEMOTE_KSWAPD, nr_demoted); + else + __count_vm_events(PGDEMOTE_DIRECT, nr_demoted); + } + + return ret; } diff -puN mm/vmscan.c~mm-vmscan-add-page-demotion-counter mm/vmscan.c --- a/mm/vmscan.c~mm-vmscan-add-page-demotion-counter 2020-06-29 16:34:40.336312601 -0700 +++ b/mm/vmscan.c 2020-06-29 16:34:40.344312601 -0700 @@ -140,6 +140,7 @@ struct scan_control { unsigned int immediate; unsigned int file_taken; unsigned int taken; + unsigned int demoted; } nr; /* for recording the reclaimed slab by now */ diff -puN mm/vmstat.c~mm-vmscan-add-page-demotion-counter mm/vmstat.c --- a/mm/vmstat.c~mm-vmscan-add-page-demotion-counter 2020-06-29 16:34:40.339312601 -0700 +++ b/mm/vmstat.c 2020-06-29 16:34:40.345312601 -0700 @@ -1198,6 +1198,8 @@ const char * const vmstat_text[] = { "pgrefill", "pgsteal_kswapd", "pgsteal_direct", + "pgdemote_kswapd", + "pgdemote_direct", "pgscan_kswapd", "pgscan_direct", "pgscan_direct_throttle", From patchwork Mon Jun 29 23:45:12 2020 Content-Type: 
X-Patchwork-Submitter: Dave Hansen
X-Patchwork-Id: 11632863
Subject: [RFC][PATCH 5/8] mm/numa: automatically generate node migration order
To: linux-kernel@vger.kernel.org
Cc:
linux-mm@kvack.org,Dave Hansen ,yang.shi@linux.alibaba.com,rientjes@google.com,ying.huang@intel.com,dan.j.williams@intel.com From: Dave Hansen Date: Mon, 29 Jun 2020 16:45:12 -0700 References: <20200629234503.749E5340@viggo.jf.intel.com> In-Reply-To: <20200629234503.749E5340@viggo.jf.intel.com> Message-Id: <20200629234512.F34EDC44@viggo.jf.intel.com> X-Rspamd-Queue-Id: 3CDBF180442C2 X-Spamd-Result: default: False [0.00 / 100.00] X-Rspamd-Server: rspam02 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Hansen When memory fills up on a node, memory contents can be automatically migrated to another node. The biggest problems are knowing when to migrate and to where the migration should be targeted. The most straightforward way to generate the "to where" list would be to follow the page allocator fallback lists. Those lists already tell us if memory is full where to look next. It would also be logical to move memory in that order. But, the allocator fallback lists have a fatal flaw: most nodes appear in all the lists. This would potentially lead to migration cycles (A->B, B->A, A->B, ...). Instead of using the allocator fallback lists directly, keep a separate node migration ordering. But, reuse the same data used to generate page allocator fallback in the first place: find_next_best_node(). This means that the firmware data used to populate node distances essentially dictates the ordering for now. It should also be architecture-neutral since all NUMA architectures have a working find_next_best_node(). The protocol for node_demotion[] access and writing is not standard. It has no specific locking and is intended to be read locklessly. Readers must take care to avoid observing changes that appear incoherent. This was done so that node_demotion[] locking has no chance of becoming a bottleneck on large systems with lots of CPUs in direct reclaim. This code is unused for now. It will be called later in the series. Signed-off-by: Dave Hansen Cc: Yang Shi Cc: David Rientjes Cc: Huang Ying Cc: Dan Williams --- b/mm/internal.h | 1 b/mm/migrate.c | 130 +++++++++++++++++++++++++++++++++++++++++++++++++++++- b/mm/page_alloc.c | 2 3 files changed, 131 insertions(+), 2 deletions(-) diff -puN mm/internal.h~auto-setup-default-migration-path-from-firmware mm/internal.h --- a/mm/internal.h~auto-setup-default-migration-path-from-firmware 2020-06-29 16:34:41.629312597 -0700 +++ b/mm/internal.h 2020-06-29 16:34:41.638312597 -0700 @@ -192,6 +192,7 @@ extern int user_min_free_kbytes; extern void zone_pcp_update(struct zone *zone); extern void zone_pcp_reset(struct zone *zone); +extern int find_next_best_node(int node, nodemask_t *used_node_mask); #if defined CONFIG_COMPACTION || defined CONFIG_CMA diff -puN mm/migrate.c~auto-setup-default-migration-path-from-firmware mm/migrate.c --- a/mm/migrate.c~auto-setup-default-migration-path-from-firmware 2020-06-29 16:34:41.631312597 -0700 +++ b/mm/migrate.c 2020-06-29 16:34:41.639312597 -0700 @@ -1128,6 +1128,10 @@ out: return rc; } +/* + * Writes to this array occur without locking. READ_ONCE() + * is recommended for readers. + */ static int node_demotion[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE}; /** @@ -1141,7 +1145,13 @@ int next_demotion_node(int node) { get_online_mems(); while (true) { - node = node_demotion[node]; + /* + * node_demotion[] is updated without excluding + * this function from running. 
READ_ONCE() avoids + * 'node' checks reading different values from + * node_demotion[]. + */ + node = READ_ONCE(node_demotion[node]); if (node == NUMA_NO_NODE) break; if (node_online(node)) @@ -3086,3 +3096,121 @@ void migrate_vma_finalize(struct migrate } EXPORT_SYMBOL(migrate_vma_finalize); #endif /* CONFIG_DEVICE_PRIVATE */ + +/* Disable reclaim-based migration. */ +static void disable_all_migrate_targets(void) +{ + int node; + + for_each_online_node(node) + node_demotion[node] = NUMA_NO_NODE; +} + +/* + * Find an automatic demotion target for 'node'. + * Failing here is OK. It might just indicate + * being at the end of a chain. + */ +static int establish_migrate_target(int node, nodemask_t *used) +{ + int migration_target; + + /* + * Can not set a migration target on a + * node with it already set. + * + * No need for READ_ONCE() here since this + * in the write path for node_demotion[]. + * This should be the only thread writing. + */ + if (node_demotion[node] != NUMA_NO_NODE) + return NUMA_NO_NODE; + + migration_target = find_next_best_node(node, used); + if (migration_target == NUMA_NO_NODE) + return NUMA_NO_NODE; + + node_demotion[node] = migration_target; + + return migration_target; +} + +/* + * When memory fills up on a node, memory contents can be + * automatically migrated to another node instead of + * discarded at reclaim. + * + * Establish a "migration path" which will start at nodes + * with CPUs and will follow the priorities used to build the + * page allocator zonelists. + * + * The difference here is that cycles must be avoided. If + * node0 migrates to node1, then neither node1, nor anything + * node1 migrates to can migrate to node0. + * + * This function can run simultaneously with readers of + * node_demotion[]. However, it can not run simultaneously + * with itself. Exclusion is provided by memory hotplug events + * being single-threaded. + */ +void set_migration_target_nodes(void) +{ + nodemask_t next_pass = NODE_MASK_NONE; + nodemask_t this_pass = NODE_MASK_NONE; + nodemask_t used_targets = NODE_MASK_NONE; + int node; + + get_online_mems(); + /* + * Avoid any oddities like cycles that could occur + * from changes in the topology. This will leave + * a momentary gap when migration is disabled. + */ + disable_all_migrate_targets(); + + /* + * Ensure that the "disable" is visible across the system. + * Readers will see either a combination of before+disable + * state or disable+after. They will never see before and + * after state together. + * + * The before+after state together might have cycles and + * could cause readers to do things like loop until this + * function finishes. This ensures they can only see a + * single "bad" read and would, for instance, only loop + * once. + */ + smp_wmb(); + + /* + * Allocations go close to CPUs, first. Assume that + * the migration path starts at the nodes with CPUs. + */ + next_pass = node_states[N_CPU]; +again: + this_pass = next_pass; + next_pass = NODE_MASK_NONE; + /* + * To avoid cycles in the migration "graph", ensure + * that migration sources are not future targets by + * setting them in 'used_targets'. + * + * But, do this only once per pass so that multiple + * source nodes can share a target node. 
+ */ + nodes_or(used_targets, used_targets, this_pass); + for_each_node_mask(node, this_pass) { + int target_node = establish_migrate_target(node, &used_targets); + + if (target_node == NUMA_NO_NODE) + continue; + + /* Visit targets from this pass in the next pass: */ + node_set(target_node, next_pass); + } + /* Is another pass necessary? */ + if (!nodes_empty(next_pass)) + goto again; + + put_online_mems(); +} diff -puN mm/page_alloc.c~auto-setup-default-migration-path-from-firmware mm/page_alloc.c --- a/mm/page_alloc.c~auto-setup-default-migration-path-from-firmware 2020-06-29 16:34:41.634312597 -0700 +++ b/mm/page_alloc.c 2020-06-29 16:34:41.641312597 -0700 @@ -5591,7 +5591,7 @@ static int node_load[MAX_NUMNODES]; * * Return: node id of the found node or %NUMA_NO_NODE if no node is found. */ -static int find_next_best_node(int node, nodemask_t *used_node_mask) +int find_next_best_node(int node, nodemask_t *used_node_mask) { int n, val; int min_val = INT_MAX; From patchwork Mon Jun 29 23:45:14 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Hansen X-Patchwork-Id: 11632867 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id D4A346C1 for ; Mon, 29 Jun 2020 23:48:46 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id ABE5220780 for ; Mon, 29 Jun 2020 23:48:46 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org ABE5220780 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id A40038D001E; Mon, 29 Jun 2020 19:48:40 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id A18018D001D; Mon, 29 Jun 2020 19:48:40 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9532E8D001E; Mon, 29 Jun 2020 19:48:40 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0057.hostedemail.com [216.40.44.57]) by kanga.kvack.org (Postfix) with ESMTP id 7A0DA8D001D for ; Mon, 29 Jun 2020 19:48:40 -0400 (EDT) Received: from smtpin17.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id 3AA8D1EE6 for ; Mon, 29 Jun 2020 23:48:40 +0000 (UTC) X-FDA: 76983891600.17.order37_3a0c1f726e73 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin17.hostedemail.com (Postfix) with ESMTP id 188FD180D0180 for ; Mon, 29 Jun 2020 23:48:40 +0000 (UTC) X-Spam-Summary: 1,0,0,,d41d8cd98f00b204,dave.hansen@linux.intel.com,,RULES_HIT:30004:30054:30064,0,RBL:134.134.136.126:@linux.intel.com:.lbl8.mailshell.net-64.95.201.95 62.18.0.100;04y8tfwi784g8iqzpd47ymieg4bfoopggxr7d17brcpfk91ks3s48a17ysmmdog.xgjd48w7pad43xceemh1bki9y3rjfpppfxiiujg4kgcsqwc3fb1m73d3gbeaj7a.o-lbl8.mailshell.net-223.238.255.100,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:none,Custom_rules:0:0:0,LFtime:25,LUA_SUMMARY:none X-HE-Tag: order37_3a0c1f726e73 X-Filterd-Recvd-Size: 5967 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by 
imf02.hostedemail.com (Postfix) with ESMTP for ; Mon, 29 Jun 2020 23:48:39 +0000 (UTC) IronPort-SDR: iCzRsYRvVHN7eibIKuKW+UrDSkea4p3S+luAjYXFQYkk0FjwfRJBXApjysItnGqnXLzYXzxzrc 8sMeleIQWrhQ== X-IronPort-AV: E=McAfee;i="6000,8403,9666"; a="133543888" X-IronPort-AV: E=Sophos;i="5.75,296,1589266800"; d="scan'208";a="133543888" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 29 Jun 2020 16:48:38 -0700 IronPort-SDR: /t+O+/a6MlfHp1Yhr63Ua5EIAvxtCgHMVPfSkKC9m/M6gZKNX+1/r3YK55jrVWUvxVE2tUMyxg h20fXVfiF8/Q== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.75,296,1589266800"; d="scan'208";a="266329045" Received: from viggo.jf.intel.com (HELO localhost.localdomain) ([10.54.77.144]) by fmsmga008.fm.intel.com with ESMTP; 29 Jun 2020 16:48:37 -0700 Subject: [RFC][PATCH 6/8] mm/vmscan: Consider anonymous pages without swap To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org,Dave Hansen ,kbusch@kernel.org,vishal.l.verma@intel.com,yang.shi@linux.alibaba.com,rientjes@google.com,ying.huang@intel.com,dan.j.williams@intel.com From: Dave Hansen Date: Mon, 29 Jun 2020 16:45:14 -0700 References: <20200629234503.749E5340@viggo.jf.intel.com> In-Reply-To: <20200629234503.749E5340@viggo.jf.intel.com> Message-Id: <20200629234514.CE5BA063@viggo.jf.intel.com> X-Rspamd-Queue-Id: 188FD180D0180 X-Spamd-Result: default: False [0.00 / 100.00] X-Rspamd-Server: rspam05 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Keith Busch Age and reclaim anonymous pages if a migration path is available. The node has other recourses for inactive anonymous pages beyond swap, #Signed-off-by: Keith Busch Cc: Keith Busch [vishal: fixup the migration->demotion rename] Signed-off-by: Vishal Verma Signed-off-by: Dave Hansen Cc: Yang Shi Cc: David Rientjes Cc: Huang Ying Cc: Dan Williams --- Changes from Dave 06/2020: * rename reclaim_anon_pages()->can_reclaim_anon_pages() --- b/include/linux/node.h | 9 +++++++++ b/mm/vmscan.c | 32 +++++++++++++++++++++++++++----- 2 files changed, 36 insertions(+), 5 deletions(-) diff -puN include/linux/node.h~0009-mm-vmscan-Consider-anonymous-pages-without-swap include/linux/node.h --- a/include/linux/node.h~0009-mm-vmscan-Consider-anonymous-pages-without-swap 2020-06-29 16:34:42.861312594 -0700 +++ b/include/linux/node.h 2020-06-29 16:34:42.867312594 -0700 @@ -180,4 +180,13 @@ static inline void register_hugetlbfs_wi #define to_node(device) container_of(device, struct node, dev) +#ifdef CONFIG_MIGRATION +extern int next_demotion_node(int node); +#else +static inline int next_demotion_node(int node) +{ + return NUMA_NO_NODE; +} +#endif + #endif /* _LINUX_NODE_H_ */ diff -puN mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap mm/vmscan.c --- a/mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap 2020-06-29 16:34:42.863312594 -0700 +++ b/mm/vmscan.c 2020-06-29 16:34:42.868312594 -0700 @@ -288,6 +288,26 @@ static bool writeback_throttling_sane(st } #endif +static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg, + int node_id) +{ + /* Always age anon pages when we have swap */ + if (memcg == NULL) { + if (get_nr_swap_pages() > 0) + return true; + } else { + if (mem_cgroup_get_nr_swap_pages(memcg) > 0) + return true; + } + + /* Also age anon pages if we can auto-migrate them */ + if (next_demotion_node(node_id) >= 
0) + return true; + + /* No way to reclaim anon pages */ + return false; +} + /* * This misses isolated pages which are not accounted for to save counters. * As the data only determines if reclaim or compaction continues, it is @@ -299,7 +319,7 @@ unsigned long zone_reclaimable_pages(str nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) + zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE); - if (get_nr_swap_pages() > 0) + if (can_reclaim_anon_pages(NULL, zone_to_nid(zone))) nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) + zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON); @@ -2267,7 +2287,7 @@ static void get_scan_count(struct lruvec enum lru_list lru; /* If we have no swap space, do not bother scanning anon pages. */ - if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) { + if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) { scan_balance = SCAN_FILE; goto out; } @@ -2572,7 +2592,9 @@ static void shrink_lruvec(struct lruvec * Even if we did not try to evict anon pages at all, we want to * rebalance the anon lru active/inactive ratio. */ - if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON)) + if (can_reclaim_anon_pages(lruvec_memcg(lruvec), + lruvec_pgdat(lruvec)->node_id) && + inactive_is_low(lruvec, LRU_INACTIVE_ANON)) shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, LRU_ACTIVE_ANON); } @@ -2642,7 +2664,7 @@ static inline bool should_continue_recla */ pages_for_compaction = compact_gap(sc->order); inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE); - if (get_nr_swap_pages() > 0) + if (can_reclaim_anon_pages(NULL, pgdat->node_id)) inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON); return inactive_lru_pages > pages_for_compaction; @@ -3395,7 +3417,7 @@ static void age_active_anon(struct pglis struct mem_cgroup *memcg; struct lruvec *lruvec; - if (!total_swap_pages) + if (!can_reclaim_anon_pages(NULL, pgdat->node_id)) return; lruvec = mem_cgroup_lruvec(NULL, pgdat); From patchwork Mon Jun 29 23:45:15 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Hansen X-Patchwork-Id: 11632869 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6745B92A for ; Mon, 29 Jun 2020 23:48:49 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 3E68820780 for ; Mon, 29 Jun 2020 23:48:49 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 3E68820780 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 780C48D001F; Mon, 29 Jun 2020 19:48:42 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 758C18D001D; Mon, 29 Jun 2020 19:48:42 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 5834A8D001F; Mon, 29 Jun 2020 19:48:42 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0156.hostedemail.com [216.40.44.156]) by kanga.kvack.org (Postfix) with ESMTP id 442288D001D for ; Mon, 29 Jun 2020 19:48:42 -0400 (EDT) Received: from smtpin18.hostedemail.com 
Subject: [RFC][PATCH 7/8] mm/vmscan: never demote for memcg reclaim
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Dave Hansen, yang.shi@linux.alibaba.com,
 rientjes@google.com, ying.huang@intel.com, dan.j.williams@intel.com
From: Dave Hansen
Date: Mon, 29 Jun 2020 16:45:15 -0700
References: <20200629234503.749E5340@viggo.jf.intel.com>
In-Reply-To: <20200629234503.749E5340@viggo.jf.intel.com>
Message-Id: <20200629234515.B11A021E@viggo.jf.intel.com>

From: Dave Hansen

Global reclaim aims to reduce the amount of memory used on a given node or
set of nodes.  Migrating pages to another node serves this purpose.

memcg reclaim is different.  Its goal is to reduce the total memory
consumption of the entire memcg, across all nodes.  Migration does not assist
memcg reclaim because it just moves page contents between nodes rather than
actually reducing memory consumption.
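[Editor's illustration] The distinction drawn above reduces to a small
predicate: demotion is only worth considering for global (node-local) reclaim,
never when reclaim was triggered by a memcg limit.  The sketch below models
that decision in plain C; the scan-control structure, the helper names, and
the node-0-demotes-to-node-1 table are simplified stand-ins for illustration,
not the kernel's data structures.

#include <stdio.h>
#include <stdbool.h>

#define NUMA_NO_NODE	(-1)

/* Simplified stand-in for struct scan_control. */
struct scan_ctl {
	bool cgroup_reclaim;	/* true when reclaiming to satisfy a memcg limit */
};

/* Stand-in demotion table: node 0 demotes to node 1. */
static int next_demotion_node(int node)
{
	return node == 0 ? 1 : NUMA_NO_NODE;
}

/*
 * Demote only for global reclaim: moving a page to another node does not
 * reduce a memcg's total consumption, so it does nothing for memcg reclaim.
 */
static bool may_demote(const struct scan_ctl *sc, int node)
{
	if (sc->cgroup_reclaim)
		return false;
	return next_demotion_node(node) != NUMA_NO_NODE;
}

int main(void)
{
	struct scan_ctl global = { .cgroup_reclaim = false };
	struct scan_ctl memcg  = { .cgroup_reclaim = true };

	printf("global reclaim, node 0: %s\n", may_demote(&global, 0) ? "demote" : "reclaim");
	printf("memcg reclaim,  node 0: %s\n", may_demote(&memcg, 0) ? "demote" : "reclaim");
	return 0;
}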
Signed-off-by: Dave Hansen Suggested-by: Yang Shi Cc: David Rientjes Cc: Huang Ying Cc: Dan Williams --- b/mm/vmscan.c | 61 +++++++++++++++++++++++++++++++++++++++------------------- 1 file changed, 42 insertions(+), 19 deletions(-) diff -puN mm/vmscan.c~never-demote-for-memcg-reclaim mm/vmscan.c --- a/mm/vmscan.c~never-demote-for-memcg-reclaim 2020-06-29 16:34:44.018312591 -0700 +++ b/mm/vmscan.c 2020-06-29 16:34:44.023312591 -0700 @@ -289,7 +289,8 @@ static bool writeback_throttling_sane(st #endif static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg, - int node_id) + int node_id, + struct scan_control *sc) { /* Always age anon pages when we have swap */ if (memcg == NULL) { @@ -300,8 +301,14 @@ static inline bool can_reclaim_anon_page return true; } - /* Also age anon pages if we can auto-migrate them */ - if (next_demotion_node(node_id) >= 0) + /* + * Also age anon pages if we can auto-migrate them. + * + * Migrating a page does not reduce consumption of a + * memcg so should not be performed when in memcg + * reclaim. + */ + if ((!sc || !cgroup_reclaim(sc)) && (next_demotion_node(node_id) >= 0)) return true; /* No way to reclaim anon pages */ @@ -319,7 +326,7 @@ unsigned long zone_reclaimable_pages(str nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) + zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE); - if (can_reclaim_anon_pages(NULL, zone_to_nid(zone))) + if (can_reclaim_anon_pages(NULL, zone_to_nid(zone), NULL)) nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) + zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON); @@ -1084,6 +1091,32 @@ static void page_check_dirty_writeback(s mapping->a_ops->is_dirty_writeback(page, dirty, writeback); } + +static int shrink_do_demote_mapping(struct page *page, + struct list_head *page_list, + struct scan_control *sc) +{ + int rc; + + /* It is pointless to do demotion in memcg reclaim */ + if (cgroup_reclaim(sc)) + return -ENOTSUPP; + + rc = migrate_demote_mapping(page); + /* + * -ENOMEM on a THP may indicate either migration is + * unsupported or there was not enough contiguous + * space. Split the THP into base pages and retry the + * head immediately. The tail pages will be considered + * individually within the current loop's page list. + */ + if (rc == -ENOMEM && PageTransHuge(page) && + !split_huge_page_to_list(page, page_list)) + rc = migrate_demote_mapping(page); + + return rc; +} + /* * shrink_page_list() returns the number of reclaimed pages */ @@ -1251,17 +1284,7 @@ static unsigned long shrink_page_list(st ; /* try to reclaim the page below */ } - rc = migrate_demote_mapping(page); - /* - * -ENOMEM on a THP may indicate either migration is - * unsupported or there was not enough contiguous - * space. Split the THP into base pages and retry the - * head immediately. The tail pages will be considered - * individually within the current loop's page list. - */ - if (rc == -ENOMEM && PageTransHuge(page) && - !split_huge_page_to_list(page, page_list)) - rc = migrate_demote_mapping(page); + rc = shrink_do_demote_mapping(page, page_list, sc); if (rc == MIGRATEPAGE_SUCCESS) { unlock_page(page); @@ -2287,7 +2310,7 @@ static void get_scan_count(struct lruvec enum lru_list lru; /* If we have no swap space, do not bother scanning anon pages.
*/ - if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) { + if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, sc)) { scan_balance = SCAN_FILE; goto out; } @@ -2593,7 +2616,7 @@ static void shrink_lruvec(struct lruvec * rebalance the anon lru active/inactive ratio. */ if (can_reclaim_anon_pages(lruvec_memcg(lruvec), - lruvec_pgdat(lruvec)->node_id) && + lruvec_pgdat(lruvec)->node_id, sc) && inactive_is_low(lruvec, LRU_INACTIVE_ANON)) shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, LRU_ACTIVE_ANON); @@ -2664,7 +2687,7 @@ static inline bool should_continue_recla */ pages_for_compaction = compact_gap(sc->order); inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE); - if (can_reclaim_anon_pages(NULL, pgdat->node_id)) + if (can_reclaim_anon_pages(NULL, pgdat->node_id, sc)) inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON); return inactive_lru_pages > pages_for_compaction; @@ -3417,7 +3440,7 @@ static void age_active_anon(struct pglis struct mem_cgroup *memcg; struct lruvec *lruvec; - if (!can_reclaim_anon_pages(NULL, pgdat->node_id)) + if (!can_reclaim_anon_pages(NULL, pgdat->node_id, sc)) return; lruvec = mem_cgroup_lruvec(NULL, pgdat); From patchwork Mon Jun 29 23:45:17 2020 X-Patchwork-Submitter: Dave Hansen X-Patchwork-Id: 11632871
Subject: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org,Dave Hansen ,yang.shi@linux.alibaba.com,rientjes@google.com,ying.huang@intel.com,dan.j.williams@intel.com From: Dave Hansen Date: Mon, 29 Jun 2020 16:45:17 -0700 References: <20200629234503.749E5340@viggo.jf.intel.com> In-Reply-To: <20200629234503.749E5340@viggo.jf.intel.com> Message-Id: <20200629234517.A7EC4BD3@viggo.jf.intel.com> From: Dave Hansen Some method is obviously needed to enable reclaim-based migration. Just like traditional autonuma, some workloads will benefit, such as workloads with more "static" configurations where hot pages stay hot and cold pages stay cold. If pages come and go from the hot and cold sets, the benefits of this approach will be more limited. The benefits are truly workload-based and *not* hardware-based. We do not believe that there is a viable threshold where certain hardware configurations should have this mechanism enabled while others do not. To be conservative, earlier work defaulted to disabling reclaim-based migration and did not include a mechanism to enable it. This proposes extending the existing "zone_reclaim_mode" (now really node_reclaim_mode) as a method to enable it. We are open to any alternative that allows end users to enable this mechanism or disable it if workload harm is detected (just like traditional autonuma). The implementation here is pretty simple and entirely unoptimized. On any memory hotplug event, assume that a node was added or removed and recalculate all migration targets. This ensures that the node_demotion[] array is always ready to be used in case the new reclaim mode is enabled. This recalculation is far from optimal, most glaringly in that it does not even attempt to figure out whether nodes are actually coming or going.
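For orientation, a hedged sketch of how the new bit would be consumed: the hunks below add the RECLAIM_MIGRATE definition and the sysctl documentation, but the reclaim-path check itself falls outside the quoted context, so treat the helper below as an assumption about its shape rather than a quote from the patch:

	/* Sketch only: gate demotion on the new vm.zone_reclaim_mode bit. */
	static inline bool reclaim_migrate_enabled(void)
	{
		/* node_reclaim_mode backs the vm.zone_reclaim_mode sysctl */
		return node_reclaim_mode & RECLAIM_MIGRATE;
	}

An administrator would then opt in by setting the new bit (value 8), for example by OR-ing 8 into the current value of /proc/sys/vm/zone_reclaim_mode.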
Signed-off-by: Dave Hansen Cc: Yang Shi Cc: David Rientjes Cc: Huang Ying Cc: Dan Williams --- b/Documentation/admin-guide/sysctl/vm.rst | 9 ++++ b/mm/migrate.c | 61 +++++++++++++++++++++++++++++- b/mm/vmscan.c | 7 +-- 3 files changed, 73 insertions(+), 4 deletions(-) diff -puN Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion Documentation/admin-guide/sysctl/vm.rst --- a/Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion 2020-06-29 16:35:01.012312549 -0700 +++ b/Documentation/admin-guide/sysctl/vm.rst 2020-06-29 16:35:01.021312549 -0700 @@ -941,6 +941,7 @@ This is value OR'ed together of 1 (bit currently ignored) 2 Zone reclaim writes dirty pages out 4 Zone reclaim swaps pages +8 Zone reclaim migrates pages = =================================== zone_reclaim_mode is disabled by default. For file servers or workloads @@ -965,3 +966,11 @@ of other processes running on other node Allowing regular swap effectively restricts allocations to the local node unless explicitly overridden by memory policies or cpuset configurations. + +Page migration during reclaim is intended for systems with tiered memory +configurations. These systems have multiple types of memory with varied +performance characteristics instead of plain NUMA systems where the same +kind of memory is found at varied distances. Allowing page migration +during reclaim enables these systems to migrate pages from fast tiers to +slow tiers when the fast tier is under pressure. This migration is +performed before swap. diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c --- a/mm/migrate.c~enable-numa-demotion 2020-06-29 16:35:01.015312549 -0700 +++ b/mm/migrate.c 2020-06-29 16:35:01.022312549 -0700 @@ -49,6 +49,7 @@ #include #include #include +#include #include @@ -3165,6 +3166,10 @@ void set_migration_target_nodes(void) * Avoid any oddities like cycles that could occur * from changes in the topology. This will leave * a momentary gap when migration is disabled. + * + * This is superfluous for memory offlining since + * MEM_GOING_OFFLINE does it independently, but it + * does not hurt to do it a second time. */ disable_all_migrate_targets(); @@ -3211,6 +3216,60 @@ again: /* Is another pass necessary? */ if (!nodes_empty(next_pass)) goto again; +} - put_online_mems(); +/* + * React to hotplug events that might online or offline + * NUMA nodes. + * + * This leaves migrate-on-reclaim transiently disabled + * between the MEM_GOING_OFFLINE and MEM_OFFLINE events. + * This runs whether RECLAIM_MIGRATE is enabled or not. + * That ensures that the user can turn RECLAIM_MIGRATE on + * and off without needing to recalculate migration targets. + */ +#if defined(CONFIG_MEMORY_HOTPLUG) +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, + unsigned long action, void *arg) +{ + switch (action) { + case MEM_GOING_OFFLINE: + /* + * Make sure there are not transient states where + * an offline node is a migration target. This + * will leave migration disabled until the offline + * completes and the MEM_OFFLINE case below runs. + */ + disable_all_migrate_targets(); + break; + case MEM_OFFLINE: + case MEM_ONLINE: + /* + * Recalculate the target nodes once the node + * reaches its final state (online or offline). + */ + set_migration_target_nodes(); + break; + case MEM_CANCEL_OFFLINE: + /* + * MEM_GOING_OFFLINE disabled all the migration + * targets. Re-enable them.
+ */ + set_migration_target_nodes(); + break; + case MEM_GOING_ONLINE: + case MEM_CANCEL_ONLINE: + break; + } + + return notifier_from_errno(0); } + +static int __init migrate_on_reclaim_init(void) +{ + hotplug_memory_notifier(migrate_on_reclaim_callback, 100); + return 0; +} +late_initcall(migrate_on_reclaim_init); +#endif /* CONFIG_MEMORY_HOTPLUG */ + diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c --- a/mm/vmscan.c~enable-numa-demotion 2020-06-29 16:35:01.017312549 -0700 +++ b/mm/vmscan.c 2020-06-29 16:35:01.023312549 -0700 @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly; * These bit locations are exposed in the vm.zone_reclaim_mode sysctl * ABI. New bits are OK, but existing bits can never change. */ -#define RECLAIM_RSVD (1<<0) /* (currently ignored/unused) */ -#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */ -#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */ +#define RECLAIM_RSVD (1<<0) /* (currently ignored/unused) */ +#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */ +#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */ +#define RECLAIM_MIGRATE (1<<3) /* Migrate pages during reclaim */ /* * Priority for NODE_RECLAIM. This determines the fraction of pages