From patchwork Sun Dec 12 11:31:57 2021
X-Patchwork-Submitter: Baolin Wang
X-Patchwork-Id: 12672215
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: akpm@linux-foundation.org, ying.huang@intel.com, dave.hansen@linux.intel.com
Cc: ziy@nvidia.com, shy828301@gmail.com, baolin.wang@linux.alibaba.com, zhongjiang-ali@linux.alibaba.com, xlpang@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 1/4] mm: Add speculative numa fault support
Date: Sun, 12 Dec 2021 19:31:57 +0800
Message-Id: <0ec3e9ce4b564bee3883b6141b1f9f2498188002.1639306956.git.baolin.wang@linux.alibaba.com>

Some workloads access data with strong locality of reference: once a piece of data has been accessed, nearby data is likely to be accessed soon. On systems with multiple memory types, NUMA balancing is relied upon to promote hot pages from slow memory to fast memory to improve performance. For such workloads we can therefore promote several sequential pages on slow memory in advance, exploiting this locality to improve performance.

Thus this patch adds a speculative numa fault mechanism that migrates suitable pages in advance. The basic concept is a per-VMA numa fault window, which records the last numa fault address and the number of pages that should be migrated to the target node. When a numa fault occurs, we check the last numa fault window of the current VMA to see whether the accesses form a sequential stream: if so, we expand the window; if not, we shrink the window or close speculation entirely to avoid unnecessary migration.
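As a rough illustration of the window policy described above, here is a minimal userspace sketch of the expand/keep/shrink decision. It is a simplified model only, not the kernel code: the names (adjust_window, fault_window, WIN_* constants, PG_SIZE) are illustrative, and the VMA/PMD clamping and the closed-window adjacency check from the real patch are omitted.

#include <stdio.h>

#define PG_SIZE     4096UL	/* assume 4K pages */
#define WIN_DEFAULT 2U		/* pages to speculate when a stream (re)starts */
#define WIN_EXPAND  1U		/* growth step on sequential access */
#define WIN_SHRINK  2U		/* shrink step on random access */

struct fault_window {
	unsigned long next_start;	/* address where the next window begins */
	unsigned int pages;		/* current window size in pages */
};

/* Decide how many pages to migrate speculatively for this fault. */
static unsigned int adjust_window(struct fault_window *w, unsigned long addr)
{
	unsigned long end = w->next_start + w->pages * PG_SIZE;

	if (w->pages == 0)
		w->pages = WIN_DEFAULT;		/* reopen a small window */
	else if (addr == end || addr == end + PG_SIZE)
		w->pages += WIN_EXPAND;		/* sequential: grow the window */
	else if (addr < w->next_start || addr >= end)
		w->pages = w->pages > WIN_SHRINK ?
			   w->pages - WIN_SHRINK : 0;	/* random: shrink or close */
	/* else: addr falls inside the previous window, keep the current size */

	w->next_start = addr + PG_SIZE;		/* window starts after this fault */
	return w->pages;
}

int main(void)
{
	struct fault_window w = { 0, 0 };
	unsigned long addr = 0x100000UL;
	int i;

	/* A sequential fault stream keeps widening the window. */
	for (i = 0; i < 5; i++) {
		unsigned int pages = adjust_window(&w, addr);

		printf("fault at %#lx -> speculate %u page(s)\n", addr, pages);
		addr += (pages + 1) * PG_SIZE;	/* next fault lands past the window */
	}
	return 0;
}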
Testing with mysql shows about a 6% performance improvement, as below.

Machine: 16 CPUs, 64G DRAM, 256G AEP

sysbench /usr/share/sysbench/tests/include/oltp_legacy/oltp.lua
--mysql-user=root --mysql-password=root --oltp-test-mode=complex
--oltp-tables-count=80 --oltp-table-size=5000000 --threads=20
--time=600 --report-interval=10 prepare/run

No speculative numa fault:
    queries performed:
        read:         33039860
        write:        9439960
        other:        4719980
        total:        47199800
    transactions:     2359990 (3933.28 per sec.)
    queries:          47199800 (78665.50 per sec.)

Speculative numa fault:
    queries performed:
        read:         34896862
        write:        9970532
        other:        4985266
        total:        49852660
    transactions:     2492633 (4154.35 per sec.)
    queries:          49852660 (83086.94 per sec.)

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 include/linux/mm_types.h |   3 +
 mm/memory.c              | 165 ++++++++++++++++++++++++++++++++++++---
 2 files changed, 159 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 449b6eafc695..8d8381e9aec9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -474,6 +474,9 @@ struct vm_area_struct {
 #endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	atomic_long_t numafault_ahead_info;
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 } __randomize_layout;
diff --git a/mm/memory.c b/mm/memory.c
index 2291417783bc..2c9ed63e4e23 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -74,6 +74,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
@@ -4315,16 +4317,156 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 	return mpol_misplaced(page, vma, addr);
 }
 
+static bool try_next_numa_page(struct vm_fault *vmf, unsigned int win_pages,
+			       unsigned long *fault_addr)
+{
+	unsigned long next_fault_addr = *fault_addr + PAGE_SIZE;
+	unsigned long numa_fault_end = vmf->address + (win_pages + 1) * PAGE_SIZE;
+
+	if (next_fault_addr > numa_fault_end)
+		return false;
+
+	*fault_addr = next_fault_addr;
+	vmf->pte = pte_offset_map(vmf->pmd, next_fault_addr);
+	vmf->orig_pte = *vmf->pte;
+	if (pte_protnone(vmf->orig_pte))
+		return true;
+
+	return false;
+}
+
+#define NUMA_FAULT_AHEAD_DEFAULT	2
+#define NUMA_FAULT_EXPAND_STEP		1
+#define NUMA_FAULT_REDUCE_STEP		2
+#define GET_NUMA_FAULT_INFO(vma)	\
+	(atomic_long_read(&(vma)->numafault_ahead_info))
+#define NUMA_FAULT_WINDOW_START(v)	((v) & PAGE_MASK)
+#define NUMA_FAULT_WINDOW_SIZE_MASK	((1UL << PAGE_SHIFT) - 1)
+#define NUMA_FAULT_WINDOW_SIZE(v)	((v) & NUMA_FAULT_WINDOW_SIZE_MASK)
+#define NUMA_FAULT_INFO(addr, win)	\
+	(((addr) & PAGE_MASK) |		\
+	 ((win) & NUMA_FAULT_WINDOW_SIZE_MASK))
+
+static inline unsigned int numa_fault_max_pages(struct vm_area_struct *vma,
+						unsigned long fault_address)
+{
+	unsigned long pmd_end_addr = (fault_address & PMD_MASK) + PMD_SIZE;
+	unsigned long max_fault_addr = min_t(unsigned long, pmd_end_addr,
+					     vma->vm_end);
+
+	return (max_fault_addr - fault_address - 1) >> PAGE_SHIFT;
+}
+
+static unsigned int adjust_numa_fault_window(struct vm_area_struct *vma,
+					     unsigned long fault_address)
+{
+	unsigned long numafault_ahead = GET_NUMA_FAULT_INFO(vma);
+	unsigned long prev_start = NUMA_FAULT_WINDOW_START(numafault_ahead);
+	unsigned int prev_pages = NUMA_FAULT_WINDOW_SIZE(numafault_ahead);
+	unsigned long win_start;
+	unsigned int win_pages, max_fault_pages;
+
+	win_start = fault_address + PAGE_SIZE;
+
+	/*
+	 * The VMA is being accessed for the first time, so just open a
+	 * small window to try.
+	 */
+	if (!numafault_ahead) {
+		win_pages = NUMA_FAULT_AHEAD_DEFAULT;
+		goto out;
+	}
+
+	/*
+	 * If the last numa fault window was closed, check whether the current
+	 * fault address is contiguous with the previous fault address before
+	 * opening a new numa fault window.
+	 */
+	if (!prev_pages) {
+		if (fault_address == prev_start ||
+		    fault_address == prev_start + PAGE_SIZE)
+			win_pages = NUMA_FAULT_AHEAD_DEFAULT;
+		else
+			win_pages = 0;
+
+		goto out;
+	}
+
+	/*
+	 * TODO: handle the case where the fault address falls before the
+	 * last numa fault window.
+	 */
+	if (fault_address >= prev_start) {
+		unsigned long prev_end = prev_start + prev_pages * PAGE_SIZE;
+
+		/*
+		 * The fault continues the previous numa fault window, so
+		 * assume sequential access and expand the numa fault window.
+		 */
+		if (fault_address == prev_end ||
+		    fault_address == prev_end + PAGE_SIZE) {
+			win_pages = prev_pages + NUMA_FAULT_EXPAND_STEP;
+			goto validate_out;
+		} else if (fault_address < prev_end) {
+			/*
+			 * The current fault address is within the last numa
+			 * fault window, which means not all pages in that
+			 * window were migrated successfully. Keep the current
+			 * window size and try again, since the speculation
+			 * may still be on the right track.
+			 */
+			win_pages = prev_pages;
+			goto validate_out;
+		}
+	}
+
+	/*
+	 * Otherwise assume random access and reduce the numa fault window
+	 * by one step.
+	 */
+	if (prev_pages <= NUMA_FAULT_REDUCE_STEP) {
+		win_pages = 0;
+		goto out;
+	} else {
+		win_pages = prev_pages - NUMA_FAULT_REDUCE_STEP;
+	}
+
+validate_out:
+	/*
+	 * Make sure the numa fault window does not extend past the end
+	 * of the current VMA or PMD.
+	 */
+	max_fault_pages = numa_fault_max_pages(vma, fault_address);
+	if (win_pages > max_fault_pages)
+		win_pages = max_fault_pages;
+
+out:
+	atomic_long_set(&vma->numafault_ahead_info,
+			NUMA_FAULT_INFO(win_start, win_pages));
+	return win_pages;
+}
+
 static vm_fault_t do_numa_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct page *page = NULL;
-	int page_nid = NUMA_NO_NODE;
+	int page_nid;
 	int last_cpupid;
 	int target_nid;
 	pte_t pte, old_pte;
-	bool was_writable = pte_savedwrite(vmf->orig_pte);
-	int flags = 0;
+	bool was_writable;
+	int flags;
+	unsigned long fault_address = vmf->address;
+	unsigned int win_pages;
+
+	/* Try to speculate the numa fault window for the current VMA. */
+	win_pages = adjust_numa_fault_window(vma, fault_address);
+
+try_next:
+	was_writable = pte_savedwrite(vmf->orig_pte);
+	flags = 0;
+	page_nid = NUMA_NO_NODE;
 
 	/*
 	 * The "pte" at this point cannot be used safely without
@@ -4342,7 +4484,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	old_pte = ptep_get(vmf->pte);
 	pte = pte_modify(old_pte, vma->vm_page_prot);
 
-	page = vm_normal_page(vma, vmf->address, pte);
+	page = vm_normal_page(vma, fault_address, pte);
 	if (!page)
 		goto out_map;
 
@@ -4378,7 +4520,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		last_cpupid = (-1 & LAST_CPUPID_MASK);
 	else
 		last_cpupid = page_cpupid_last(page);
-	target_nid = numa_migrate_prep(page, vma, vmf->address, page_nid,
+	target_nid = numa_migrate_prep(page, vma, fault_address, page_nid,
 				       &flags);
 	if (target_nid == NUMA_NO_NODE) {
 		put_page(page);
@@ -4392,7 +4534,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		flags |= TNF_MIGRATED;
 	} else {
 		flags |= TNF_MIGRATE_FAIL;
-		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+		vmf->pte = pte_offset_map(vmf->pmd, fault_address);
 		spin_lock(vmf->ptl);
 		if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -4404,19 +4546,24 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 out:
 	if (page_nid != NUMA_NO_NODE)
 		task_numa_fault(last_cpupid, page_nid, 1, flags);
+
+	if ((flags & TNF_MIGRATED) && (win_pages > 0) &&
+	    try_next_numa_page(vmf, win_pages, &fault_address))
+		goto try_next;
+
 	return 0;
 out_map:
 	/*
 	 * Make it present again, depending on how arch implements
 	 * non-accessible ptes, some can allow access by kernel mode.
 	 */
-	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
+	old_pte = ptep_modify_prot_start(vma, fault_address, vmf->pte);
 	pte = pte_modify(old_pte, vma->vm_page_prot);
 	pte = pte_mkyoung(pte);
 	if (was_writable)
 		pte = pte_mkwrite(pte);
-	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
-	update_mmu_cache(vma, vmf->address, vmf->pte);
+	ptep_modify_prot_commit(vma, fault_address, vmf->pte, old_pte, pte);
+	update_mmu_cache(vma, fault_address, vmf->pte);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	goto out;
 }

From patchwork Sun Dec 12 11:31:58 2021
X-Patchwork-Submitter: Baolin Wang
X-Patchwork-Id: 12672219
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: akpm@linux-foundation.org, ying.huang@intel.com, dave.hansen@linux.intel.com
Cc: ziy@nvidia.com, shy828301@gmail.com, baolin.wang@linux.alibaba.com, zhongjiang-ali@linux.alibaba.com, xlpang@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 2/4] mm: Add a debug interface to control the range of speculative numa fault
Date: Sun, 12 Dec 2021 19:31:58 +0800
Message-Id: <913a8a5282d265dc771309ca552c9c62c247c2b0.1639306956.git.baolin.wang@linux.alibaba.com>

Add a debug interface to control the range of the speculative numa fault, which can be used to tune performance or even disable the speculative numa fault window entirely for some workloads.
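For example, assuming debugfs is mounted at /sys/kernel/debug (the file name comes from this patch, and the values below are only examples):

  # speculate over at most 256KB around each numa fault
  echo 262144 > /sys/kernel/debug/numa_around_bytes

  # disable speculative numa faults entirely
  echo 0 > /sys/kernel/debug/numa_around_bytes

Per the setter below, values larger than PAGE_SIZE are rounded down to a power of two, values of PAGE_SIZE or less disable speculation, and values larger than PTRS_PER_PTE pages are rejected with -EINVAL.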
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/memory.c | 46 +++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 43 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2c9ed63e4e23..a0f4a2a008cc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4052,7 +4052,29 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 static unsigned long fault_around_bytes __read_mostly =
 	rounddown_pow_of_two(65536);
 
+static unsigned long numa_around_bytes __read_mostly;
+
 #ifdef CONFIG_DEBUG_FS
+static int numa_around_bytes_get(void *data, u64 *val)
+{
+	*val = numa_around_bytes;
+	return 0;
+}
+
+static int numa_around_bytes_set(void *data, u64 val)
+{
+	if (val / PAGE_SIZE > PTRS_PER_PTE)
+		return -EINVAL;
+	if (val > PAGE_SIZE)
+		numa_around_bytes = rounddown_pow_of_two(val);
+	else
+		numa_around_bytes = 0; /* rounddown_pow_of_two(0) is undefined */
+	return 0;
+}
+DEFINE_DEBUGFS_ATTRIBUTE(numa_around_bytes_fops,
+			 numa_around_bytes_get,
+			 numa_around_bytes_set, "%llu\n");
+
 static int fault_around_bytes_get(void *data, u64 *val)
 {
 	*val = fault_around_bytes;
@@ -4080,6 +4102,8 @@ static int __init fault_around_debugfs(void)
 {
 	debugfs_create_file_unsafe("fault_around_bytes", 0644, NULL, NULL,
 				   &fault_around_bytes_fops);
+	debugfs_create_file_unsafe("numa_around_bytes", 0644, NULL, NULL,
+				   &numa_around_bytes_fops);
 	return 0;
 }
 late_initcall(fault_around_debugfs);
@@ -4348,10 +4372,13 @@ static bool try_next_numa_page(struct vm_fault *vmf, unsigned int win_pages,
 	 ((win) & NUMA_FAULT_WINDOW_SIZE_MASK))
 
 static inline unsigned int numa_fault_max_pages(struct vm_area_struct *vma,
-						unsigned long fault_address)
+						unsigned long fault_address,
+						unsigned long numa_around_size)
 {
+	unsigned long numa_around_addr =
+		(fault_address + numa_around_size) & PAGE_MASK;
 	unsigned long pmd_end_addr = (fault_address & PMD_MASK) + PMD_SIZE;
-	unsigned long max_fault_addr = min_t(unsigned long, pmd_end_addr,
+	unsigned long max_fault_addr = min3(numa_around_addr, pmd_end_addr,
 					    vma->vm_end);
 
 	return (max_fault_addr - fault_address - 1) >> PAGE_SHIFT;
@@ -4360,12 +4387,24 @@ static inline unsigned int numa_fault_max_pages(struct vm_area_struct *vma,
 static unsigned int adjust_numa_fault_window(struct vm_area_struct *vma,
 					     unsigned long fault_address)
 {
+	unsigned long numa_around_size = READ_ONCE(numa_around_bytes);
 	unsigned long numafault_ahead = GET_NUMA_FAULT_INFO(vma);
 	unsigned long prev_start = NUMA_FAULT_WINDOW_START(numafault_ahead);
 	unsigned int prev_pages = NUMA_FAULT_WINDOW_SIZE(numafault_ahead);
 	unsigned long win_start;
 	unsigned int win_pages, max_fault_pages;
 
+	/*
+	 * Shut down the proactive numa fault if numa_around_bytes is
+	 * set to 0.
+	 */
+	if (!numa_around_size) {
+		if (numafault_ahead)
+			atomic_long_set(&vma->numafault_ahead_info,
+					NUMA_FAULT_INFO(0, 0));
+		return 0;
+	}
+
 	win_start = fault_address + PAGE_SIZE;
 
 	/*
@@ -4437,7 +4476,8 @@ static unsigned int adjust_numa_fault_window(struct vm_area_struct *vma,
 	 * Make sure the numa fault window does not extend past the end
 	 * of the current VMA or PMD.
 	 */
-	max_fault_pages = numa_fault_max_pages(vma, fault_address);
+	max_fault_pages = numa_fault_max_pages(vma, fault_address,
+					       numa_around_size);
 	if (win_pages > max_fault_pages)
 		win_pages = max_fault_pages;

From patchwork Sun Dec 12 11:31:59 2021
X-Patchwork-Submitter: Baolin Wang
X-Patchwork-Id: 12672223
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: akpm@linux-foundation.org, ying.huang@intel.com, dave.hansen@linux.intel.com
Cc: ziy@nvidia.com, shy828301@gmail.com, baolin.wang@linux.alibaba.com, zhongjiang-ali@linux.alibaba.com, xlpang@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 3/4] mm: Add speculative numa fault stats
Date: Sun, 12 Dec 2021 19:31:59 +0800

Add a new pgmigrate_speculation statistic to help tune the speculative numa fault window.
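The counter is only incremented when a speculatively selected page (i.e. a page other than the one that actually took the numa hint fault) is migrated successfully, so it can be read from /proc/vmstat to judge how effective the speculation window is, for example:

  grep pgmigrate_speculation /proc/vmstat

Comparing it against the existing pgmigrate_success counter over a benchmark run gives a rough idea of how much of the migration traffic comes from speculation.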
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 include/linux/vm_event_item.h | 1 +
 mm/memory.c                   | 2 ++
 mm/vmstat.c                   | 1 +
 3 files changed, 4 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index a185cc75ff52..97cdc661b7da 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -62,6 +62,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_MIGRATION_SUCCESS,
 		THP_MIGRATION_FAIL,
 		THP_MIGRATION_SPLIT,
+		PGMIGRATE_SPECULATION,
 #endif
 #ifdef CONFIG_COMPACTION
 		COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED,
diff --git a/mm/memory.c b/mm/memory.c
index a0f4a2a008cc..91122beb6e53 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4572,6 +4572,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	if (migrate_misplaced_page(page, vma, target_nid)) {
 		page_nid = target_nid;
 		flags |= TNF_MIGRATED;
+		if (vmf->address != fault_address)
+			count_vm_events(PGMIGRATE_SPECULATION, 1);
 	} else {
 		flags |= TNF_MIGRATE_FAIL;
 		vmf->pte = pte_offset_map(vmf->pmd, fault_address);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 787a012de3e2..c64700994786 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1314,6 +1314,7 @@ const char * const vmstat_text[] = {
 	"thp_migration_success",
 	"thp_migration_fail",
 	"thp_migration_split",
+	"pgmigrate_speculation",
 #endif
 #ifdef CONFIG_COMPACTION
 	"compact_migrate_scanned",

From patchwork Sun Dec 12 11:32:00 2021
X-Patchwork-Submitter: Baolin Wang
X-Patchwork-Id: 12672221
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: akpm@linux-foundation.org, ying.huang@intel.com, dave.hansen@linux.intel.com
Cc: ziy@nvidia.com, shy828301@gmail.com, baolin.wang@linux.alibaba.com, zhongjiang-ali@linux.alibaba.com, xlpang@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 4/4] mm: Update the speculative pages' access time
Date: Sun, 12 Dec 2021 19:32:00 +0800

Systems with multiple memory types, such as fast memory (DRAM) and slow memory (persistent memory), rely on NUMA balancing to promote hot pages from slow memory to fast memory to improve performance. Now that the speculative numa fault is supported, we can also update the access time of the speculatively selected pages, which makes it easier to promote them to the fast memory node and further improves performance.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/memory.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 91122beb6e53..e19b10299913 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4556,10 +4556,21 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	 * to record page access time. So use default value.
 	 */
 	if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
-	    !node_is_toptier(page_nid))
+	    !node_is_toptier(page_nid)) {
 		last_cpupid = (-1 & LAST_CPUPID_MASK);
-	else
+		/*
+		 * Due to data locality, for some workloads there is a high
+		 * probability that data near recently accessed data will be
+		 * accessed soon. So for tiered memory systems, update the
+		 * access time of the sequential pages located on the slow
+		 * memory type, to try to promote them to fast memory in
+		 * advance and improve performance.
+		 */
+		if (vmf->address != fault_address)
+			xchg_page_access_time(page, jiffies_to_msecs(jiffies));
+	} else {
 		last_cpupid = page_cpupid_last(page);
+	}
 	target_nid = numa_migrate_prep(page, vma, fault_address, page_nid,
 				       &flags);
 	if (target_nid == NUMA_NO_NODE) {