From patchwork Sun Dec 12 11:31:57 2021
X-Patchwork-Submitter: Baolin Wang
X-Patchwork-Id: 12672215
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: akpm@linux-foundation.org, ying.huang@intel.com, dave.hansen@linux.intel.com
Cc: ziy@nvidia.com, shy828301@gmail.com, baolin.wang@linux.alibaba.com, zhongjiang-ali@linux.alibaba.com, xlpang@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 1/4] mm: Add speculative numa fault support
Date: Sun, 12 Dec 2021 19:31:57 +0800
Message-Id: <0ec3e9ce4b564bee3883b6141b1f9f2498188002.1639306956.git.baolin.wang@linux.alibaba.com>

Some workloads access data with strong locality of reference: once a piece of data has been accessed, nearby data is likely to be accessed soon. On systems with multiple memory types, NUMA balancing is relied upon to promote hot pages from slow memory to fast memory to improve performance. For such workloads we can therefore promote several sequential pages on slow memory in advance, exploiting this locality to improve performance.

Thus this patch adds a speculative numa fault mechanism that migrates suitable pages in advance. The basic concept is a per-VMA numa fault window, which records the last numa fault address and the number of pages that should be migrated to the target node. When a numa fault occurs, we check the last numa fault window of the current VMA to see whether the accesses form a sequential stream: if so, we expand the window; if not, we shrink the window or close speculation entirely to avoid unnecessary migration.
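As a rough illustration of the window policy described above, here is a minimal userspace sketch of the expand/keep/shrink decision. It is a simplified model only, not the kernel code: the names (adjust_window, fault_window, WIN_* constants, PG_SIZE) are illustrative, and the VMA/PMD clamping and the closed-window adjacency check from the real patch are omitted.

#include <stdio.h>

#define PG_SIZE     4096UL	/* assume 4K pages */
#define WIN_DEFAULT 2U		/* pages to speculate when a stream (re)starts */
#define WIN_EXPAND  1U		/* growth step on sequential access */
#define WIN_SHRINK  2U		/* shrink step on random access */

struct fault_window {
	unsigned long next_start;	/* address where the next window begins */
	unsigned int pages;		/* current window size in pages */
};

/* Decide how many pages to migrate speculatively for this fault. */
static unsigned int adjust_window(struct fault_window *w, unsigned long addr)
{
	unsigned long end = w->next_start + w->pages * PG_SIZE;

	if (w->pages == 0)
		w->pages = WIN_DEFAULT;		/* reopen a small window */
	else if (addr == end || addr == end + PG_SIZE)
		w->pages += WIN_EXPAND;		/* sequential: grow the window */
	else if (addr < w->next_start || addr >= end)
		w->pages = w->pages > WIN_SHRINK ?
			   w->pages - WIN_SHRINK : 0;	/* random: shrink or close */
	/* else: addr falls inside the previous window, keep the current size */

	w->next_start = addr + PG_SIZE;		/* window starts after this fault */
	return w->pages;
}

int main(void)
{
	struct fault_window w = { 0, 0 };
	unsigned long addr = 0x100000UL;
	int i;

	/* A sequential fault stream keeps widening the window. */
	for (i = 0; i < 5; i++) {
		unsigned int pages = adjust_window(&w, addr);

		printf("fault at %#lx -> speculate %u page(s)\n", addr, pages);
		addr += (pages + 1) * PG_SIZE;	/* next fault lands past the window */
	}
	return 0;
}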
Testing with mysql shows about a 6% performance improvement, as below.

Machine: 16 CPUs, 64G DRAM, 256G AEP

sysbench /usr/share/sysbench/tests/include/oltp_legacy/oltp.lua
--mysql-user=root --mysql-password=root --oltp-test-mode=complex
--oltp-tables-count=80 --oltp-table-size=5000000 --threads=20
--time=600 --report-interval=10 prepare/run

No speculative numa fault:
    queries performed:
        read:         33039860
        write:        9439960
        other:        4719980
        total:        47199800
    transactions:     2359990 (3933.28 per sec.)
    queries:          47199800 (78665.50 per sec.)

Speculative numa fault:
    queries performed:
        read:         34896862
        write:        9970532
        other:        4985266
        total:        49852660
    transactions:     2492633 (4154.35 per sec.)
    queries:          49852660 (83086.94 per sec.)

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 include/linux/mm_types.h |   3 +
 mm/memory.c              | 165 ++++++++++++++++++++++++++++++++++++---
 2 files changed, 159 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 449b6eafc695..8d8381e9aec9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -474,6 +474,9 @@ struct vm_area_struct {
 #endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	atomic_long_t numafault_ahead_info;
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 } __randomize_layout;
diff --git a/mm/memory.c b/mm/memory.c
index 2291417783bc..2c9ed63e4e23 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -74,6 +74,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
@@ -4315,16 +4317,156 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 	return mpol_misplaced(page, vma, addr);
 }
 
+static bool try_next_numa_page(struct vm_fault *vmf, unsigned int win_pages,
+			       unsigned long *fault_addr)
+{
+	unsigned long next_fault_addr = *fault_addr + PAGE_SIZE;
+	unsigned long numa_fault_end = vmf->address + (win_pages + 1) * PAGE_SIZE;
+
+	if (next_fault_addr > numa_fault_end)
+		return false;
+
+	*fault_addr = next_fault_addr;
+	vmf->pte = pte_offset_map(vmf->pmd, next_fault_addr);
+	vmf->orig_pte = *vmf->pte;
+	if (pte_protnone(vmf->orig_pte))
+		return true;
+
+	return false;
+}
+
+#define NUMA_FAULT_AHEAD_DEFAULT	2
+#define NUMA_FAULT_EXPAND_STEP		1
+#define NUMA_FAULT_REDUCE_STEP		2
+#define GET_NUMA_FAULT_INFO(vma)	\
+	(atomic_long_read(&(vma)->numafault_ahead_info))
+#define NUMA_FAULT_WINDOW_START(v)	((v) & PAGE_MASK)
+#define NUMA_FAULT_WINDOW_SIZE_MASK	((1UL << PAGE_SHIFT) - 1)
+#define NUMA_FAULT_WINDOW_SIZE(v)	((v) & NUMA_FAULT_WINDOW_SIZE_MASK)
+#define NUMA_FAULT_INFO(addr, win)	\
+	(((addr) & PAGE_MASK) |		\
+	 ((win) & NUMA_FAULT_WINDOW_SIZE_MASK))
+
+static inline unsigned int numa_fault_max_pages(struct vm_area_struct *vma,
+						unsigned long fault_address)
+{
+	unsigned long pmd_end_addr = (fault_address & PMD_MASK) + PMD_SIZE;
+	unsigned long max_fault_addr = min_t(unsigned long, pmd_end_addr,
+					     vma->vm_end);
+
+	return (max_fault_addr - fault_address - 1) >> PAGE_SHIFT;
+}
+
+static unsigned int adjust_numa_fault_window(struct vm_area_struct *vma,
+					     unsigned long fault_address)
+{
+	unsigned long numafault_ahead = GET_NUMA_FAULT_INFO(vma);
+	unsigned long prev_start = NUMA_FAULT_WINDOW_START(numafault_ahead);
+	unsigned int prev_pages = NUMA_FAULT_WINDOW_SIZE(numafault_ahead);
+	unsigned long win_start;
+	unsigned int win_pages, max_fault_pages;
+
+	win_start = fault_address + PAGE_SIZE;
+
+	/*
+	 * The VMA is being accessed for the first time, so just open a
+	 * small window to try.
+	 */
+	if (!numafault_ahead) {
+		win_pages = NUMA_FAULT_AHEAD_DEFAULT;
+		goto out;
+	}
+
+	/*
+	 * If the last numa fault window was closed, check whether the current
+	 * fault address is contiguous with the previous fault address before
+	 * opening a new numa fault window.
+	 */
+	if (!prev_pages) {
+		if (fault_address == prev_start ||
+		    fault_address == prev_start + PAGE_SIZE)
+			win_pages = NUMA_FAULT_AHEAD_DEFAULT;
+		else
+			win_pages = 0;
+
+		goto out;
+	}
+
+	/*
+	 * TODO: handle the case where the fault address falls before the
+	 * last numa fault window.
+	 */
+	if (fault_address >= prev_start) {
+		unsigned long prev_end = prev_start + prev_pages * PAGE_SIZE;
+
+		/*
+		 * The fault continues the previous numa fault window, so
+		 * assume sequential access and expand the numa fault window.
+		 */
+		if (fault_address == prev_end ||
+		    fault_address == prev_end + PAGE_SIZE) {
+			win_pages = prev_pages + NUMA_FAULT_EXPAND_STEP;
+			goto validate_out;
+		} else if (fault_address < prev_end) {
+			/*
+			 * The current fault address is within the last numa
+			 * fault window, which means not all pages in that
+			 * window were migrated successfully. Keep the current
+			 * window size and try again, since the speculation
+			 * may still be on the right track.
+			 */
+			win_pages = prev_pages;
+			goto validate_out;
+		}
+	}
+
+	/*
+	 * Otherwise assume random access and reduce the numa fault window
+	 * by one step.
+	 */
+	if (prev_pages <= NUMA_FAULT_REDUCE_STEP) {
+		win_pages = 0;
+		goto out;
+	} else {
+		win_pages = prev_pages - NUMA_FAULT_REDUCE_STEP;
+	}
+
+validate_out:
+	/*
+	 * Make sure the numa fault window does not extend past the end
+	 * of the current VMA or PMD.
+	 */
+	max_fault_pages = numa_fault_max_pages(vma, fault_address);
+	if (win_pages > max_fault_pages)
+		win_pages = max_fault_pages;
+
+out:
+	atomic_long_set(&vma->numafault_ahead_info,
+			NUMA_FAULT_INFO(win_start, win_pages));
+	return win_pages;
+}
+
 static vm_fault_t do_numa_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct page *page = NULL;
-	int page_nid = NUMA_NO_NODE;
+	int page_nid;
 	int last_cpupid;
 	int target_nid;
 	pte_t pte, old_pte;
-	bool was_writable = pte_savedwrite(vmf->orig_pte);
-	int flags = 0;
+	bool was_writable;
+	int flags;
+	unsigned long fault_address = vmf->address;
+	unsigned int win_pages;
+
+	/* Try to speculate the numa fault window for the current VMA. */
+	win_pages = adjust_numa_fault_window(vma, fault_address);
+
+try_next:
+	was_writable = pte_savedwrite(vmf->orig_pte);
+	flags = 0;
+	page_nid = NUMA_NO_NODE;
 
 	/*
 	 * The "pte" at this point cannot be used safely without
@@ -4342,7 +4484,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	old_pte = ptep_get(vmf->pte);
 	pte = pte_modify(old_pte, vma->vm_page_prot);
 
-	page = vm_normal_page(vma, vmf->address, pte);
+	page = vm_normal_page(vma, fault_address, pte);
 	if (!page)
 		goto out_map;
 
@@ -4378,7 +4520,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		last_cpupid = (-1 & LAST_CPUPID_MASK);
 	else
 		last_cpupid = page_cpupid_last(page);
-	target_nid = numa_migrate_prep(page, vma, vmf->address, page_nid,
+	target_nid = numa_migrate_prep(page, vma, fault_address, page_nid,
 				       &flags);
 	if (target_nid == NUMA_NO_NODE) {
 		put_page(page);
@@ -4392,7 +4534,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		flags |= TNF_MIGRATED;
 	} else {
 		flags |= TNF_MIGRATE_FAIL;
-		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+		vmf->pte = pte_offset_map(vmf->pmd, fault_address);
 		spin_lock(vmf->ptl);
 		if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -4404,19 +4546,24 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 out:
 	if (page_nid != NUMA_NO_NODE)
 		task_numa_fault(last_cpupid, page_nid, 1, flags);
+
+	if ((flags & TNF_MIGRATED) && (win_pages > 0) &&
+	    try_next_numa_page(vmf, win_pages, &fault_address))
+		goto try_next;
+
 	return 0;
 out_map:
 	/*
 	 * Make it present again, depending on how arch implements
 	 * non-accessible ptes, some can allow access by kernel mode.
 	 */
-	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
+	old_pte = ptep_modify_prot_start(vma, fault_address, vmf->pte);
 	pte = pte_modify(old_pte, vma->vm_page_prot);
 	pte = pte_mkyoung(pte);
 	if (was_writable)
 		pte = pte_mkwrite(pte);
-	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
-	update_mmu_cache(vma, vmf->address, vmf->pte);
+	ptep_modify_prot_commit(vma, fault_address, vmf->pte, old_pte, pte);
+	update_mmu_cache(vma, fault_address, vmf->pte);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	goto out;
 }

From patchwork Sun Dec 12 11:31:58 2021
X-Patchwork-Submitter: Baolin Wang
X-Patchwork-Id: 12672219
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: akpm@linux-foundation.org, ying.huang@intel.com, dave.hansen@linux.intel.com
Cc: ziy@nvidia.com, shy828301@gmail.com, baolin.wang@linux.alibaba.com, zhongjiang-ali@linux.alibaba.com, xlpang@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 2/4] mm: Add a debug interface to control the range of speculative numa fault
Date: Sun, 12 Dec 2021 19:31:58 +0800
Message-Id: <913a8a5282d265dc771309ca552c9c62c247c2b0.1639306956.git.baolin.wang@linux.alibaba.com>

Add a debug interface to control the range of the speculative numa fault, which can be used to tune performance or even disable the speculative numa fault window entirely for some workloads.
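For example, assuming debugfs is mounted at /sys/kernel/debug (the file name comes from this patch, and the values below are only examples):

  # speculate over at most 256KB around each numa fault
  echo 262144 > /sys/kernel/debug/numa_around_bytes

  # disable speculative numa faults entirely
  echo 0 > /sys/kernel/debug/numa_around_bytes

Per the setter below, values larger than PAGE_SIZE are rounded down to a power of two, values of PAGE_SIZE or less disable speculation, and values larger than PTRS_PER_PTE pages are rejected with -EINVAL.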
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/memory.c | 46 +++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 43 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2c9ed63e4e23..a0f4a2a008cc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4052,7 +4052,29 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 static unsigned long fault_around_bytes __read_mostly =
 	rounddown_pow_of_two(65536);
 
+static unsigned long numa_around_bytes __read_mostly;
+
 #ifdef CONFIG_DEBUG_FS
+static int numa_around_bytes_get(void *data, u64 *val)
+{
+	*val = numa_around_bytes;
+	return 0;
+}
+
+static int numa_around_bytes_set(void *data, u64 val)
+{
+	if (val / PAGE_SIZE > PTRS_PER_PTE)
+		return -EINVAL;
+	if (val > PAGE_SIZE)
+		numa_around_bytes = rounddown_pow_of_two(val);
+	else
+		numa_around_bytes = 0; /* rounddown_pow_of_two(0) is undefined */
+	return 0;
+}
+DEFINE_DEBUGFS_ATTRIBUTE(numa_around_bytes_fops,
+			 numa_around_bytes_get,
+			 numa_around_bytes_set, "%llu\n");
+
 static int fault_around_bytes_get(void *data, u64 *val)
 {
 	*val = fault_around_bytes;
@@ -4080,6 +4102,8 @@ static int __init fault_around_debugfs(void)
 {
 	debugfs_create_file_unsafe("fault_around_bytes", 0644, NULL, NULL,
 				   &fault_around_bytes_fops);
+	debugfs_create_file_unsafe("numa_around_bytes", 0644, NULL, NULL,
+				   &numa_around_bytes_fops);
 	return 0;
 }
 late_initcall(fault_around_debugfs);
@@ -4348,10 +4372,13 @@ static bool try_next_numa_page(struct vm_fault *vmf, unsigned int win_pages,
 	 ((win) & NUMA_FAULT_WINDOW_SIZE_MASK))
 
 static inline unsigned int numa_fault_max_pages(struct vm_area_struct *vma,
-						unsigned long fault_address)
+						unsigned long fault_address,
+						unsigned long numa_around_size)
 {
+	unsigned long numa_around_addr =
+		(fault_address + numa_around_size) & PAGE_MASK;
 	unsigned long pmd_end_addr = (fault_address & PMD_MASK) + PMD_SIZE;
-	unsigned long max_fault_addr = min_t(unsigned long, pmd_end_addr,
+	unsigned long max_fault_addr = min3(numa_around_addr, pmd_end_addr,
 					    vma->vm_end);
 
 	return (max_fault_addr - fault_address - 1) >> PAGE_SHIFT;
@@ -4360,12 +4387,24 @@ static inline unsigned int numa_fault_max_pages(struct vm_area_struct *vma,
 static unsigned int adjust_numa_fault_window(struct vm_area_struct *vma,
 					     unsigned long fault_address)
 {
+	unsigned long numa_around_size = READ_ONCE(numa_around_bytes);
 	unsigned long numafault_ahead = GET_NUMA_FAULT_INFO(vma);
 	unsigned long prev_start = NUMA_FAULT_WINDOW_START(numafault_ahead);
 	unsigned int prev_pages = NUMA_FAULT_WINDOW_SIZE(numafault_ahead);
 	unsigned long win_start;
 	unsigned int win_pages, max_fault_pages;
 
+	/*
+	 * Shut down the proactive numa fault if numa_around_bytes is
+	 * set to 0.
+	 */
+	if (!numa_around_size) {
+		if (numafault_ahead)
+			atomic_long_set(&vma->numafault_ahead_info,
+					NUMA_FAULT_INFO(0, 0));
+		return 0;
+	}
+
 	win_start = fault_address + PAGE_SIZE;
 
 	/*
@@ -4437,7 +4476,8 @@ static unsigned int adjust_numa_fault_window(struct vm_area_struct *vma,
 	 * Make sure the numa fault window does not extend past the end
 	 * of the current VMA or PMD.
 	 */
-	max_fault_pages = numa_fault_max_pages(vma, fault_address);
+	max_fault_pages = numa_fault_max_pages(vma, fault_address,
+					       numa_around_size);
 	if (win_pages > max_fault_pages)
 		win_pages = max_fault_pages;

From patchwork Sun Dec 12 11:31:59 2021
X-Patchwork-Submitter: Baolin Wang
X-Patchwork-Id: 12672223
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: akpm@linux-foundation.org, ying.huang@intel.com, dave.hansen@linux.intel.com
Cc: ziy@nvidia.com, shy828301@gmail.com, baolin.wang@linux.alibaba.com, zhongjiang-ali@linux.alibaba.com, xlpang@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 3/4] mm: Add speculative numa fault stats
Date: Sun, 12 Dec 2021 19:31:59 +0800

Add a new pgmigrate_speculation statistic to help tune the speculative numa fault window.
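The counter is only incremented when a speculatively selected page (i.e. a page other than the one that actually took the numa hint fault) is migrated successfully, so it can be read from /proc/vmstat to judge how effective the speculation window is, for example:

  grep pgmigrate_speculation /proc/vmstat

Comparing it against the existing pgmigrate_success counter over a benchmark run gives a rough idea of how much of the migration traffic comes from speculation.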
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 include/linux/vm_event_item.h | 1 +
 mm/memory.c                   | 2 ++
 mm/vmstat.c                   | 1 +
 3 files changed, 4 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index a185cc75ff52..97cdc661b7da 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -62,6 +62,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_MIGRATION_SUCCESS,
 		THP_MIGRATION_FAIL,
 		THP_MIGRATION_SPLIT,
+		PGMIGRATE_SPECULATION,
 #endif
 #ifdef CONFIG_COMPACTION
 		COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED,
diff --git a/mm/memory.c b/mm/memory.c
index a0f4a2a008cc..91122beb6e53 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4572,6 +4572,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	if (migrate_misplaced_page(page, vma, target_nid)) {
 		page_nid = target_nid;
 		flags |= TNF_MIGRATED;
+		if (vmf->address != fault_address)
+			count_vm_events(PGMIGRATE_SPECULATION, 1);
 	} else {
 		flags |= TNF_MIGRATE_FAIL;
 		vmf->pte = pte_offset_map(vmf->pmd, fault_address);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 787a012de3e2..c64700994786 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1314,6 +1314,7 @@ const char * const vmstat_text[] = {
 	"thp_migration_success",
 	"thp_migration_fail",
 	"thp_migration_split",
+	"pgmigrate_speculation",
 #endif
 #ifdef CONFIG_COMPACTION
 	"compact_migrate_scanned",

From patchwork Sun Dec 12 11:32:00 2021
X-Patchwork-Submitter: Baolin Wang
X-Patchwork-Id: 12672221
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: akpm@linux-foundation.org, ying.huang@intel.com, dave.hansen@linux.intel.com
Cc: ziy@nvidia.com, shy828301@gmail.com, baolin.wang@linux.alibaba.com, zhongjiang-ali@linux.alibaba.com, xlpang@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 4/4] mm: Update the speculative pages' access time
Date: Sun, 12 Dec 2021 19:32:00 +0800

Systems with multiple memory types, such as fast memory (DRAM) and slow memory (persistent memory), rely on NUMA balancing to promote hot pages from slow memory to fast memory to improve performance. Now that the speculative numa fault is supported, we can also update the access time of the speculatively selected pages, which makes it easier to promote them to the fast memory node and further improves performance.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/memory.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 91122beb6e53..e19b10299913 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4556,10 +4556,21 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	 * to record page access time. So use default value.
 	 */
 	if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
-	    !node_is_toptier(page_nid))
+	    !node_is_toptier(page_nid)) {
 		last_cpupid = (-1 & LAST_CPUPID_MASK);
-	else
+		/*
+		 * Due to data locality, for some workloads there is a high
+		 * probability that data near recently accessed data will be
+		 * accessed soon. So for tiered memory systems, update the
+		 * access time of the sequential pages located on the slow
+		 * memory type, to try to promote them to fast memory in
+		 * advance and improve performance.
+		 */
+		if (vmf->address != fault_address)
+			xchg_page_access_time(page, jiffies_to_msecs(jiffies));
+	} else {
 		last_cpupid = page_cpupid_last(page);
+	}
 	target_nid = numa_migrate_prep(page, vma, fault_address, page_nid,
 				       &flags);
 	if (target_nid == NUMA_NO_NODE) {