From patchwork Tue Feb 18 08:26:27 2020
X-Patchwork-Id: 11387979
From: "Huang, Ying" <ying.huang@intel.com>
To: Peter Zijlstra, Ingo Molnar
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Feng Tang, Huang Ying,
    Andrew Morton, Michal Hocko, Rik van Riel, Mel Gorman, Dave Hansen,
    Dan Williams
Subject: [RFC -V2 1/8] autonuma: Add NUMA_BALANCING_MEMORY_TIERING mode
Date: Tue, 18 Feb 2020 16:26:27 +0800
Message-Id: <20200218082634.1596727-2-ying.huang@intel.com>
In-Reply-To: <20200218082634.1596727-1-ying.huang@intel.com>
References: <20200218082634.1596727-1-ying.huang@intel.com>
From: Huang Ying

With the advent of various new memory types, some machines will have multiple types of memory, e.g. DRAM and PMEM (persistent memory). Because the performance of the different memory types can differ, such a memory subsystem is called a memory tiering system.

In a typical memory tiering system, each physical NUMA node contains CPUs, fast memory, and slow memory. The CPUs and the fast memory are put in one logical node (called the fast memory node), while the slow memory is put in another (fake) logical node (called the slow memory node). Autonuma already provides a set of mechanisms to identify the pages recently accessed by the CPUs of a node and to migrate those pages to that node. So the optimization of promoting the hot pages in the slow memory node to the fast memory node in a memory tiering system can be implemented on top of the autonuma framework.

But the requirements of hot page promotion in a memory tiering system differ from those of normal NUMA balancing in some aspects. E.g. for hot page promotion, we can skip scanning the fastest memory node because there is nowhere to promote its hot pages to.

To make autonuma work for both normal NUMA balancing and memory tiering hot page promotion, we define a set of flags and make the value of sysctl_numa_balancing_mode the bitwise OR of these flags. The flags are as follows,

- 0x0: NUMA_BALANCING_DISABLED
- 0x1: NUMA_BALANCING_NORMAL
- 0x2: NUMA_BALANCING_MEMORY_TIERING

NUMA_BALANCING_NORMAL enables normal NUMA balancing across sockets, while NUMA_BALANCING_MEMORY_TIERING enables hot page promotion across memory tiers. They can be enabled individually or together. If all flags are cleared, autonuma is disabled completely. The sysctl interface is extended accordingly in a backward compatible way.
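For illustration only (not part of the patch): a minimal user-space sketch of driving the extended sysctl, assuming the flag values above and the existing /proc/sys/kernel/numa_balancing file; everything else is example scaffolding.

/* Example: enable both normal NUMA balancing and memory tiering hot page
 * promotion by writing the OR of the mode flags to the sysctl file.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define NUMA_BALANCING_NORMAL		0x1
#define NUMA_BALANCING_MEMORY_TIERING	0x2

int main(void)
{
	int mode = NUMA_BALANCING_NORMAL | NUMA_BALANCING_MEMORY_TIERING;
	char buf[16];
	int len, fd;

	fd = open("/proc/sys/kernel/numa_balancing", O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	len = snprintf(buf, sizeof(buf), "%d", mode);
	if (write(fd, buf, len) != len)
		perror("write");
	close(fd);
	return 0;
}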
TODO: - Update ABI document: Documentation/sysctl/kernel.txt Signed-off-by: "Huang, Ying" Cc: Andrew Morton Cc: Michal Hocko Cc: Rik van Riel Cc: Mel Gorman Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Dave Hansen Cc: Dan Williams Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/sched/sysctl.h | 5 +++++ kernel/sched/core.c | 9 +++------ kernel/sysctl.c | 7 ++++--- 3 files changed, 12 insertions(+), 9 deletions(-) diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index d4f6215ee03f..80dc5030c797 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -33,6 +33,11 @@ enum sched_tunable_scaling { }; extern enum sched_tunable_scaling sysctl_sched_tunable_scaling; +#define NUMA_BALANCING_DISABLED 0x0 +#define NUMA_BALANCING_NORMAL 0x1 +#define NUMA_BALANCING_MEMORY_TIERING 0x2 + +extern int sysctl_numa_balancing_mode; extern unsigned int sysctl_numa_balancing_scan_delay; extern unsigned int sysctl_numa_balancing_scan_period_min; extern unsigned int sysctl_numa_balancing_scan_period_max; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 90e4b00ace89..2d3f456d0ef6 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2723,6 +2723,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) } DEFINE_STATIC_KEY_FALSE(sched_numa_balancing); +int sysctl_numa_balancing_mode; #ifdef CONFIG_NUMA_BALANCING @@ -2738,20 +2739,16 @@ void set_numabalancing_state(bool enabled) int sysctl_numa_balancing(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos) { - struct ctl_table t; int err; - int state = static_branch_likely(&sched_numa_balancing); if (write && !capable(CAP_SYS_ADMIN)) return -EPERM; - t = *table; - t.data = &state; - err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos); + err = proc_dointvec_minmax(table, write, buffer, lenp, ppos); if (err < 0) return err; if (write) - set_numabalancing_state(state); + set_numabalancing_state(*(int *)table->data); return err; } #endif diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 70665934d53e..3756108bb658 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -126,6 +126,7 @@ static int sixty = 60; static int __maybe_unused neg_one = -1; static int __maybe_unused two = 2; +static int __maybe_unused three = 3; static int __maybe_unused four = 4; static unsigned long zero_ul; static unsigned long one_ul = 1; @@ -420,12 +421,12 @@ static struct ctl_table kern_table[] = { }, { .procname = "numa_balancing", - .data = NULL, /* filled in by handler */ - .maxlen = sizeof(unsigned int), + .data = &sysctl_numa_balancing_mode, + .maxlen = sizeof(int), .mode = 0644, .proc_handler = sysctl_numa_balancing, .extra1 = SYSCTL_ZERO, - .extra2 = SYSCTL_ONE, + .extra2 = &three, }, #endif /* CONFIG_NUMA_BALANCING */ #endif /* CONFIG_SCHED_DEBUG */ From patchwork Tue Feb 18 08:26:28 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Ying" X-Patchwork-Id: 11387981 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id CAD741580 for ; Tue, 18 Feb 2020 08:27:24 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 94BC22467A for ; Tue, 18 Feb 2020 08:27:24 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 94BC22467A Authentication-Results: mail.kernel.org; dmarc=fail (p=none 
From: "Huang, Ying" <ying.huang@intel.com>
To: Peter Zijlstra, Ingo Molnar
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Feng Tang, Huang Ying,
    Andrew Morton, Michal Hocko, Rik van Riel, Mel Gorman, Dave Hansen,
    Dan Williams
Subject: [RFC -V2 2/8] autonuma, memory tiering: Rate limit NUMA migration throughput
Date: Tue, 18 Feb 2020 16:26:28 +0800
Message-Id: <20200218082634.1596727-3-ying.huang@intel.com>
In-Reply-To: <20200218082634.1596727-1-ying.huang@intel.com>
References: <20200218082634.1596727-1-ying.huang@intel.com>

From: Huang Ying

In autonuma memory tiering mode, hot PMEM (persistent memory) pages can be migrated to DRAM via autonuma. But this incurs some overhead too, so the workload performance may sometimes be hurt. To avoid disturbing the workload too much, the migration throughput should be rate-limited.

On the other hand, in some situations, for example when some workloads exit and many DRAM pages become free, some pages of the other workloads can be migrated to DRAM.
To respond to the workloads changing quickly, it's better to migrate pages faster. To address the above 2 requirements, a rate limit algorithm as follows is used, - If there is enough free memory in DRAM node (that is, > high watermark + 2 * rate limit pages), then NUMA migration throughput will not be rate-limited to respond to the workload changing quickly. - Otherwise, counting the number of pages to try to migrate to a DRAM node via autonuma, if the count exceeds the limit specified by the users, stop NUMA migration until the next second. A new sysctl knob kernel.numa_balancing_rate_limit_mbps is added for the users to specify the limit. If its value is 0, the default value (high watermark) will be used. TODO: Add ABI document for new sysctl knob. Signed-off-by: "Huang, Ying" Cc: Andrew Morton Cc: Michal Hocko Cc: Rik van Riel Cc: Mel Gorman Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Dave Hansen Cc: Dan Williams Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/mmzone.h | 7 ++++ include/linux/sched/sysctl.h | 6 ++++ kernel/sched/fair.c | 62 ++++++++++++++++++++++++++++++++++++ kernel/sysctl.c | 8 +++++ mm/vmstat.c | 3 ++ 5 files changed, 86 insertions(+) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index dfb09106ad70..6e7a28becdc2 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -249,6 +249,9 @@ enum node_stat_item { NR_DIRTIED, /* page dirtyings since bootup */ NR_WRITTEN, /* page writings since bootup */ NR_KERNEL_MISC_RECLAIMABLE, /* reclaimable non-slab kernel pages */ +#ifdef CONFIG_NUMA_BALANCING + NUMA_TRY_MIGRATE, /* pages to try to migrate via NUMA balancing */ +#endif NR_VM_NODE_STAT_ITEMS }; @@ -786,6 +789,10 @@ typedef struct pglist_data { struct deferred_split deferred_split_queue; #endif +#ifdef CONFIG_NUMA_BALANCING + unsigned long numa_ts; + unsigned long numa_try; +#endif /* Fields commonly accessed by the page reclaim scanner */ /* diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 80dc5030c797..c4b27790b901 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -43,6 +43,12 @@ extern unsigned int sysctl_numa_balancing_scan_period_min; extern unsigned int sysctl_numa_balancing_scan_period_max; extern unsigned int sysctl_numa_balancing_scan_size; +#ifdef CONFIG_NUMA_BALANCING +extern unsigned int sysctl_numa_balancing_rate_limit; +#else +#define sysctl_numa_balancing_rate_limit 0 +#endif + #ifdef CONFIG_SCHED_DEBUG extern __read_mostly unsigned int sysctl_sched_migration_cost; extern __read_mostly unsigned int sysctl_sched_nr_migrate; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ba749f579714..ef694816150b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1064,6 +1064,12 @@ unsigned int sysctl_numa_balancing_scan_size = 256; /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */ unsigned int sysctl_numa_balancing_scan_delay = 1000; +/* + * Restrict the NUMA migration per second in MB for each target node + * if no enough free space in target node + */ +unsigned int sysctl_numa_balancing_rate_limit; + struct numa_group { refcount_t refcount; @@ -1404,6 +1410,43 @@ static inline unsigned long group_weight(struct task_struct *p, int nid, return 1000 * faults / total_faults; } +static bool pgdat_free_space_enough(struct pglist_data *pgdat) +{ + int z; + unsigned long rate_limit; + + rate_limit = sysctl_numa_balancing_rate_limit << (20 - PAGE_SHIFT); + for (z = pgdat->nr_zones - 1; z >= 0; z--) { + struct zone *zone 
= pgdat->node_zones + z; + + if (!populated_zone(zone)) + continue; + + if (zone_watermark_ok(zone, 0, + high_wmark_pages(zone) + rate_limit * 2, + ZONE_MOVABLE, 0)) + return true; + } + return false; +} + +static bool numa_migration_check_rate_limit(struct pglist_data *pgdat, + unsigned long rate_limit, int nr) +{ + unsigned long try; + unsigned long now = jiffies, last_ts; + + mod_node_page_state(pgdat, NUMA_TRY_MIGRATE, nr); + try = node_page_state(pgdat, NUMA_TRY_MIGRATE); + last_ts = pgdat->numa_ts; + if (now > last_ts + HZ && + cmpxchg(&pgdat->numa_ts, last_ts, now) == last_ts) + pgdat->numa_try = try; + if (try - pgdat->numa_try > rate_limit) + return false; + return true; +} + bool should_numa_migrate_memory(struct task_struct *p, struct page * page, int src_nid, int dst_cpu) { @@ -1411,6 +1454,25 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page, int dst_nid = cpu_to_node(dst_cpu); int last_cpupid, this_cpupid; + /* + * If memory tiering mode is enabled, will try promote pages + * in slow memory node to fast memory node. + */ + if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING && + next_promotion_node(src_nid) != -1) { + struct pglist_data *pgdat; + unsigned long rate_limit; + + pgdat = NODE_DATA(dst_nid); + if (pgdat_free_space_enough(pgdat)) + return true; + + rate_limit = + sysctl_numa_balancing_rate_limit << (20 - PAGE_SHIFT); + return numa_migration_check_rate_limit(pgdat, rate_limit, + hpage_nr_pages(page)); + } + this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid); last_cpupid = page_cpupid_xchg_last(page, this_cpupid); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 3756108bb658..2d19e821267a 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -419,6 +419,14 @@ static struct ctl_table kern_table[] = { .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ONE, }, + { + .procname = "numa_balancing_rate_limit_mbps", + .data = &sysctl_numa_balancing_rate_limit, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + }, { .procname = "numa_balancing", .data = &sysctl_numa_balancing_mode, diff --git a/mm/vmstat.c b/mm/vmstat.c index d76714d2fd7c..9326512c612c 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1203,6 +1203,9 @@ const char * const vmstat_text[] = { "nr_dirtied", "nr_written", "nr_kernel_misc_reclaimable", +#ifdef CONFIG_NUMA_BALANCING + "numa_try_migrate", +#endif /* enum writeback_stat_item counters */ "nr_dirty_threshold", From patchwork Tue Feb 18 08:26:29 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Ying" X-Patchwork-Id: 11387983 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7367B13A4 for ; Tue, 18 Feb 2020 08:27:27 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 3C74821D56 for ; Tue, 18 Feb 2020 08:27:27 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 3C74821D56 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 0FAD46B0008; Tue, 18 Feb 2020 03:27:26 -0500 (EST) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 0ABBA6B000A; Tue, 18 Feb 2020 03:27:26 
From: "Huang, Ying" <ying.huang@intel.com>
To: Peter Zijlstra, Ingo Molnar
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Feng Tang, Huang Ying,
    Andrew Morton, Michal Hocko, Rik van Riel, Mel Gorman, Dave Hansen,
    Dan Williams
Subject: [RFC -V2 3/8] autonuma, memory tiering: Use kswapd to demote cold pages to PMEM
Date: Tue, 18 Feb 2020 16:26:29 +0800
Message-Id: <20200218082634.1596727-4-ying.huang@intel.com>
In-Reply-To: <20200218082634.1596727-1-ying.huang@intel.com>
References: <20200218082634.1596727-1-ying.huang@intel.com>

From: Huang Ying

In a memory tiering system, if the memory size of the workload is smaller than that of the faster memory (e.g. DRAM) nodes, all pages of the workload should be put in the faster memory nodes; but then there is no need to use the slower memory (e.g. PMEM) at all. So in the common case, the memory size of the workload should be larger than that of the faster memory nodes. And to optimize performance, the hot pages should be promoted to the faster memory nodes while the cold pages should be demoted to the slower memory nodes. To achieve that, we have two choices,

a. Promote the hot pages from the slower memory node to the faster memory node. This will create some memory pressure in the faster memory node, thus triggering memory reclaim, where the cold pages will be demoted to the slower memory node.

b. Demote the cold pages from the faster memory node to the slower memory node.
This will create some free memory space in the faster memory node, and the hot pages in the slower memory node could be promoted to the faster memory node. The choice "a" will create the memory pressure in the faster memory node. If the memory pressure of the workload is high too, the memory pressure may become so high that the memory allocation latency of the workload is influenced, e.g. the direct reclaiming may be triggered. The choice "b" works much better at this aspect. If the memory pressure of the workload is high, it will consume the free memory and the hot pages promotion will stop earlier if its allocation watermark is higher than that of the normal memory allocation. In this patch, choice "b" is implemented. If memory tiering NUMA balancing mode is enabled, the node isn't the slowest node, and the free memory size of the node is below the high watermark, the kswapd of the node will be waken up to free some memory until the free memory size is above the high watermark + autonuma promotion rate limit. If the free memory size is below the high watermark, autonuma promotion will stop working. This avoids to create too much memory pressure to the system. Signed-off-by: "Huang, Ying" Cc: Andrew Morton Cc: Michal Hocko Cc: Rik van Riel Cc: Mel Gorman Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Dave Hansen Cc: Dan Williams Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- mm/migrate.c | 26 +++++++++++++++++--------- mm/vmscan.c | 7 +++++++ 2 files changed, 24 insertions(+), 9 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 0b046759f99a..bbf16764d105 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -48,6 +48,7 @@ #include #include #include +#include #include @@ -1946,8 +1947,7 @@ COMPAT_SYSCALL_DEFINE6(move_pages, pid_t, pid, compat_ulong_t, nr_pages, * Returns true if this is a safe migration target node for misplaced NUMA * pages. Currently it only checks the watermarks which crude */ -static bool migrate_balanced_pgdat(struct pglist_data *pgdat, - unsigned long nr_migrate_pages) +static bool migrate_balanced_pgdat(struct pglist_data *pgdat, int order) { int z; @@ -1958,12 +1958,9 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat, continue; /* Avoid waking kswapd by allocating pages_to_migrate pages. 
*/ - if (!zone_watermark_ok(zone, 0, - high_wmark_pages(zone) + - nr_migrate_pages, - ZONE_MOVABLE, 0)) - continue; - return true; + if (zone_watermark_ok(zone, order, high_wmark_pages(zone), + ZONE_MOVABLE, 0)) + return true; } return false; } @@ -1990,8 +1987,19 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page) VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page); /* Avoid migrating to a node that is nearly full */ - if (!migrate_balanced_pgdat(pgdat, compound_nr(page))) + if (!migrate_balanced_pgdat(pgdat, compound_order(page))) { + if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) { + int z; + + for (z = pgdat->nr_zones - 1; z >= 0; z--) { + if (populated_zone(pgdat->node_zones + z)) + break; + } + wakeup_kswapd(pgdat->node_zones + z, + 0, compound_order(page), ZONE_MOVABLE); + } return 0; + } if (isolate_lru_page(page)) return 0; diff --git a/mm/vmscan.c b/mm/vmscan.c index fe90236045d5..b265868d62ef 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -57,6 +57,7 @@ #include #include +#include #include "internal.h" @@ -3462,8 +3463,11 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx) { int i; unsigned long mark = -1; + unsigned long promote_ratelimit; struct zone *zone; + promote_ratelimit = sysctl_numa_balancing_rate_limit << + (20 - PAGE_SHIFT); /* * Check watermarks bottom-up as lower zones are more likely to * meet watermarks. @@ -3475,6 +3479,9 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx) continue; mark = high_wmark_pages(zone); + if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING && + next_migration_node(pgdat->node_id) != -1) + mark += promote_ratelimit; if (zone_watermark_ok_safe(zone, order, mark, classzone_idx)) return true; } From patchwork Tue Feb 18 08:26:30 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Ying" X-Patchwork-Id: 11387985 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 3AD7B13A4 for ; Tue, 18 Feb 2020 08:27:30 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 11CA721D7D for ; Tue, 18 Feb 2020 08:27:30 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 11CA721D7D Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id DF3E06B000A; Tue, 18 Feb 2020 03:27:28 -0500 (EST) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id DCA636B000C; Tue, 18 Feb 2020 03:27:28 -0500 (EST) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D07576B000D; Tue, 18 Feb 2020 03:27:28 -0500 (EST) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0102.hostedemail.com [216.40.44.102]) by kanga.kvack.org (Postfix) with ESMTP id B829F6B000A for ; Tue, 18 Feb 2020 03:27:28 -0500 (EST) Received: from smtpin30.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 4E3F0180AD817 for ; Tue, 18 Feb 2020 08:27:28 +0000 (UTC) X-FDA: 76502568576.30.comb55_412b27e685813 
From: "Huang, Ying" <ying.huang@intel.com>
To: Peter Zijlstra, Ingo Molnar
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Feng Tang, Huang Ying,
    Andrew Morton, Michal Hocko, Rik van Riel, Mel Gorman, Dave Hansen,
    Dan Williams
Subject: [RFC -V2 4/8] autonuma, memory tiering: Skip to scan fastest memory
Date: Tue, 18 Feb 2020 16:26:30 +0800
Message-Id: <20200218082634.1596727-5-ying.huang@intel.com>
In-Reply-To: <20200218082634.1596727-1-ying.huang@intel.com>
References: <20200218082634.1596727-1-ying.huang@intel.com>

From: Huang Ying

In memory tiering NUMA balancing mode, the hot pages of the workload in the fastest memory node cannot be promoted anywhere, so it's unnecessary to identify the hot pages in the fastest memory node by changing their PTE mappings to PROT_NONE. This also avoids the corresponding page faults.

The patch improves the score of the pmbench memory accessing benchmark with an 80:20 read/write ratio and a normal access address distribution by 4.6% on a 2-socket Intel server with Optane DC Persistent Memory. The autonuma hint faults for the DRAM node are reduced to almost 0 in the test.

Known problem: autonuma statistics such as per-node memory accesses and the local/remote ratio will be influenced. In particular, the automatic adjustment of the NUMA scanning period will not work reasonably, so we cannot rely on it. Fortunately, there are no CPUs in the PMEM NUMA nodes, so we will not move tasks there because of the statistics issue.
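For clarity, a minimal sketch of the skip decision described above, not the patch's code: sysctl_numa_balancing_mode and NUMA_BALANCING_NORMAL come from patch 1 of this series, and next_promotion_node() is assumed to be the helper introduced elsewhere in the series that returns -1 when a node has no faster tier to promote to.

/*
 * Sketch only: should the NUMA scanner leave a page alone?  Pages that
 * already sit in the fastest tier have no promotion target, so making
 * them PROT_NONE would only add hint page faults.
 */
static bool numa_scan_can_skip_page(int page_nid)
{
	/* With normal NUMA balancing enabled, every node is interesting. */
	if (sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL)
		return false;

	/* Tiering-only mode: skip pages that cannot be promoted anyway. */
	return next_promotion_node(page_nid) == -1;
}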
Signed-off-by: "Huang, Ying" Cc: Andrew Morton Cc: Michal Hocko Cc: Rik van Riel Cc: Mel Gorman Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Dave Hansen Cc: Dan Williams Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- mm/huge_memory.c | 30 +++++++++++++++++++++--------- mm/mprotect.c | 14 +++++++++++++- 2 files changed, 34 insertions(+), 10 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index a88093213674..d45de9b1ead9 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -33,6 +33,7 @@ #include #include #include +#include #include #include @@ -1967,17 +1968,28 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, } #endif - /* - * Avoid trapping faults against the zero page. The read-only - * data is likely to be read-cached on the local CPU and - * local/remote hits to the zero page are not interesting. - */ - if (prot_numa && is_huge_zero_pmd(*pmd)) - goto unlock; + if (prot_numa) { + struct page *page; + /* + * Avoid trapping faults against the zero page. The read-only + * data is likely to be read-cached on the local CPU and + * local/remote hits to the zero page are not interesting. + */ + if (is_huge_zero_pmd(*pmd)) + goto unlock; - if (prot_numa && pmd_protnone(*pmd)) - goto unlock; + if (pmd_protnone(*pmd)) + goto unlock; + page = pmd_page(*pmd); + /* + * Skip if normal numa balancing is disabled and no + * faster memory node to promote to + */ + if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && + next_promotion_node(page_to_nid(page)) == -1) + goto unlock; + } /* * In case prot_numa, we are under down_read(mmap_sem). It's critical * to not clear pmd intermittently to avoid race with MADV_DONTNEED diff --git a/mm/mprotect.c b/mm/mprotect.c index 7a8e84f86831..7322c98284ac 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include #include @@ -79,6 +80,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, */ if (prot_numa) { struct page *page; + int nid; /* Avoid TLB flush if possible */ if (pte_protnone(oldpte)) @@ -105,7 +107,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, * Don't mess with PTEs if page is already on the node * a single-threaded process is running on. 
*/ - if (target_node == page_to_nid(page)) + nid = page_to_nid(page); + if (target_node == nid) + continue; + + /* + * Skip scanning if normal numa + * balancing is disabled and no faster + * memory node to promote to + */ + if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && + next_promotion_node(nid) == -1) continue; } From patchwork Tue Feb 18 08:26:31 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Ying" X-Patchwork-Id: 11387989 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 91BFF13A4 for ; Tue, 18 Feb 2020 08:27:35 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 5A55824656 for ; Tue, 18 Feb 2020 08:27:35 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 5A55824656 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 765476B000D; Tue, 18 Feb 2020 03:27:32 -0500 (EST) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 6C5E06B000E; Tue, 18 Feb 2020 03:27:32 -0500 (EST) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 53EF76B0010; Tue, 18 Feb 2020 03:27:32 -0500 (EST) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0129.hostedemail.com [216.40.44.129]) by kanga.kvack.org (Postfix) with ESMTP id 385CE6B000D for ; Tue, 18 Feb 2020 03:27:32 -0500 (EST) Received: from smtpin28.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id C8ADB180AD817 for ; Tue, 18 Feb 2020 08:27:31 +0000 (UTC) X-FDA: 76502568702.28.arm31_41ac6cacb5c1b X-Spam-Summary: 1,0,0,,d41d8cd98f00b204,ying.huang@intel.com,:peterz@infradead.org:mingo@kernel.org::linux-kernel@vger.kernel.org:feng.tang@intel.com:ying.huang@intel.com:akpm@linux-foundation.org:mhocko@suse.com:riel@redhat.com:mgorman@suse.de:dave.hansen@linux.intel.com:dan.j.williams@intel.com,RULES_HIT:30003:30034:30036:30054:30055:30064:30070:30071:30075:30091,0,RBL:134.134.136.100:@intel.com:.lbl8.mailshell.net-62.18.0.100 64.95.201.95,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:ft,MSBL:0,DNSBL:none,Custom_rules:0:1:0,LFtime:24,LUA_SUMMARY:none X-HE-Tag: arm31_41ac6cacb5c1b X-Filterd-Recvd-Size: 8814 Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by imf14.hostedemail.com (Postfix) with ESMTP for ; Tue, 18 Feb 2020 08:27:31 +0000 (UTC) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga003.jf.intel.com ([10.7.209.27]) by orsmga105.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 18 Feb 2020 00:27:30 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.70,455,1574150400"; d="scan'208";a="235466684" Received: from yhuang-dev.sh.intel.com ([10.239.159.151]) by orsmga003.jf.intel.com with ESMTP; 18 Feb 2020 00:27:27 -0800 From: "Huang, Ying" To: Peter Zijlstra , Ingo Molnar Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Feng Tang , Huang Ying , Andrew Morton , Michal Hocko , Rik van Riel , Mel Gorman , Dave Hansen , Dan Williams Subject: 
[RFC -V2 5/8] autonuma, memory tiering: Only promote page if accessed twice
Date: Tue, 18 Feb 2020 16:26:31 +0800
Message-Id: <20200218082634.1596727-6-ying.huang@intel.com>
In-Reply-To: <20200218082634.1596727-1-ying.huang@intel.com>
References: <20200218082634.1596727-1-ying.huang@intel.com>

From: Huang Ying

The original assumption of auto NUMA balancing is that the memory privately or mainly accessed by the CPUs of a NUMA node (called private memory) should fit in that NUMA node. So if a page is identified as private memory, it will be migrated to the target node immediately, and eventually all private memory will be migrated.

But this assumption isn't true in a memory tiering system. In a typical memory tiering system, there are CPUs, fast memory, and slow memory in each physical NUMA node. The CPUs and the fast memory are put in one logical node (called the fast memory node), while the slow memory is put in another (fake) logical node (called the slow memory node). To take full advantage of the system resources, it's common that the size of the private memory of the workload is larger than the memory size of the fast memory node.

To resolve the issue, we try to migrate only the hot pages in the private memory to the fast memory node. A private page that was accessed at least twice, in the current and the last scanning periods, is identified as a hot page and migrated. Otherwise, the page isn't considered hot enough to be migrated.

To record whether a page was accessed in the last scanning period, the Accessed bit of the PTE/PMD is used. When the page tables are scanned for autonuma, if pte_protnone(pte) is true, the page wasn't accessed in the last scan period and the Accessed bit is cleared; otherwise the Accessed bit is kept. When a NUMA page fault occurs, if the Accessed bit is set, the page has been accessed at least twice, in the current and the last scanning periods, and will be migrated.

The Accessed bit of the PTE/PMD is used by page reclaiming too, so a conflict is possible. Consider the following situation,

a) the page is moved from the active list to the inactive list with the Accessed bit cleared
b) the page is accessed, so the Accessed bit is set
c) the page table is scanned by autonuma, the PTE is set to PROTNONE+Accessed
d) the page isn't accessed
e) the page table is scanned by autonuma again, Accessed is cleared
f) the inactive list is scanned for reclaiming, and the page is reclaimed wrongly because the Accessed bit was cleared by autonuma

Although the page is reclaimed wrongly, it hasn't been accessed for at least one NUMA balancing scanning period, so the page isn't that hot anyway. That is, this shouldn't be a severe issue.

The patch improves the score of the pmbench memory accessing benchmark with an 80:20 read/write ratio and a normal access address distribution by 3.1% on a 2-socket Intel server with Optane DC Persistent Memory. In the test, the number of pages promoted by autonuma is reduced by 7.2% because some pages fail to pass the twice-access check.

Problems:

- how to adjust the scanning period upon the hot page identification requirement. E.g. if the count of page promotions is much larger than the free memory, we need to scan faster to identify really hot pages. But this will trigger too many page faults too.
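As a rough illustration of the check this adds at hint page fault time (a sketch of the idea only; the real changes are in the diff below):

/*
 * Illustration only: under the scheme above the scanner keeps the
 * Accessed (young) bit when it first makes a PTE PROT_NONE and clears
 * it if the PTE is still PROT_NONE on the next scan.  So a set young
 * bit at hint fault time means the page was accessed in the last
 * scanning period as well as the current one.
 */
static bool page_accessed_twice(pte_t old_pte)
{
	return pte_young(old_pte);
}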
Signed-off-by: "Huang, Ying" Cc: Andrew Morton Cc: Michal Hocko Cc: Rik van Riel Cc: Mel Gorman Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Dave Hansen Cc: Dan Williams Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- mm/huge_memory.c | 17 ++++++++++++++++- mm/memory.c | 28 +++++++++++++++------------- mm/mprotect.c | 15 ++++++++++++++- 3 files changed, 45 insertions(+), 15 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d45de9b1ead9..8808e50ad921 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1558,6 +1558,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) if (unlikely(!pmd_same(pmd, *vmf->pmd))) goto out_unlock; + /* Only migrate if accessed twice */ + if (!pmd_young(*vmf->pmd)) + goto out_unlock; + /* * If there are potential migrations, wait for completion and retry * without disrupting NUMA hinting information. Do not relock and @@ -1978,8 +1982,19 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (is_huge_zero_pmd(*pmd)) goto unlock; - if (pmd_protnone(*pmd)) + if (pmd_protnone(*pmd)) { + if (!(sysctl_numa_balancing_mode & + NUMA_BALANCING_MEMORY_TIERING)) + goto unlock; + + /* + * PMD young bit is used to record whether the + * page is accessed in last scan period + */ + if (pmd_young(*pmd)) + set_pmd_at(mm, addr, pmd, pmd_mkold(*pmd)); goto unlock; + } page = pmd_page(*pmd); /* diff --git a/mm/memory.c b/mm/memory.c index 45442d9a4f52..afb4c55cb278 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3811,10 +3811,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) */ vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd); spin_lock(vmf->ptl); - if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) { - pte_unmap_unlock(vmf->pte, vmf->ptl); - goto out; - } + if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) + goto unmap_out; /* * Make it present again, Depending on how arch implementes non @@ -3828,17 +3826,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte); update_mmu_cache(vma, vmf->address, vmf->pte); + /* Only migrate if accessed twice */ + if (!pte_young(old_pte)) + goto unmap_out; + page = vm_normal_page(vma, vmf->address, pte); - if (!page) { - pte_unmap_unlock(vmf->pte, vmf->ptl); - return 0; - } + if (!page) + goto unmap_out; /* TODO: handle PTE-mapped THP */ - if (PageCompound(page)) { - pte_unmap_unlock(vmf->pte, vmf->ptl); - return 0; - } + if (PageCompound(page)) + goto unmap_out; /* * Avoid grouping on RO pages in general. 
RO pages shouldn't hurt as @@ -3876,10 +3874,14 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) } else flags |= TNF_MIGRATE_FAIL; -out: if (page_nid != NUMA_NO_NODE) task_numa_fault(last_cpupid, page_nid, 1, flags); return 0; + +unmap_out: + pte_unmap_unlock(vmf->pte, vmf->ptl); +out: + return 0; } static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf) diff --git a/mm/mprotect.c b/mm/mprotect.c index 7322c98284ac..1948105d23d5 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -83,8 +83,21 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, int nid; /* Avoid TLB flush if possible */ - if (pte_protnone(oldpte)) + if (pte_protnone(oldpte)) { + if (!(sysctl_numa_balancing_mode & + NUMA_BALANCING_MEMORY_TIERING)) + continue; + + /* + * PTE young bit is used to record + * whether the page is accessed in + * last scan period + */ + if (pte_young(oldpte)) + set_pte_at(vma->vm_mm, addr, pte, + pte_mkold(oldpte)); continue; + } page = vm_normal_page(vma, addr, oldpte); if (!page || PageKsm(page)) From patchwork Tue Feb 18 08:26:32 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Ying" X-Patchwork-Id: 11387991 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6E71A13A4 for ; Tue, 18 Feb 2020 08:27:38 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 2657D2464E for ; Tue, 18 Feb 2020 08:27:38 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 2657D2464E Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 5DB2E6B000E; Tue, 18 Feb 2020 03:27:36 -0500 (EST) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 58D9E6B0010; Tue, 18 Feb 2020 03:27:36 -0500 (EST) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 4A3066B0032; Tue, 18 Feb 2020 03:27:36 -0500 (EST) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0037.hostedemail.com [216.40.44.37]) by kanga.kvack.org (Postfix) with ESMTP id 2E9A36B000E for ; Tue, 18 Feb 2020 03:27:36 -0500 (EST) Received: from smtpin13.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id B8B4F40D3 for ; Tue, 18 Feb 2020 08:27:35 +0000 (UTC) X-FDA: 76502568870.13.wound81_423d4887a4726 X-Spam-Summary: 1,0,0,,d41d8cd98f00b204,ying.huang@intel.com,:peterz@infradead.org:mingo@kernel.org::linux-kernel@vger.kernel.org:feng.tang@intel.com:ying.huang@intel.com:jianshi.zhou@intel.com:fengguang.wu@intel.com:akpm@linux-foundation.org:mhocko@suse.com:riel@redhat.com:mgorman@suse.de:dave.hansen@linux.intel.com:dan.j.williams@intel.com,RULES_HIT:30003:30034:30036:30051:30054:30055:30064:30070:30075:30080:30090,0,RBL:134.134.136.100:@intel.com:.lbl8.mailshell.net-62.18.0.100 64.95.201.95,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:ft,MSBL:0,DNSBL:none,Custom_rules:0:0:0,LFtime:1,LUA_SUMMARY:none X-HE-Tag: wound81_423d4887a4726 X-Filterd-Recvd-Size: 19403 Received: from mga07.intel.com (mga07.intel.com 
From: "Huang, Ying" <ying.huang@intel.com>
To: Peter Zijlstra, Ingo Molnar
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Feng Tang, Huang Ying,
    "Zhou, Jianshi", Fengguang Wu, Andrew Morton, Michal Hocko,
    Rik van Riel, Mel Gorman, Dave Hansen, Dan Williams
Subject: [RFC -V2 6/8] autonuma, memory tiering: Select hotter pages to promote to fast memory node
Date: Tue, 18 Feb 2020 16:26:32 +0800
Message-Id: <20200218082634.1596727-7-ying.huang@intel.com>
In-Reply-To: <20200218082634.1596727-1-ying.huang@intel.com>
References: <20200218082634.1596727-1-ying.huang@intel.com>

From: Huang Ying

In a memory tiering system, to maximize the overall system performance, the hot pages should be put in the fast memory node while the cold pages should be put in the slow memory node. In the original memory tiering autonuma implementation, we try to promote almost all recently accessed pages, and rely on the LRU algorithm in page reclaiming to keep the hot pages in the fast memory node and demote the cold pages to the slow memory node. The problem with this solution is that cold pages with a low access frequency may be promoted and then demoted, which wastes memory bandwidth. And because migration is rate-limited, the hot pages need to compete with the cold pages for the limited migration bandwidth. If we could select the hotter pages to promote to the fast memory node in the first place, the wasted migration bandwidth would be reduced and the hot pages would be promoted more quickly.

The patch "autonuma, memory tiering: Only promote page if accessed twice" in this series prevents the really cold pages that were not accessed in the last scan period from being promoted. But the scan period can be as long as tens of seconds, so it doesn't work well enough at selecting the hotter pages.

To identify the hotter pages, this patch implements a method suggested by Jianshi and Fengguang, which is based on autonuma page table scanning and hint page faults as follows,

- When a range of the page table is scanned by autonuma, the timestamp and the address range are recorded in a ring buffer in struct mm_struct. So we have information about the recent N scans.

- When an autonuma hint page fault occurs, the fault address is looked up in the ring buffer to get its scanning timestamp. The hint page fault latency is defined as

    hint page fault timestamp - scan timestamp

If the access frequency of a page is higher, the probability that its hint page fault latency is shorter is higher too. So the hint page fault latency is a good estimation of the page heat.

The size of the ring buffer should cover a reasonably long NUMA scanning history. From task_scan_min(), the minimal interval between task_numa_work() runs is about 100 ms by default. So we can keep 1600 ms of history by default if the size is set to 16. If the user chooses a smaller sysctl_numa_balancing_scan_size, we can only keep a shorter history. In general, we want to keep no less than 1000 ms of history, so 16 seems a reasonable choice.

The remaining problem is how to determine the hot threshold. It's not easy to do automatically, so we provide a sysctl knob: kernel.numa_balancing_hot_threshold_ms. All pages with a hint page fault latency below the threshold are considered hot. The system administrator can determine the hot threshold from various information, such as the PMEM bandwidth limit, the average number of pages that pass the hot threshold, etc. The default hot threshold is 1 second, which works well in our performance test.

The patch improves the score of the pmbench memory accessing benchmark with an 80:20 read/write ratio and a normal access address distribution by 9.2%, with 50.3% fewer NUMA page migrations, on a 2-socket Intel server with Optane DC Persistent Memory. That is, the cost of autonuma page migration is reduced considerably.

The downside of the patch is that the response time to a workload hot spot change may be much longer. For example,

- A previously cold memory area becomes hot.

- A hint page fault will be triggered, but the hint page fault latency may not be shorter than the hot threshold, so the pages may not be promoted.

- When the memory area is scanned again, maybe after a scan period, the hint page fault latency measured will be shorter than the hot threshold and the pages will be promoted.

To mitigate this,

- If there is enough free space in the fast memory node (> high watermark + 2 * promotion rate limit), the hot threshold is not used and all pages are promoted upon the hint page fault, for fast response.

- If fast response is more important for system performance, the administrator can set a higher hot threshold.
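To make the latency bookkeeping concrete, here is a small self-contained user-space approximation of the scan-history ring buffer described above. It is an illustration only; the kernel version in the diff below keeps jiffies and scan start offsets in struct mm_struct and derives each window's end from the neighbouring entry.

#include <stdio.h>
#include <time.h>

#define NR_HIST 16

struct scan_hist {
	int idx;                        /* next slot to overwrite */
	unsigned long starts[NR_HIST];  /* scanned range start addresses */
	unsigned long ends[NR_HIST];    /* scanned range end addresses */
	time_t stamps[NR_HIST];         /* when each range was scanned */
};

/* Record one scanned address range together with its timestamp. */
static void record_scan(struct scan_hist *h, unsigned long start,
			unsigned long end)
{
	h->starts[h->idx] = start;
	h->ends[h->idx] = end;
	h->stamps[h->idx] = time(NULL);
	h->idx = (h->idx + 1) % NR_HIST;
}

/*
 * "Hint page fault latency": seconds since the most recent recorded scan
 * that covered @addr, or -1 if the history no longer covers it.
 */
static long hint_fault_latency(const struct scan_hist *h, unsigned long addr)
{
	int i, j;

	for (j = 1; j <= NR_HIST; j++) {
		i = (h->idx - j + NR_HIST) % NR_HIST;
		if (addr >= h->starts[i] && addr < h->ends[i])
			return (long)(time(NULL) - h->stamps[i]);
	}
	return -1;
}

int main(void)
{
	struct scan_hist h = { 0 };

	record_scan(&h, 0x1000, 0x5000);    /* pretend a scan just ran */
	printf("latency: %lds\n", hint_fault_latency(&h, 0x2000));
	return 0;
}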
Signed-off-by: "Huang, Ying" Suggested-by: "Zhou, Jianshi" Suggested-by: Fengguang Wu Cc: Andrew Morton Cc: Michal Hocko Cc: Rik van Riel Cc: Mel Gorman Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Dave Hansen Cc: Dan Williams Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/mempolicy.h | 5 +- include/linux/mm_types.h | 9 ++++ include/linux/sched/numa_balancing.h | 8 ++- include/linux/sched/sysctl.h | 1 + kernel/sched/fair.c | 78 +++++++++++++++++++++++++--- kernel/sysctl.c | 7 +++ mm/huge_memory.c | 6 +-- mm/memory.c | 7 ++- mm/mempolicy.c | 7 ++- 9 files changed, 108 insertions(+), 20 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 5228c62af416..674aaa7614ed 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -202,7 +202,8 @@ static inline bool vma_migratable(struct vm_area_struct *vma) return true; } -extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long); +extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long, + int flags); extern void mpol_put_task_policy(struct task_struct *); #else @@ -300,7 +301,7 @@ static inline int mpol_parse_str(char *str, struct mempolicy **mpol) #endif static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma, - unsigned long address) + unsigned long address, int flags) { return -1; /* no node preference */ } diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 270aa8fd2800..2fed3d92bbc1 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -508,6 +508,15 @@ struct mm_struct { /* numa_scan_seq prevents two threads setting pte_numa */ int numa_scan_seq; + + /* + * Keep 1600ms history of NUMA scanning, when default + * 100ms minimal scanning interval is used. + */ +#define NUMA_SCAN_NR_HIST 16 + int numa_scan_idx; + unsigned long numa_scan_jiffies[NUMA_SCAN_NR_HIST]; + unsigned long numa_scan_starts[NUMA_SCAN_NR_HIST]; #endif /* * An operation with batched TLB flushing is going on. 
Anything diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h index 3988762efe15..4899ec000245 100644 --- a/include/linux/sched/numa_balancing.h +++ b/include/linux/sched/numa_balancing.h @@ -14,6 +14,7 @@ #define TNF_SHARED 0x04 #define TNF_FAULT_LOCAL 0x08 #define TNF_MIGRATE_FAIL 0x10 +#define TNF_YOUNG 0x20 #ifdef CONFIG_NUMA_BALANCING extern void task_numa_fault(int last_node, int node, int pages, int flags); @@ -21,7 +22,8 @@ extern pid_t task_numa_group_id(struct task_struct *p); extern void set_numabalancing_state(bool enabled); extern void task_numa_free(struct task_struct *p, bool final); extern bool should_numa_migrate_memory(struct task_struct *p, struct page *page, - int src_nid, int dst_cpu); + int src_nid, int dst_cpu, + unsigned long addr, int flags); #else static inline void task_numa_fault(int last_node, int node, int pages, int flags) @@ -38,7 +40,9 @@ static inline void task_numa_free(struct task_struct *p, bool final) { } static inline bool should_numa_migrate_memory(struct task_struct *p, - struct page *page, int src_nid, int dst_cpu) + struct page *page, int src_nid, + int dst_cpu, unsigned long addr, + int flags) { return true; } diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index c4b27790b901..c207709ff498 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -42,6 +42,7 @@ extern unsigned int sysctl_numa_balancing_scan_delay; extern unsigned int sysctl_numa_balancing_scan_period_min; extern unsigned int sysctl_numa_balancing_scan_period_max; extern unsigned int sysctl_numa_balancing_scan_size; +extern unsigned int sysctl_numa_balancing_hot_threshold; #ifdef CONFIG_NUMA_BALANCING extern unsigned int sysctl_numa_balancing_rate_limit; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ef694816150b..773f3220efc4 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1070,6 +1070,9 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000; */ unsigned int sysctl_numa_balancing_rate_limit; +/* The page with hint page fault latency < threshold in ms is considered hot */ +unsigned int sysctl_numa_balancing_hot_threshold = 1000; + struct numa_group { refcount_t refcount; @@ -1430,6 +1433,43 @@ static bool pgdat_free_space_enough(struct pglist_data *pgdat) return false; } +static long numa_hint_fault_latency(struct task_struct *p, unsigned long addr) +{ + struct mm_struct *mm = p->mm; + unsigned long now = jiffies; + unsigned long start, end; + int i, j; + long latency = 0; + + /* + * Paired with smp_store_release() in task_numa_work() to check + * scan range buffer after get current index + */ + i = smp_load_acquire(&mm->numa_scan_idx); + i = (i - 1) % NUMA_SCAN_NR_HIST; + + end = READ_ONCE(mm->numa_scan_offset); + start = READ_ONCE(mm->numa_scan_starts[i]); + if (start == end) + end = start + MAX_SCAN_WINDOW * (1UL << 22); + for (j = 0; j < NUMA_SCAN_NR_HIST; j++) { + latency = now - READ_ONCE(mm->numa_scan_jiffies[i]); + start = READ_ONCE(mm->numa_scan_starts[i]); + /* Scan pass the end of address space */ + if (end < start) + end = TASK_SIZE; + if (addr >= start && addr < end) + return latency; + end = start; + i = (i - 1) % NUMA_SCAN_NR_HIST; + } + /* + * The tracking window isn't large enough, approximate to the + * max latency in the tracking window. 
+ */ + return latency; +} + static bool numa_migration_check_rate_limit(struct pglist_data *pgdat, unsigned long rate_limit, int nr) { @@ -1448,7 +1488,8 @@ static bool numa_migration_check_rate_limit(struct pglist_data *pgdat, } bool should_numa_migrate_memory(struct task_struct *p, struct page * page, - int src_nid, int dst_cpu) + int src_nid, int dst_cpu, unsigned long addr, + int flags) { struct numa_group *ng = deref_curr_numa_group(p); int dst_nid = cpu_to_node(dst_cpu); @@ -1461,12 +1502,21 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page, if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING && next_promotion_node(src_nid) != -1) { struct pglist_data *pgdat; - unsigned long rate_limit; + unsigned long rate_limit, latency, th; pgdat = NODE_DATA(dst_nid); if (pgdat_free_space_enough(pgdat)) return true; + /* The page hasn't been accessed in the last scan period */ + if (!(flags & TNF_YOUNG)) + return false; + + th = msecs_to_jiffies(sysctl_numa_balancing_hot_threshold); + latency = numa_hint_fault_latency(p, addr); + if (latency > th) + return false; + rate_limit = sysctl_numa_balancing_rate_limit << (20 - PAGE_SHIFT); return numa_migration_check_rate_limit(pgdat, rate_limit, @@ -2540,7 +2590,7 @@ static void reset_ptenuma_scan(struct task_struct *p) * expensive, to avoid any form of compiler optimizations: */ WRITE_ONCE(p->mm->numa_scan_seq, READ_ONCE(p->mm->numa_scan_seq) + 1); - p->mm->numa_scan_offset = 0; + WRITE_ONCE(p->mm->numa_scan_offset, 0); } /* @@ -2557,6 +2607,7 @@ static void task_numa_work(struct callback_head *work) unsigned long start, end; unsigned long nr_pte_updates = 0; long pages, virtpages; + int idx; SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work)); @@ -2615,6 +2666,15 @@ static void task_numa_work(struct callback_head *work) start = 0; vma = mm->mmap; } + idx = mm->numa_scan_idx; + WRITE_ONCE(mm->numa_scan_starts[idx], start); + WRITE_ONCE(mm->numa_scan_jiffies[idx], jiffies); + /* + * Paired with smp_load_acquire() in numa_hint_fault_latency() + * to update scan range buffer index after update the buffer + * contents. + */ + smp_store_release(&mm->numa_scan_idx, (idx + 1) % NUMA_SCAN_NR_HIST); for (; vma; vma = vma->vm_next) { if (!vma_migratable(vma) || !vma_policy_mof(vma) || is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) { @@ -2642,6 +2702,7 @@ static void task_numa_work(struct callback_head *work) start = max(start, vma->vm_start); end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE); end = min(end, vma->vm_end); + WRITE_ONCE(mm->numa_scan_offset, end); nr_pte_updates = change_prot_numa(vma, start, end); /* @@ -2671,9 +2732,7 @@ static void task_numa_work(struct callback_head *work) * would find the !migratable VMA on the next scan but not reset the * scanner to the start so check it now. 
*/ - if (vma) - mm->numa_scan_offset = start; - else + if (!vma) reset_ptenuma_scan(p); up_read(&mm->mmap_sem); @@ -2691,7 +2750,7 @@ static void task_numa_work(struct callback_head *work) void init_numa_balancing(unsigned long clone_flags, struct task_struct *p) { - int mm_users = 0; + int i, mm_users = 0; struct mm_struct *mm = p->mm; if (mm) { @@ -2699,6 +2758,11 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p) if (mm_users == 1) { mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay); mm->numa_scan_seq = 0; + mm->numa_scan_idx = 0; + for (i = 0; i < NUMA_SCAN_NR_HIST; i++) { + mm->numa_scan_jiffies[i] = 0; + mm->numa_scan_starts[i] = 0; + } } } p->node_stamp = 0; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 2d19e821267a..da1fc0303cca 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -427,6 +427,13 @@ static struct ctl_table kern_table[] = { .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ZERO, }, + { + .procname = "numa_balancing_hot_threshold_ms", + .data = &sysctl_numa_balancing_hot_threshold, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, { .procname = "numa_balancing", .data = &sysctl_numa_balancing_mode, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8808e50ad921..08d25763e65f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1559,8 +1559,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) goto out_unlock; /* Only migrate if accessed twice */ - if (!pmd_young(*vmf->pmd)) - goto out_unlock; + if (pmd_young(*vmf->pmd)) + flags |= TNF_YOUNG; /* * If there are potential migrations, wait for completion and retry @@ -1595,7 +1595,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) * page_table_lock if at all possible */ page_locked = trylock_page(page); - target_nid = mpol_misplaced(page, vma, haddr); + target_nid = mpol_misplaced(page, vma, haddr, flags); if (target_nid == NUMA_NO_NODE) { /* If the page was locked, there are no parallel migrations */ if (page_locked) diff --git a/mm/memory.c b/mm/memory.c index afb4c55cb278..207caa9e61da 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3789,7 +3789,7 @@ static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, *flags |= TNF_FAULT_LOCAL; } - return mpol_misplaced(page, vma, addr); + return mpol_misplaced(page, vma, addr, *flags); } static vm_fault_t do_numa_page(struct vm_fault *vmf) @@ -3826,9 +3826,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte); update_mmu_cache(vma, vmf->address, vmf->pte); - /* Only migrate if accessed twice */ - if (!pte_young(old_pte)) - goto unmap_out; + if (pte_young(old_pte)) + flags |= TNF_YOUNG; page = vm_normal_page(vma, vmf->address, pte); if (!page) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 22b4d1a0ea53..4f9301195de5 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2394,6 +2394,7 @@ static void sp_free(struct sp_node *n) * @page: page to be checked * @vma: vm area where page mapped * @addr: virtual address where page mapped + * @flags: numa balancing flags * * Lookup current policy node id for vma,addr and "compare to" page's * node id. @@ -2405,7 +2406,8 @@ static void sp_free(struct sp_node *n) * Policy determination "mimics" alloc_page_vma(). * Called from fault path where we know the vma and faulting address. 
*/ -int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr) +int mpol_misplaced(struct page *page, struct vm_area_struct *vma, + unsigned long addr, int flags) { struct mempolicy *pol; struct zoneref *z; @@ -2459,7 +2461,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long if (pol->flags & MPOL_F_MORON) { polnid = thisnid; - if (!should_numa_migrate_memory(current, page, curnid, thiscpu)) + if (!should_numa_migrate_memory(current, page, curnid, + thiscpu, addr, flags)) goto out; }
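For reference, the scan-history bookkeeping and the latency lookup above can be modelled outside the kernel. The following is a minimal user-space C sketch, not kernel code: task_numa_work() records the start address and timestamp of each scan window in a 16-entry ring buffer, and on a hint page fault numa_hint_fault_latency() walks the ring backwards to find the window that covered the faulting address and reports how long ago it was scanned. The names scan_hist, record_scan and fault_latency are made up for illustration; the edge cases handled by the patch (empty windows, scans wrapping past the end of the address space) are omitted, and the index decrement uses (i - 1 + NR_HIST) % NR_HIST so it stays non-negative for a plain C int.

#include <stdio.h>

#define NR_HIST 16                      /* mirrors NUMA_SCAN_NR_HIST */

/* Hypothetical stand-in for the per-mm scan history added above. */
struct scan_hist {
        int idx;                        /* next slot to fill */
        unsigned long start[NR_HIST];   /* start address of each scan window */
        unsigned long stamp[NR_HIST];   /* time (ms) the window was scanned */
        unsigned long offset;           /* end of the most recent window */
};

/* Scanner side: remember where and when a scan window began. */
static void record_scan(struct scan_hist *h, unsigned long start, unsigned long now)
{
        h->start[h->idx] = start;
        h->stamp[h->idx] = now;
        h->idx = (h->idx + 1) % NR_HIST;
}

/*
 * Fault side: walk the history backwards until the window containing
 * addr is found, then return "now - time that window was scanned".
 */
static long fault_latency(struct scan_hist *h, unsigned long addr, unsigned long now)
{
        int i = (h->idx - 1 + NR_HIST) % NR_HIST;
        unsigned long end = h->offset;
        long latency = 0;

        for (int j = 0; j < NR_HIST; j++) {
                unsigned long start = h->start[i];

                latency = now - h->stamp[i];
                if (addr >= start && addr < end)
                        return latency;
                end = start;
                i = (i - 1 + NR_HIST) % NR_HIST;
        }
        return latency;                 /* history exhausted: report the oldest latency */
}

int main(void)
{
        struct scan_hist h = { 0 };
        unsigned long mb = 1024 * 1024;
        unsigned long now = 0;

        for (int w = 0; w < 4; w++) {   /* four scan windows, 100 ms apart */
                now += 100;
                record_scan(&h, w * mb, now);
        }
        h.offset = 4 * mb;              /* the last window ended at 4 MB */
        now += 100;                     /* a hint fault arrives 100 ms later */

        printf("latency at 3.5 MB: %ld ms\n", fault_latency(&h, 3 * mb + mb / 2, now)); /* 100 */
        printf("latency at 0.5 MB: %ld ms\n", fault_latency(&h, mb / 2, now));          /* 400 */
        return 0;
}

In the kernel, the returned latency (in jiffies) is then compared against numa_balancing_hot_threshold_ms, 1000 ms by default, inside should_numa_migrate_memory().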
From patchwork Tue Feb 18 08:26:33 2020
X-Patchwork-Submitter: "Huang, Ying" X-Patchwork-Id: 11387993
From: "Huang, Ying" To: Peter Zijlstra , Ingo Molnar Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Feng Tang , Huang Ying , Andrew Morton , Michal Hocko , Rik van Riel , Mel Gorman , Dave Hansen , Dan Williams Subject: [RFC -V2 7/8] autonuma, memory tiering: Double hot threshold for write hint page fault Date: Tue, 18 Feb 2020 16:26:33 +0800 Message-Id: <20200218082634.1596727-8-ying.huang@intel.com> In-Reply-To: <20200218082634.1596727-1-ying.huang@intel.com> References: <20200218082634.1596727-1-ying.huang@intel.com>
From: Huang Ying
The write performance of PMEM is much worse than its read performance. So even if a write-mostly page is colder than a read-mostly page, it is usually better to keep the write-mostly page in DRAM and the read-mostly page in PMEM. To give write-mostly pages more opportunity to be promoted to DRAM, this patch doubles the hot threshold for write hint page faults, making such pages easier to promote.
Signed-off-by: "Huang, Ying" Cc: Andrew Morton Cc: Michal Hocko Cc: Rik van Riel Cc: Mel Gorman Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Dave Hansen Cc: Dan Williams Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/sched/numa_balancing.h | 1 + kernel/sched/fair.c | 2 ++ mm/memory.c | 3 +++ 3 files changed, 6 insertions(+) diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h index 4899ec000245..518b3d6143ba 100644 --- a/include/linux/sched/numa_balancing.h +++ b/include/linux/sched/numa_balancing.h @@ -15,6 +15,7 @@ #define TNF_FAULT_LOCAL 0x08 #define TNF_MIGRATE_FAIL 0x10 #define TNF_YOUNG 0x20 +#define TNF_WRITE 0x40 #ifdef CONFIG_NUMA_BALANCING extern void task_numa_fault(int last_node, int node, int pages, int flags); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 773f3220efc4..e5f7f4139c82 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1513,6 +1513,8 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page, return false; th = msecs_to_jiffies(sysctl_numa_balancing_hot_threshold); + if (flags & TNF_WRITE) + th *= 2; latency = numa_hint_fault_latency(p, addr); if (latency > th) return false; diff --git a/mm/memory.c b/mm/memory.c index 207caa9e61da..595d3cd62f61 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3829,6 +3829,9 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) if (pte_young(old_pte)) flags |= TNF_YOUNG; + if (vmf->flags & FAULT_FLAG_WRITE) + flags |= TNF_WRITE; + page = vm_normal_page(vma, vmf->address, pte); if (!page) goto unmap_out;
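As a quick restatement of the resulting policy (a sketch, not the kernel code, assuming the default numa_balancing_hot_threshold_ms of 1000): the page must still be young, and its hint fault latency is compared against a threshold that is twice as large when the fault is a write. The helper name hot_enough is hypothetical.

#include <stdbool.h>

#define TNF_YOUNG 0x20
#define TNF_WRITE 0x40

/*
 * Promotion check as it looks after this patch: the page must have been
 * accessed since the last scan (TNF_YOUNG), and its hint fault latency
 * must be within the hot threshold, which is doubled for write faults
 * because writes are much more expensive on PMEM.
 */
static bool hot_enough(int flags, unsigned long latency_ms, unsigned long th_ms)
{
        if (!(flags & TNF_YOUNG))
                return false;
        if (flags & TNF_WRITE)
                th_ms *= 2;
        return latency_ms <= th_ms;
}

With the default threshold, hot_enough(TNF_YOUNG, 1500, 1000) is false while hot_enough(TNF_YOUNG | TNF_WRITE, 1500, 1000) is true: a page written at that rate is still promoted, while a page only read at that rate is not.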
From patchwork Tue Feb 18 08:26:34 2020
X-Patchwork-Submitter: "Huang, Ying" X-Patchwork-Id: 11387995
From: "Huang, Ying" To: Peter Zijlstra , Ingo Molnar Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Feng Tang , Huang Ying , Andrew Morton , Michal Hocko , Rik van Riel , Mel Gorman , Dave Hansen , Dan Williams Subject: [RFC -V2 8/8] autonuma, memory tiering: Adjust hot threshold automatically Date: Tue, 18 Feb 2020 16:26:34 +0800 Message-Id: <20200218082634.1596727-9-ying.huang@intel.com> In-Reply-To: <20200218082634.1596727-1-ying.huang@intel.com> References: <20200218082634.1596727-1-ying.huang@intel.com>
From: Huang Ying
It isn't easy for the administrator to determine the hot threshold, so this patch implements a method to adjust the hot threshold automatically. The basic idea is to control the number of candidate promotion pages so that it matches the promotion rate limit. If the hint page fault latency of a page is less than the hot threshold, we will try to promote the page; that is, the page is a candidate promotion page. If the number of candidate promotion pages in a statistics interval is much higher than the promotion rate limit, the hot threshold will be lowered to reduce the number of candidate promotion pages. Otherwise, the hot threshold will be raised to increase the number of candidate promotion pages.
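To put numbers on "match the promotion rate limit": the target number of candidate pages per statistics interval is the rate limit expressed in pages multiplied by the interval length, and the 110%/90% bands around that target decide whether the threshold moves. The sketch below works through one example; the 100 MB/s limit, 4 KB page size and 60 s interval are assumptions chosen for illustration, not values taken from the patch.

#include <stdio.h>

int main(void)
{
        unsigned long rate_limit_mbps = 100;      /* assumed promotion rate limit, MB/s */
        unsigned long page_shift = 12;            /* assumed 4 KB pages */
        unsigned long scan_period_max_ms = 60000; /* assumed statistics interval */

        /* rate limit in pages/s, as in rate_limit << (20 - PAGE_SHIFT) */
        unsigned long rate_limit_pages = rate_limit_mbps << (20 - page_shift);
        /* target candidates per interval, as ref_try in the fair.c hunk below */
        unsigned long ref_try = rate_limit_pages * scan_period_max_ms / 1000;

        printf("target candidates per interval: %lu\n", ref_try);               /* 1536000 */
        printf("lower the threshold if candidates > %lu\n", ref_try * 11 / 10); /* 1689600 */
        printf("raise the threshold if candidates < %lu\n", ref_try * 9 / 10);  /* 1382400 */
        return 0;
}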
To make the above method work, the total number of pages checked (those on which hint page faults occur) and the hot/cold distribution need to be stable within each statistics interval. Because the page tables are scanned linearly in autonuma while the hot/cold distribution isn't uniform along the address space, the statistics interval should be larger than the autonuma scan period. So in this patch the max scan period is used as the statistics interval, and it works well in our tests. The sysctl knob kernel.numa_balancing_hot_threshold_ms becomes the initial and maximum value of the hot threshold. The patch improves the score of the pmbench memory accessing benchmark with an 80:20 read/write ratio and a normal access address distribution by 5.5%, with 24.6% fewer NUMA page migrations, on a 2-socket Intel server with Optane DC Persistent Memory, because it improves the accuracy of hot page selection.
Signed-off-by: "Huang, Ying" Cc: Andrew Morton Cc: Michal Hocko Cc: Rik van Riel Cc: Mel Gorman Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Dave Hansen Cc: Dan Williams Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/mmzone.h | 3 +++ kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++++++---- 2 files changed, 39 insertions(+), 4 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 6e7a28becdc2..4ed82eb5c8b5 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -792,6 +792,9 @@ typedef struct pglist_data { #ifdef CONFIG_NUMA_BALANCING unsigned long numa_ts; unsigned long numa_try; + unsigned long numa_threshold_ts; + unsigned long numa_threshold_try; + unsigned long numa_threshold; #endif /* Fields commonly accessed by the page reclaim scanner */ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e5f7f4139c82..90098c35d336 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1487,6 +1487,35 @@ static bool numa_migration_check_rate_limit(struct pglist_data *pgdat, return true; } +#define NUMA_MIGRATION_ADJUST_STEPS 16 + +static void numa_migration_adjust_threshold(struct pglist_data *pgdat, + unsigned long rate_limit, + unsigned long ref_th) +{ + unsigned long now = jiffies, last_th_ts, th_period; + unsigned long unit_th, th; + unsigned long try, ref_try, tdiff; + + th_period = msecs_to_jiffies(sysctl_numa_balancing_scan_period_max); + last_th_ts = pgdat->numa_threshold_ts; + if (now > last_th_ts + th_period && + cmpxchg(&pgdat->numa_threshold_ts, last_th_ts, now) == last_th_ts) { + ref_try = rate_limit * + sysctl_numa_balancing_scan_period_max / 1000; + try = node_page_state(pgdat, NUMA_TRY_MIGRATE); + tdiff = try - pgdat->numa_threshold_try; + unit_th = ref_th / NUMA_MIGRATION_ADJUST_STEPS; + th = pgdat->numa_threshold ?
: ref_th; + if (tdiff > ref_try * 11 / 10) + th = max(th - unit_th, unit_th); + else if (tdiff < ref_try * 9 / 10) + th = min(th + unit_th, ref_th); + pgdat->numa_threshold_try = try; + pgdat->numa_threshold = th; + } +} + bool should_numa_migrate_memory(struct task_struct *p, struct page * page, int src_nid, int dst_cpu, unsigned long addr, int flags) @@ -1502,7 +1531,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page, if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING && next_promotion_node(src_nid) != -1) { struct pglist_data *pgdat; - unsigned long rate_limit, latency, th; + unsigned long rate_limit, latency, th, def_th; pgdat = NODE_DATA(dst_nid); if (pgdat_free_space_enough(pgdat)) @@ -1512,15 +1541,18 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page, if (!(flags & TNF_YOUNG)) return false; - th = msecs_to_jiffies(sysctl_numa_balancing_hot_threshold); + def_th = msecs_to_jiffies(sysctl_numa_balancing_hot_threshold); + rate_limit = + sysctl_numa_balancing_rate_limit << (20 - PAGE_SHIFT); + numa_migration_adjust_threshold(pgdat, rate_limit, def_th); + + th = pgdat->numa_threshold ? : def_th; if (flags & TNF_WRITE) th *= 2; latency = numa_hint_fault_latency(p, addr); if (latency > th) return false; - rate_limit = - sysctl_numa_balancing_rate_limit << (20 - PAGE_SHIFT); return numa_migration_check_rate_limit(pgdat, rate_limit, hpage_nr_pages(page)); }
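Finally, the adjustment step itself can be restated outside the kernel. The sketch below is a simplified user-space version of the numa_migration_adjust_threshold() logic above: the threshold starts at the configured reference value, moves in steps of one sixteenth of it, is clamped between one step and the reference value, and is lowered when an interval saw more than 110% of the target candidates and raised when it saw fewer than 90%. The variable names and the small simulation in main() are illustrative only.

#include <stdio.h>

#define ADJUST_STEPS 16                 /* mirrors NUMA_MIGRATION_ADJUST_STEPS */

/*
 * One adjustment step: th is the current threshold (0 means "not yet
 * initialized"), ref_th the configured maximum, tdiff the candidate
 * pages seen in this interval and ref_try the target number.
 */
static unsigned long adjust_threshold(unsigned long th, unsigned long ref_th,
                                      unsigned long tdiff, unsigned long ref_try)
{
        unsigned long unit_th = ref_th / ADJUST_STEPS;

        if (!th)
                th = ref_th;
        if (tdiff > ref_try * 11 / 10)          /* too many candidates: tighten */
                th = (th > 2 * unit_th) ? th - unit_th : unit_th;
        else if (tdiff < ref_try * 9 / 10)      /* too few candidates: relax */
                th = (th + unit_th < ref_th) ? th + unit_th : ref_th;
        return th;
}

int main(void)
{
        unsigned long ref_th = 1000;            /* hot threshold, ms */
        unsigned long ref_try = 1536000;        /* target candidates per interval */
        unsigned long th = 0;

        /* A workload producing twice the target walks the threshold down
         * toward the minimum step of ref_th / 16. */
        for (int i = 0; i < 18; i++)
                th = adjust_threshold(th, ref_th, 2 * ref_try, ref_try);
        printf("threshold after sustained overshoot: %lu ms\n", th);
        return 0;
}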