From patchwork Tue Jul 16 03:39:29 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?b?546L6LSH?= <yun.wang@linux.alibaba.com>
X-Patchwork-Id: 11045191
Return-Path: <linux-fsdevel-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
 [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 65A7F1510
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Tue, 16 Jul 2019 03:39:52 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 55FB328451
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Tue, 16 Jul 2019 03:39:52 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 4A11B28553; Tue, 16 Jul 2019 03:39:52 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,FROM_EXCESS_BASE64,
	MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI,TVD_PH_BODY_ACCOUNTS_PRE,
	UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 904F528451
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Tue, 16 Jul 2019 03:39:51 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1730964AbfGPDjp (ORCPT
        <rfc822;patchwork-linux-fsdevel@patchwork.kernel.org>);
        Mon, 15 Jul 2019 23:39:45 -0400
Received: from out4436.biz.mail.alibaba.com ([47.88.44.36]:19655 "EHLO
        out4436.biz.mail.alibaba.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1729574AbfGPDjp (ORCPT
        <rfc822;linux-fsdevel@vger.kernel.org>);
        Mon, 15 Jul 2019 23:39:45 -0400
X-Alimail-AntiSpam: 
 AC=PASS;BC=-1|-1;BR=01201311R201e4;CH=green;DM=||false|;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e07417;MF=yun.wang@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0TX1Yc3G_1563248369;
Received: from testdeMacBook-Pro.local(mailfrom:yun.wang@linux.alibaba.com
 fp:SMTPD_---0TX1Yc3G_1563248369)
          by smtp.aliyun-inc.com(127.0.0.1);
          Tue, 16 Jul 2019 11:39:30 +0800
Subject: [PATCH v2 1/4] numa: introduce per-cgroup numa balancing locality
 statistic
From: =?utf-8?b?546L6LSH?= <yun.wang@linux.alibaba.com>
To: Peter Zijlstra <peterz@infradead.org>, hannes@cmpxchg.org,
        mhocko@kernel.org, vdavydov.dev@gmail.com,
        Ingo Molnar <mingo@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, mcgrof@kernel.org,
 keescook@chromium.org, linux-fsdevel@vger.kernel.org,
 cgroups@vger.kernel.org, =?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Hillf Danton <hdanton@sina.com>
References: <209d247e-c1b2-3235-2722-dd7c1f896483@linux.alibaba.com>
 <60b59306-5e36-e587-9145-e90657daec41@linux.alibaba.com>
 <65c1987f-bcce-2165-8c30-cf8cf3454591@linux.alibaba.com>
Message-ID: <120ffcaa-0281-5d30-c0c1-9464d93e935f@linux.alibaba.com>
Date: Tue, 16 Jul 2019 11:39:29 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0)
 Gecko/20100101 Thunderbird/60.7.0
MIME-Version: 1.0
In-Reply-To: <65c1987f-bcce-2165-8c30-cf8cf3454591@linux.alibaba.com>
Content-Language: en-US
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-fsdevel.vger.kernel.org>
X-Mailing-List: linux-fsdevel@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

This patch introduced numa locality statistic, which try to imply
the numa balancing efficiency per memory cgroup.

On numa balancing, we trace the local page accessing ratio of tasks,
which we call the locality.

By doing 'cat /sys/fs/cgroup/cpu/CGROUP_PATH/cpu.numa_stat', we
see output line heading with 'locality', like:

  locality 15393 21259 13023 44461 21247 17012 28496 145402

locality divided into 8 regions, each number standing for the micro
seconds we hit a task running with the locality within that region,
for example here we have tasks with locality around 0~12% running for
15393 ms, and tasks with locality around 88~100% running for 145402 ms.

By monitoring the increment, we can check if the workloads of a
particular cgroup is doing well with numa, when most of the tasks are
running in low locality region, then something is wrong with your numa
policy.

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
Since v1:
  * move implementation from memory cgroup into cpu group
  * introduce new entry 'numa_stat' to present locality
  * locality now accounting in hierarchical way
  * locality now accounted into 8 regions equally

 include/linux/sched.h |  8 +++++++-
 kernel/sched/core.c   | 40 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/debug.c  |  7 +++++++
 kernel/sched/fair.c   | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h  | 29 +++++++++++++++++++++++++++++
 5 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 907808f1acc5..eb26098de6ea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1117,8 +1117,14 @@ struct task_struct {
 	 * scan window were remote/local or failed to migrate. The task scan
 	 * period is adapted based on the locality of the faults with different
 	 * weights depending on whether they were shared or private faults
+	 *
+	 * 0 -- remote faults
+	 * 1 -- local faults
+	 * 2 -- page migration failure
+	 * 3 -- remote page accessing
+	 * 4 -- local page accessing
 	 */
-	unsigned long			numa_faults_locality[3];
+	unsigned long			numa_faults_locality[5];

 	unsigned long			numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fa43ce3962e7..71a8d3ed8495 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6367,6 +6367,10 @@ static struct kmem_cache *task_group_cache __read_mostly;
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
 DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);

+#ifdef CONFIG_NUMA_BALANCING
+DECLARE_PER_CPU(struct numa_stat, root_numa_stat);
+#endif
+
 void __init sched_init(void)
 {
 	unsigned long alloc_size = 0, ptr;
@@ -6416,6 +6420,10 @@ void __init sched_init(void)
 	init_defrootdomain();
 #endif

+#ifdef CONFIG_NUMA_BALANCING
+	root_task_group.numa_stat = &root_numa_stat;
+#endif
+
 #ifdef CONFIG_RT_GROUP_SCHED
 	init_rt_bandwidth(&root_task_group.rt_bandwidth,
 			global_rt_period(), global_rt_runtime());
@@ -6727,6 +6735,7 @@ static DEFINE_SPINLOCK(task_group_lock);

 static void sched_free_group(struct task_group *tg)
 {
+	free_tg_numa_stat(tg);
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
 	autogroup_free(tg);
@@ -6742,6 +6751,9 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!tg)
 		return ERR_PTR(-ENOMEM);

+	if (!alloc_tg_numa_stat(tg))
+		goto err;
+
 	if (!alloc_fair_sched_group(tg, parent))
 		goto err;

@@ -7277,6 +7289,28 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */

+#ifdef CONFIG_NUMA_BALANCING
+static int cpu_numa_stat_show(struct seq_file *sf, void *v)
+{
+	int nr;
+	struct task_group *tg = css_tg(seq_css(sf));
+
+	seq_puts(sf, "locality");
+	for (nr = 0; nr < NR_NL_INTERVAL; nr++) {
+		int cpu;
+		u64 sum = 0;
+
+		for_each_possible_cpu(cpu)
+			sum += per_cpu(tg->numa_stat->locality[nr], cpu);
+
+		seq_printf(sf, " %u", jiffies_to_msecs(sum));
+	}
+	seq_putc(sf, '\n');
+
+	return 0;
+}
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -7312,6 +7346,12 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_rt_period_read_uint,
 		.write_u64 = cpu_rt_period_write_uint,
 	},
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	{
+		.name = "numa_stat",
+		.seq_show = cpu_numa_stat_show,
+	},
 #endif
 	{ }	/* Terminate */
 };
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f7e4579e746c..a22b2a62aee2 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -848,6 +848,13 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m)
 	P(total_numa_faults);
 	SEQ_printf(m, "current_node=%d, numa_group_id=%d\n",
 			task_node(p), task_numa_group_id(p));
+	SEQ_printf(m, "faults_locality local=%lu remote=%lu failed=%lu ",
+			p->numa_faults_locality[1],
+			p->numa_faults_locality[0],
+			p->numa_faults_locality[2]);
+	SEQ_printf(m, "lhit=%lu rhit=%lu\n",
+			p->numa_faults_locality[4],
+			p->numa_faults_locality[3]);
 	show_numa_stats(p, m);
 	mpol_put(pol);
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 036be95a87e9..cd716355d70e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2449,6 +2449,12 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;
 	p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages;
 	p->numa_faults_locality[local] += pages;
+	/*
+	 * We want to have the real local/remote page access statistic
+	 * here, so use 'mem_node' which is the real residential node of
+	 * page after migrate_misplaced_page().
+	 */
+	p->numa_faults_locality[3 + !!(mem_node == numa_node_id())] += pages;
 }

 static void reset_ptenuma_scan(struct task_struct *p)
@@ -2611,6 +2617,47 @@ void task_numa_work(struct callback_head *work)
 	}
 }

+DEFINE_PER_CPU(struct numa_stat, root_numa_stat);
+
+int alloc_tg_numa_stat(struct task_group *tg)
+{
+	tg->numa_stat = alloc_percpu(struct numa_stat);
+	if (!tg->numa_stat)
+		return 0;
+
+	return 1;
+}
+
+void free_tg_numa_stat(struct task_group *tg)
+{
+	free_percpu(tg->numa_stat);
+}
+
+static void update_tg_numa_stat(struct task_struct *p)
+{
+	struct task_group *tg;
+	unsigned long remote = p->numa_faults_locality[3];
+	unsigned long local = p->numa_faults_locality[4];
+	int idx = -1;
+
+	/* Tobe scaled? */
+	if (remote || local)
+		idx = NR_NL_INTERVAL * local / (remote + local + 1);
+
+	rcu_read_lock();
+
+	tg = task_group(p);
+	while (tg) {
+		/* skip account when there are no faults records */
+		if (idx != -1)
+			this_cpu_inc(tg->numa_stat->locality[idx]);
+
+		tg = tg->parent;
+	}
+
+	rcu_read_unlock();
+}
+
 /*
  * Drive the periodic memory faults..
  */
@@ -2625,6 +2672,8 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
 		return;

+	update_tg_numa_stat(curr);
+
 	/*
 	 * Using runtime rather than walltime has the dual advantage that
 	 * we (mostly) drive the selection from busy threads and that the
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 802b1f3405f2..685a9e670880 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -353,6 +353,17 @@ struct cfs_bandwidth {
 #endif
 };

+#ifdef CONFIG_NUMA_BALANCING
+
+/* NUMA Locality Interval, 8 bucket for cache align */
+#define NR_NL_INTERVAL	8
+
+struct numa_stat {
+	u64 locality[NR_NL_INTERVAL];
+};
+
+#endif
+
 /* Task group related information */
 struct task_group {
 	struct cgroup_subsys_state css;
@@ -393,8 +404,26 @@ struct task_group {
 #endif

 	struct cfs_bandwidth	cfs_bandwidth;
+
+#ifdef CONFIG_NUMA_BALANCING
+	struct numa_stat __percpu *numa_stat;
+#endif
 };

+#ifdef CONFIG_NUMA_BALANCING
+int alloc_tg_numa_stat(struct task_group *tg);
+void free_tg_numa_stat(struct task_group *tg);
+#else
+static int alloc_tg_numa_stat(struct task_group *tg)
+{
+	return 1;
+}
+
+static void free_tg_numa_stat(struct task_group *tg)
+{
+}
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #define ROOT_TASK_GROUP_LOAD	NICE_0_LOAD


From patchwork Tue Jul 16 03:40:35 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?b?546L6LSH?= <yun.wang@linux.alibaba.com>
X-Patchwork-Id: 11045195
Return-Path: <linux-fsdevel-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
 [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 090F213B1
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Tue, 16 Jul 2019 03:40:47 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id ED17B2018E
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Tue, 16 Jul 2019 03:40:46 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id E00DC26E4A; Tue, 16 Jul 2019 03:40:46 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,FROM_EXCESS_BASE64,
	MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=unavailable
	version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 91E882018E
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Tue, 16 Jul 2019 03:40:46 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1731166AbfGPDkm (ORCPT
        <rfc822;patchwork-linux-fsdevel@patchwork.kernel.org>);
        Mon, 15 Jul 2019 23:40:42 -0400
Received: from out30-44.freemail.mail.aliyun.com ([115.124.30.44]:46248 "EHLO
        out30-44.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1730750AbfGPDkm (ORCPT
        <rfc822;linux-fsdevel@vger.kernel.org>);
        Mon, 15 Jul 2019 23:40:42 -0400
X-Alimail-AntiSpam: 
 AC=PASS;BC=-1|-1;BR=01201311R531e4;CH=green;DM=||false|;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e01422;MF=yun.wang@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0TX1TAFc_1563248435;
Received: from testdeMacBook-Pro.local(mailfrom:yun.wang@linux.alibaba.com
 fp:SMTPD_---0TX1TAFc_1563248435)
          by smtp.aliyun-inc.com(127.0.0.1);
          Tue, 16 Jul 2019 11:40:36 +0800
Subject: [PATCH v2 2/4] numa: append per-node execution time in cpu.numa_stat
From: =?utf-8?b?546L6LSH?= <yun.wang@linux.alibaba.com>
To: Peter Zijlstra <peterz@infradead.org>, hannes@cmpxchg.org,
        mhocko@kernel.org, vdavydov.dev@gmail.com,
        Ingo Molnar <mingo@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, mcgrof@kernel.org,
 keescook@chromium.org, linux-fsdevel@vger.kernel.org,
 cgroups@vger.kernel.org, =?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Hillf Danton <hdanton@sina.com>
References: <209d247e-c1b2-3235-2722-dd7c1f896483@linux.alibaba.com>
 <60b59306-5e36-e587-9145-e90657daec41@linux.alibaba.com>
 <65c1987f-bcce-2165-8c30-cf8cf3454591@linux.alibaba.com>
Message-ID: <6973a1bf-88f2-b54e-726d-8b7d95d80197@linux.alibaba.com>
Date: Tue, 16 Jul 2019 11:40:35 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0)
 Gecko/20100101 Thunderbird/60.7.0
MIME-Version: 1.0
In-Reply-To: <65c1987f-bcce-2165-8c30-cf8cf3454591@linux.alibaba.com>
Content-Language: en-US
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-fsdevel.vger.kernel.org>
X-Mailing-List: linux-fsdevel@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

This patch introduced numa execution time information, to imply the numa
efficiency.

By doing 'cat /sys/fs/cgroup/cpu/CGROUP_PATH/cpu.numa_stat', we see new
output line heading with 'exectime', like:

  exectime 311900 407166

which means the tasks of this cgroup executed 311900 micro seconds on
node 0, and 407166 ms on node 1.

Combined with the memory node info from memory cgroup, we can estimate
the numa efficiency, for example if the memory.numa_stat show:

  total=206892 N0=21933 N1=185171

By monitoring the increments, if the topology keep in this way and
locality is not nice, then it imply numa balancing can't help migrate
the memory from node 1 to 0 which is accessing by tasks on node 0, or
tasks can't migrate to node 1 for some reason, then you may consider
to bind the workloads on the cpus of node 1.

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
Since v1:
  * move implementation from memory cgroup into cpu group
  * exectime now accounting in hierarchical way
  * change member name into jiffies

 kernel/sched/core.c  | 12 ++++++++++++
 kernel/sched/fair.c  |  2 ++
 kernel/sched/sched.h |  1 +
 3 files changed, 15 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 71a8d3ed8495..f8aa73aa879b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7307,6 +7307,18 @@ static int cpu_numa_stat_show(struct seq_file *sf, void *v)
 	}
 	seq_putc(sf, '\n');

+	seq_puts(sf, "exectime");
+	for_each_online_node(nr) {
+		int cpu;
+		u64 sum = 0;
+
+		for_each_cpu(cpu, cpumask_of_node(nr))
+			sum += per_cpu(tg->numa_stat->jiffies, cpu);
+
+		seq_printf(sf, " %u", jiffies_to_msecs(sum));
+	}
+	seq_putc(sf, '\n');
+
 	return 0;
 }
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd716355d70e..2c362266af76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2652,6 +2652,8 @@ static void update_tg_numa_stat(struct task_struct *p)
 		if (idx != -1)
 			this_cpu_inc(tg->numa_stat->locality[idx]);

+		this_cpu_inc(tg->numa_stat->jiffies);
+
 		tg = tg->parent;
 	}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 685a9e670880..456f83f7f595 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -360,6 +360,7 @@ struct cfs_bandwidth {

 struct numa_stat {
 	u64 locality[NR_NL_INTERVAL];
+	u64 jiffies;
 };

 #endif

From patchwork Tue Jul 16 03:41:07 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?b?546L6LSH?= <yun.wang@linux.alibaba.com>
X-Patchwork-Id: 11045199
Return-Path: <linux-fsdevel-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
 [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0D7DE13B1
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Tue, 16 Jul 2019 03:41:24 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id F1BD528451
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Tue, 16 Jul 2019 03:41:23 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id E506828553; Tue, 16 Jul 2019 03:41:23 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,FROM_EXCESS_BASE64,
	MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=unavailable
	version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4F91F28451
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Tue, 16 Jul 2019 03:41:23 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1731114AbfGPDlT (ORCPT
        <rfc822;patchwork-linux-fsdevel@patchwork.kernel.org>);
        Mon, 15 Jul 2019 23:41:19 -0400
Received: from out30-131.freemail.mail.aliyun.com ([115.124.30.131]:43740
 "EHLO
        out30-131.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1730275AbfGPDlT (ORCPT
        <rfc822;linux-fsdevel@vger.kernel.org>);
        Mon, 15 Jul 2019 23:41:19 -0400
X-Alimail-AntiSpam: 
 AC=PASS;BC=-1|-1;BR=01201311R191e4;CH=green;DM=||false|;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04394;MF=yun.wang@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0TX1aoqx_1563248467;
Received: from testdeMacBook-Pro.local(mailfrom:yun.wang@linux.alibaba.com
 fp:SMTPD_---0TX1aoqx_1563248467)
          by smtp.aliyun-inc.com(127.0.0.1);
          Tue, 16 Jul 2019 11:41:07 +0800
Subject: [PATCH v2 3/4] numa: introduce numa group per task group
From: =?utf-8?b?546L6LSH?= <yun.wang@linux.alibaba.com>
To: Peter Zijlstra <peterz@infradead.org>, hannes@cmpxchg.org,
        mhocko@kernel.org, vdavydov.dev@gmail.com,
        Ingo Molnar <mingo@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, mcgrof@kernel.org,
 keescook@chromium.org, linux-fsdevel@vger.kernel.org,
 cgroups@vger.kernel.org, =?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Hillf Danton <hdanton@sina.com>
References: <209d247e-c1b2-3235-2722-dd7c1f896483@linux.alibaba.com>
 <60b59306-5e36-e587-9145-e90657daec41@linux.alibaba.com>
 <65c1987f-bcce-2165-8c30-cf8cf3454591@linux.alibaba.com>
Message-ID: <e91a257d-3936-68b5-4845-21bd93db6733@linux.alibaba.com>
Date: Tue, 16 Jul 2019 11:41:07 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0)
 Gecko/20100101 Thunderbird/60.7.0
MIME-Version: 1.0
In-Reply-To: <65c1987f-bcce-2165-8c30-cf8cf3454591@linux.alibaba.com>
Content-Language: en-US
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-fsdevel.vger.kernel.org>
X-Mailing-List: linux-fsdevel@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

By tracing numa page faults, we recognize tasks sharing the same page,
and try pack them together into a single numa group.

However when two task share lot's of cache pages while not much
anonymous pages, since numa balancing do not tracing cache page, they
have no chance to join into the same group.

While tracing cache page cost too much, we could use some hints from
userland and cpu cgroup could be a good one.

This patch introduced new entry 'numa_group' for cpu cgroup, by echo
non-zero into the entry, we can now force all the tasks of this cgroup
to join the same numa group serving for task group.

In this way tasks are more likely to settle down on the same node, to
share closer cpu cache and gain benefit from NUMA on both file/anonymous
pages.

Besides, when multiple cgroup enabled numa group, they will be able to
exchange task location by utilizing numa migration, in this way they
could achieve single node settle down without breaking load balance.

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
Since v1:
  * just rebase, no logical changes

 kernel/sched/core.c  |  33 ++++++++++
 kernel/sched/fair.c  | 175 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  11 ++++
 3 files changed, 218 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f8aa73aa879b..9f100c48d6e4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6802,6 +6802,8 @@ void sched_offline_group(struct task_group *tg)
 {
 	unsigned long flags;

+	update_tg_numa_group(tg, false);
+
 	/* End participation in shares distribution: */
 	unregister_fair_sched_group(tg);

@@ -7321,6 +7323,32 @@ static int cpu_numa_stat_show(struct seq_file *sf, void *v)

 	return 0;
 }
+
+static DEFINE_MUTEX(numa_mutex);
+
+static int cpu_numa_group_show(struct seq_file *sf, void *v)
+{
+	struct task_group *tg = css_tg(seq_css(sf));
+
+	mutex_lock(&numa_mutex);
+	show_tg_numa_group(tg, sf);
+	mutex_unlock(&numa_mutex);
+
+	return 0;
+}
+
+static int cpu_numa_group_write_s64(struct cgroup_subsys_state *css,
+				struct cftype *cft, s64 numa_group)
+{
+	int ret;
+	struct task_group *tg = css_tg(css);
+
+	mutex_lock(&numa_mutex);
+	ret = update_tg_numa_group(tg, numa_group);
+	mutex_unlock(&numa_mutex);
+
+	return ret;
+}
 #endif

 static struct cftype cpu_legacy_files[] = {
@@ -7364,6 +7392,11 @@ static struct cftype cpu_legacy_files[] = {
 		.name = "numa_stat",
 		.seq_show = cpu_numa_stat_show,
 	},
+	{
+		.name = "numa_group",
+		.write_s64 = cpu_numa_group_write_s64,
+		.seq_show = cpu_numa_group_show,
+	},
 #endif
 	{ }	/* Terminate */
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2c362266af76..c28ba040a563 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1073,6 +1073,7 @@ struct numa_group {
 	int nr_tasks;
 	pid_t gid;
 	int active_nodes;
+	bool evacuate;

 	struct rcu_head rcu;
 	unsigned long total_faults;
@@ -2246,6 +2247,176 @@ static inline void put_numa_group(struct numa_group *grp)
 		kfree_rcu(grp, rcu);
 }

+void show_tg_numa_group(struct task_group *tg, struct seq_file *sf)
+{
+	int nid;
+	struct numa_group *ng = tg->numa_group;
+
+	if (!ng) {
+		seq_puts(sf, "disabled\n");
+		return;
+	}
+
+	seq_printf(sf, "id %d nr_tasks %d active_nodes %d\n",
+		   ng->gid, ng->nr_tasks, ng->active_nodes);
+
+	for_each_online_node(nid) {
+		int f_idx = task_faults_idx(NUMA_MEM, nid, 0);
+		int pf_idx = task_faults_idx(NUMA_MEM, nid, 1);
+
+		seq_printf(sf, "node %d ", nid);
+
+		seq_printf(sf, "mem_private %lu mem_shared %lu ",
+			   ng->faults[f_idx], ng->faults[pf_idx]);
+
+		seq_printf(sf, "cpu_private %lu cpu_shared %lu\n",
+			   ng->faults_cpu[f_idx], ng->faults_cpu[pf_idx]);
+	}
+}
+
+int update_tg_numa_group(struct task_group *tg, bool numa_group)
+{
+	struct numa_group *ng = tg->numa_group;
+
+	/* if no change then do nothing */
+	if ((ng != NULL) == numa_group)
+		return 0;
+
+	if (ng) {
+		/* put and evacuate tg's numa group */
+		rcu_assign_pointer(tg->numa_group, NULL);
+		ng->evacuate = true;
+		put_numa_group(ng);
+	} else {
+		unsigned int size = sizeof(struct numa_group) +
+				    4*nr_node_ids*sizeof(unsigned long);
+
+		ng = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+		if (!ng)
+			return -ENOMEM;
+
+		refcount_set(&ng->refcount, 1);
+		spin_lock_init(&ng->lock);
+		ng->faults_cpu = ng->faults + NR_NUMA_HINT_FAULT_TYPES *
+						nr_node_ids;
+		/* now make tasks see and join */
+		rcu_assign_pointer(tg->numa_group, ng);
+	}
+
+	return 0;
+}
+
+static bool tg_numa_group(struct task_struct *p)
+{
+	int i;
+	struct task_group *tg;
+	struct numa_group *grp, *my_grp;
+
+	rcu_read_lock();
+
+	tg = task_group(p);
+	if (!tg)
+		goto no_join;
+
+	grp = rcu_dereference(tg->numa_group);
+	my_grp = rcu_dereference(p->numa_group);
+
+	if (!grp)
+		goto no_join;
+
+	if (grp == my_grp) {
+		if (!grp->evacuate)
+			goto joined;
+
+		/*
+		 * Evacuate task from tg's numa group
+		 */
+		rcu_read_unlock();
+
+		spin_lock_irq(&grp->lock);
+
+		for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
+			grp->faults[i] -= p->numa_faults[i];
+
+		grp->total_faults -= p->total_numa_faults;
+		grp->nr_tasks--;
+
+		spin_unlock_irq(&grp->lock);
+
+		rcu_assign_pointer(p->numa_group, NULL);
+
+		put_numa_group(grp);
+
+		return false;
+	}
+
+	if (!get_numa_group(grp))
+		goto no_join;
+
+	rcu_read_unlock();
+
+	/*
+	 * Just join tg's numa group
+	 */
+	if (!my_grp) {
+		spin_lock_irq(&grp->lock);
+
+		if (refcount_read(&grp->refcount) == 2) {
+			grp->gid = p->pid;
+			grp->active_nodes = 1;
+			grp->max_faults_cpu = 0;
+		}
+
+		for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
+			grp->faults[i] += p->numa_faults[i];
+
+		grp->total_faults += p->total_numa_faults;
+		grp->nr_tasks++;
+
+		spin_unlock_irq(&grp->lock);
+		rcu_assign_pointer(p->numa_group, grp);
+
+		return true;
+	}
+
+	/*
+	 * Switch from the task's numa group to the tg's
+	 */
+	double_lock_irq(&my_grp->lock, &grp->lock);
+
+	if (refcount_read(&grp->refcount) == 2) {
+		grp->gid = p->pid;
+		grp->active_nodes = 1;
+		grp->max_faults_cpu = 0;
+	}
+
+	for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) {
+		my_grp->faults[i] -= p->numa_faults[i];
+		grp->faults[i] += p->numa_faults[i];
+	}
+
+	my_grp->total_faults -= p->total_numa_faults;
+	grp->total_faults += p->total_numa_faults;
+
+	my_grp->nr_tasks--;
+	grp->nr_tasks++;
+
+	spin_unlock(&my_grp->lock);
+	spin_unlock_irq(&grp->lock);
+
+	rcu_assign_pointer(p->numa_group, grp);
+
+	put_numa_group(my_grp);
+	return true;
+
+joined:
+	rcu_read_unlock();
+	return true;
+no_join:
+	rcu_read_unlock();
+	return false;
+}
+
 static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 			int *priv)
 {
@@ -2416,7 +2587,9 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 		priv = 1;
 	} else {
 		priv = cpupid_match_pid(p, last_cpupid);
-		if (!priv && !(flags & TNF_NO_GROUP))
+		if (tg_numa_group(p))
+			priv = (flags & TNF_SHARED) ? 0 : priv;
+		else if (!priv && !(flags & TNF_NO_GROUP))
 			task_numa_group(p, last_cpupid, flags, &priv);
 	}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 456f83f7f595..23e4a62cd37b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -408,6 +408,7 @@ struct task_group {

 #ifdef CONFIG_NUMA_BALANCING
 	struct numa_stat __percpu *numa_stat;
+	void *numa_group;
 #endif
 };

@@ -1316,11 +1317,21 @@ extern int migrate_task_to(struct task_struct *p, int cpu);
 extern int migrate_swap(struct task_struct *p, struct task_struct *t,
 			int cpu, int scpu);
 extern void init_numa_balancing(unsigned long clone_flags, struct task_struct *p);
+extern void show_tg_numa_group(struct task_group *tg, struct seq_file *sf);
+extern int update_tg_numa_group(struct task_group *tg, bool numa_group);
 #else
 static inline void
 init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 {
 }
+static inline void
+show_tg_numa_group(struct task_group *tg, struct seq_file *sf)
+{
+}
+update_tg_numa_group(struct task_group *tg, bool numa_group)
+{
+	return 0;
+}
 #endif /* CONFIG_NUMA_BALANCING */

 #ifdef CONFIG_SMP

From patchwork Tue Jul 16 03:41:43 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?b?546L6LSH?= <yun.wang@linux.alibaba.com>
X-Patchwork-Id: 11045201
Return-Path: <linux-fsdevel-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
 [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 612AF13B1
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Tue, 16 Jul 2019 03:42:04 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4ACFF28451
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Tue, 16 Jul 2019 03:42:04 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 3D23D28553; Tue, 16 Jul 2019 03:42:04 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,FROM_EXCESS_BASE64,
	MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham
	version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id EE63A28451
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Tue, 16 Jul 2019 03:42:02 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1731063AbfGPDl7 (ORCPT
        <rfc822;patchwork-linux-fsdevel@patchwork.kernel.org>);
        Mon, 15 Jul 2019 23:41:59 -0400
Received: from out30-54.freemail.mail.aliyun.com ([115.124.30.54]:55773 "EHLO
        out30-54.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1730275AbfGPDl6 (ORCPT
        <rfc822;linux-fsdevel@vger.kernel.org>);
        Mon, 15 Jul 2019 23:41:58 -0400
X-Alimail-AntiSpam: 
 AC=PASS;BC=-1|-1;BR=01201311R331e4;CH=green;DM=||false|;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04423;MF=yun.wang@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0TX1ZkrX_1563248503;
Received: from testdeMacBook-Pro.local(mailfrom:yun.wang@linux.alibaba.com
 fp:SMTPD_---0TX1ZkrX_1563248503)
          by smtp.aliyun-inc.com(127.0.0.1);
          Tue, 16 Jul 2019 11:41:44 +0800
Subject: [PATCH v4 4/4] numa: introduce numa cling feature
From: =?utf-8?b?546L6LSH?= <yun.wang@linux.alibaba.com>
To: Peter Zijlstra <peterz@infradead.org>, hannes@cmpxchg.org,
        mhocko@kernel.org, vdavydov.dev@gmail.com,
        Ingo Molnar <mingo@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, mcgrof@kernel.org,
 keescook@chromium.org, linux-fsdevel@vger.kernel.org,
 cgroups@vger.kernel.org, =?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Hillf Danton <hdanton@sina.com>
References: <209d247e-c1b2-3235-2722-dd7c1f896483@linux.alibaba.com>
 <60b59306-5e36-e587-9145-e90657daec41@linux.alibaba.com>
 <65c1987f-bcce-2165-8c30-cf8cf3454591@linux.alibaba.com>
Message-ID: <9867474c-cc6f-4cc1-5de9-5d17ecf5bb02@linux.alibaba.com>
Date: Tue, 16 Jul 2019 11:41:43 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0)
 Gecko/20100101 Thunderbird/60.7.0
MIME-Version: 1.0
In-Reply-To: <65c1987f-bcce-2165-8c30-cf8cf3454591@linux.alibaba.com>
Content-Language: en-US
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-fsdevel.vger.kernel.org>
X-Mailing-List: linux-fsdevel@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

Although we paid so many effort to settle down task on a particular
node, there are still chances for a task to leave it's preferred
node, that is by wakeup, numa swap migrations or load balance.

When we are using cpu cgroup in share way, since all the workloads
see all the cpus, it could be really bad especially when there
are too many fast wakeup, although now we can numa group the tasks,
they won't really stay on the same node, for example we have numa
group ng_A, ng_B, ng_C, ng_D, it's very likely result as:

	CPU Usage:
		Node 0		Node 1
		ng_A(600%)	ng_A(400%)
		ng_B(400%)	ng_B(600%)
		ng_C(400%)	ng_C(600%)
		ng_D(600%)	ng_D(400%)

	Memory Ratio:
		Node 0		Node 1
		ng_A(60%)	ng_A(40%)
		ng_B(40%)	ng_B(60%)
		ng_C(40%)	ng_C(60%)
		ng_D(60%)	ng_D(40%)

Locality won't be too bad but far from the best situation, we want
a numa group to settle down thoroughly on a particular node, with
every thing balanced.

Thus we introduce the numa cling, which try to prevent tasks leaving
the preferred node on wakeup fast path.

This help thoroughly settle down the workloads on single node, but when
multiple numa group try to settle down on the same node, unbalancing
could happen.

For example we have numa group ng_A, ng_B, ng_C, ng_D, it may result in
situation like:

CPU Usage:
	Node 0		Node 1
	ng_A(1000%)	ng_B(1000%)
	ng_C(400%)	ng_C(600%)
	ng_D(400%)	ng_D(600%)

Memory Ratio:
	Node 0		Node 1
	ng_A(100%)	ng_B(100%)
	ng_C(10%)	ng_C(90%)
	ng_D(10%)	ng_D(90%)

This is because when ng_C, ng_D start to have most of the memory on node
1 at some point, task_x of ng_C stay on node 0 will try to do numa swap
migration with the task_y of ng_D stay on node 1 as long as load balanced,
the result is task_x stay on node 1 and task_y stay on node 0, while both
of them prefer node 1.

Now when other tasks of ng_D stay on node 1 wakeup task_y, task_y will
very likely go back to node 1, and since numa cling enabled, it will
keep stay on node 1 although load unbalanced, this could be frequently
and more and more tasks will prefer the node 1 and make it busy.

So the key point here is to stop doing numa cling when load starting to
become unbalancing.

We achieved this by monitoring the migration failure ratio, in scenery
above, too much tasks prefer node 1 and will keep migrating to it, load
unbalancing could lead into the migration failure in this case, and when
the failure ratio above the specified degree, we pause the cling and try
to resettle the workloads on a better node by stop tasks prefer the busy
node, this will finally give us the result like:

CPU Usage:
	Node 0		Node 1
	ng_A(1000%)	ng_B(1000%)
	ng_C(1000%)	ng_D(1000%)

Memory Ratio:
	Node 0		Node 1
	ng_A(100%)	ng_B(100%)
	ng_C(100%)	ng_D(100%)

Now we achieved the best locality and maximum hot cache benefit.

Tested on a 2 node box with 96 cpus, do sysbench-mysql-oltp_read_write
testing, X mysqld instances created and attached to X cgroups, X sysbench
instances then created and attached to corresponding cgroup to test the
mysql with oltp_read_write script for 20 minutes, average eps show:

				origin		ng + cling
4 instances each 24 threads	7641.27		8010.18		+4.83%
4 instances each 48 threads	9423.39		10021.03	+6.34%
4 instances each 72 threads	9691.47		10192.73	+5.17%

8 instances each 24 threads	4485.44		4577.95		+2.06%
8 instances each 48 threads	5565.06		5737.50		+3.10%
8 instances each 72 threads	5605.20		5752.33		+2.63%

Also tested with perf-bench-numa, dbench, sysbench-memory, pgbench, tiny
improvement observed.

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
Since v3:
  * numa cling no longer override select_idle_sibling, instead we
    prevent numa swap migration with tasks cling to dst-node, also
    prevent wake affine to drag tasks away which already cling to
    prev-cpu
  * refine comments

 include/linux/sched/sysctl.h |   3 +
 kernel/sched/fair.c          | 296 ++++++++++++++++++++++++++++++++++++++++---
 kernel/sysctl.c              |   9 ++
 3 files changed, 292 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..6eef34331dd2 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -38,6 +38,9 @@ extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;

+extern unsigned int sysctl_numa_balancing_cling_degree;
+extern unsigned int max_numa_balancing_cling_degree;
+
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c28ba040a563..e7525bda5a94 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1066,6 +1066,20 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;

+/*
+ * The numa group serving task group will enable numa cling, a feature
+ * which try to prevent task leaving preferred node on wakeup.
+ *
+ * This help settle down the workloads thorouly and quickly on node,
+ * while introduce the risk of load unbalancing.
+ *
+ * In order to detect the risk in advance and pause the feature, we
+ * rely on numa migration failure stats, and when failure ratio above
+ * cling degree, we pause the numa cling until resettle done.
+ */
+unsigned int sysctl_numa_balancing_cling_degree = 20;
+unsigned int max_numa_balancing_cling_degree = 100;
+
 struct numa_group {
 	refcount_t refcount;

@@ -1073,11 +1087,15 @@ struct numa_group {
 	int nr_tasks;
 	pid_t gid;
 	int active_nodes;
+	int busiest_nid;
 	bool evacuate;
+	bool do_cling;
+	struct timer_list cling_timer;

 	struct rcu_head rcu;
 	unsigned long total_faults;
 	unsigned long max_faults_cpu;
+	unsigned long *migrate_stat;
 	/*
 	 * Faults_cpu is used to decide whether memory should move
 	 * towards the CPU. As a consequence, these stats are weighted
@@ -1087,6 +1105,8 @@ struct numa_group {
 	unsigned long faults[0];
 };

+static inline bool busy_node(struct numa_group *ng, int nid);
+
 static inline unsigned long group_faults_priv(struct numa_group *ng);
 static inline unsigned long group_faults_shared(struct numa_group *ng);

@@ -1131,8 +1151,14 @@ static unsigned int task_scan_start(struct task_struct *p)
 	unsigned long smin = task_scan_min(p);
 	unsigned long period = smin;

-	/* Scale the maximum scan period with the amount of shared memory. */
-	if (p->numa_group) {
+	/*
+	 * Scale the maximum scan period with the amount of shared memory.
+	 *
+	 * Not for the numa group serving task group, it's tasks are not
+	 * gathered for sharing memory, and we need to detect migration
+	 * failure in time.
+	 */
+	if (p->numa_group && !p->numa_group->do_cling) {
 		struct numa_group *ng = p->numa_group;
 		unsigned long shared = group_faults_shared(ng);
 		unsigned long private = group_faults_priv(ng);
@@ -1153,8 +1179,14 @@ static unsigned int task_scan_max(struct task_struct *p)
 	/* Watch for min being lower than max due to floor calculations */
 	smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p);

-	/* Scale the maximum scan period with the amount of shared memory. */
-	if (p->numa_group) {
+	/*
+	 * Scale the maximum scan period with the amount of shared memory.
+	 *
+	 * Not for the numa group serving task group, it's tasks are not
+	 * gathered for sharing memory, and we need to detect migration
+	 * failure in time.
+	 */
+	if (p->numa_group && !p->numa_group->do_cling) {
 		struct numa_group *ng = p->numa_group;
 		unsigned long shared = group_faults_shared(ng);
 		unsigned long private = group_faults_priv(ng);
@@ -1474,6 +1506,19 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 					ACTIVE_NODE_FRACTION)
 		return true;

+	/*
+	 * Make sure pages do not stay on a busy node when numa cling
+	 * enabled, otherwise they could lead into more numa migration
+	 * to the busy node.
+	 */
+	if (ng->do_cling) {
+		if (busy_node(ng, dst_nid))
+			return false;
+
+		if (busy_node(ng, src_nid))
+			return true;
+	}
+
 	/*
 	 * Distribute memory according to CPU & memory use on each node,
 	 * with 3/4 hysteresis to avoid unnecessary memory migrations:
@@ -1592,6 +1637,9 @@ static bool load_too_imbalanced(long src_load, long dst_load,
  */
 #define SMALLIMP	30

+static inline bool
+task_numa_cling(struct task_struct *p, int snid, int dnid);
+
 /*
  * This checks if the overall compute and NUMA accesses of the system would
  * be improved if the source tasks was migrated to the target dst_cpu taking
@@ -1710,6 +1758,10 @@ static void task_numa_compare(struct task_numa_env *env,
 		env->dst_cpu = select_idle_sibling(env->p, env->src_cpu,
 						   env->dst_cpu);
 		local_irq_enable();
+	} else {
+		/* Do not swap with a task cling to 'dst_nid' */
+		if (task_numa_cling(cur, env->dst_nid, env->src_nid))
+			goto unlock;
 	}

 	task_numa_assign(env, cur, imp);
@@ -1873,9 +1925,191 @@ static int task_numa_migrate(struct task_struct *p)
 	return ret;
 }

+/*
+ * We scale the migration stat count to 1024, divide the maximum numa
+ * balancing scan period by 10 and make that the period of cling timer,
+ * this help to decay one count to 0 after one maximum scan period passed.
+ */
+#define NUMA_MIGRATE_SCALE 10
+#define NUMA_MIGRATE_WEIGHT 1024
+
+enum numa_migrate_stats {
+	FAILURE_SCALED,
+	TOTAL_SCALED,
+	FAILURE_RATIO,
+};
+
+static inline int mstat_idx(int nid, enum numa_migrate_stats s)
+{
+	return (nid + s * nr_node_ids);
+}
+
+static inline unsigned long
+mstat_failure_scaled(struct numa_group *ng, int nid)
+{
+	return ng->migrate_stat[mstat_idx(nid, FAILURE_SCALED)];
+}
+
+static inline unsigned long
+mstat_total_scaled(struct numa_group *ng, int nid)
+{
+	return ng->migrate_stat[mstat_idx(nid, TOTAL_SCALED)];
+}
+
+static inline unsigned long
+mstat_failure_ratio(struct numa_group *ng, int nid)
+{
+	return ng->migrate_stat[mstat_idx(nid, FAILURE_RATIO)];
+}
+
+/*
+ * A node is busy when the numa migration toward it failed too much,
+ * this imply the load already unbalancing for too much numa cling on
+ * that node.
+ */
+static inline bool busy_node(struct numa_group *ng, int nid)
+{
+	int degree = sysctl_numa_balancing_cling_degree;
+
+	if (mstat_failure_scaled(ng, nid) < NUMA_MIGRATE_WEIGHT)
+		return false;
+
+	/*
+	 * Allow only one busy node in one numa group, to prevent
+	 * ping-pong migration case between nodes.
+	 */
+	if (ng->busiest_nid != nid)
+		return false;
+
+	return mstat_failure_ratio(ng, nid) > degree;
+}
+
+/*
+ * Return true if the task should cling to snid, when it preferred snid
+ * rather than dnid and snid is not busy.
+ */
+static inline bool
+task_numa_cling(struct task_struct *p, int snid, int dnid)
+{
+	bool ret = false;
+	int pnid = p->numa_preferred_nid;
+	struct numa_group *ng;
+
+	rcu_read_lock();
+
+	ng = p->numa_group;
+
+	/* Do cling only when the feature enabled and not in pause */
+	if (!ng || !ng->do_cling)
+		goto out;
+
+	if (pnid == NUMA_NO_NODE ||
+	    dnid == pnid ||
+	    snid != pnid)
+		goto out;
+
+	/* Never allow cling to a busy node */
+	if (busy_node(ng, snid))
+		goto out;
+
+	ret = true;
+out:
+	rcu_read_unlock();
+	return ret;
+}
+
+/*
+ * Prevent more tasks from prefer the busy node to easy the unbalancing,
+ * also give the second candidate a chance.
+ */
+static inline bool group_pause_prefer(struct numa_group *ng, int nid)
+{
+	if (!ng || !ng->do_cling)
+		return false;
+
+	return busy_node(ng, nid);
+}
+
+static inline void update_failure_ratio(struct numa_group *ng, int nid)
+{
+	int f_idx = mstat_idx(nid, FAILURE_SCALED);
+	int t_idx = mstat_idx(nid, TOTAL_SCALED);
+	int fp_idx = mstat_idx(nid, FAILURE_RATIO);
+
+	ng->migrate_stat[fp_idx] =
+		ng->migrate_stat[f_idx] * 100 / (ng->migrate_stat[t_idx] + 1);
+}
+
+static void cling_timer_func(struct timer_list *t)
+{
+	int nid;
+	unsigned int degree;
+	unsigned long period, max_failure;
+	struct numa_group *ng = from_timer(ng, t, cling_timer);
+
+	degree = sysctl_numa_balancing_cling_degree;
+	period = msecs_to_jiffies(sysctl_numa_balancing_scan_period_max);
+	period /= NUMA_MIGRATE_SCALE;
+
+	spin_lock_irq(&ng->lock);
+
+	max_failure = 0;
+	for_each_online_node(nid) {
+		int f_idx = mstat_idx(nid, FAILURE_SCALED);
+		int t_idx = mstat_idx(nid, TOTAL_SCALED);
+
+		ng->migrate_stat[f_idx] /= 2;
+		ng->migrate_stat[t_idx] /= 2;
+
+		update_failure_ratio(ng, nid);
+
+		if (ng->migrate_stat[f_idx] > max_failure) {
+			ng->busiest_nid = nid;
+			max_failure = ng->migrate_stat[f_idx];
+		}
+	}
+
+	spin_unlock_irq(&ng->lock);
+
+	mod_timer(&ng->cling_timer, jiffies + period);
+}
+
+static inline void
+update_migrate_stat(struct task_struct *p, int nid, bool failed)
+{
+	int idx;
+	struct numa_group *ng = p->numa_group;
+
+	if (!ng || !ng->do_cling)
+		return;
+
+	spin_lock_irq(&ng->lock);
+
+	if (failed) {
+		idx = mstat_idx(nid, FAILURE_SCALED);
+		ng->migrate_stat[idx] += NUMA_MIGRATE_WEIGHT;
+	}
+
+	idx = mstat_idx(nid, TOTAL_SCALED);
+	ng->migrate_stat[idx] += NUMA_MIGRATE_WEIGHT;
+	update_failure_ratio(ng, nid);
+
+	spin_unlock_irq(&ng->lock);
+
+	/*
+	 * On failed task may prefer source node instead, this
+	 * cause ping-pong migration when numa cling enabled,
+	 * so let's reset the preferred node to none.
+	 */
+	if (failed)
+		sched_setnuma(p, NUMA_NO_NODE);
+}
+
 /* Attempt to migrate a task to a CPU on the preferred node. */
 static void numa_migrate_preferred(struct task_struct *p)
 {
+	bool failed;
+	int target;
 	unsigned long interval = HZ;

 	/* This task has no NUMA fault statistics yet */
@@ -1890,8 +2124,12 @@ static void numa_migrate_preferred(struct task_struct *p)
 	if (task_node(p) == p->numa_preferred_nid)
 		return;

+	target = p->numa_preferred_nid;
+
 	/* Otherwise, try migrate to a CPU on the preferred node */
-	task_numa_migrate(p);
+	failed = (task_numa_migrate(p) != 0);
+
+	update_migrate_stat(p, target, failed);
 }

 /*
@@ -2215,7 +2453,8 @@ static void task_numa_placement(struct task_struct *p)
 				max_faults = faults;
 				max_nid = nid;
 			}
-		} else if (group_faults > max_faults) {
+		} else if (group_faults > max_faults &&
+			   !group_pause_prefer(p->numa_group, nid)) {
 			max_faults = group_faults;
 			max_nid = nid;
 		}
@@ -2257,8 +2496,10 @@ void show_tg_numa_group(struct task_group *tg, struct seq_file *sf)
 		return;
 	}

-	seq_printf(sf, "id %d nr_tasks %d active_nodes %d\n",
-		   ng->gid, ng->nr_tasks, ng->active_nodes);
+	spin_lock_irq(&ng->lock);
+
+	seq_printf(sf, "id %d nr_tasks %d active_nodes %d busiest_nid %d\n",
+		   ng->gid, ng->nr_tasks, ng->active_nodes, ng->busiest_nid);

 	for_each_online_node(nid) {
 		int f_idx = task_faults_idx(NUMA_MEM, nid, 0);
@@ -2269,9 +2510,16 @@ void show_tg_numa_group(struct task_group *tg, struct seq_file *sf)
 		seq_printf(sf, "mem_private %lu mem_shared %lu ",
 			   ng->faults[f_idx], ng->faults[pf_idx]);

-		seq_printf(sf, "cpu_private %lu cpu_shared %lu\n",
+		seq_printf(sf, "cpu_private %lu cpu_shared %lu ",
 			   ng->faults_cpu[f_idx], ng->faults_cpu[pf_idx]);
+
+		seq_printf(sf, "migrate_stat %lu %lu %lu\n",
+			   mstat_failure_scaled(ng, nid),
+			   mstat_total_scaled(ng, nid),
+			   mstat_failure_ratio(ng, nid));
 	}
+
+	spin_unlock_irq(&ng->lock);
 }

 int update_tg_numa_group(struct task_group *tg, bool numa_group)
@@ -2285,20 +2533,26 @@ int update_tg_numa_group(struct task_group *tg, bool numa_group)
 	if (ng) {
 		/* put and evacuate tg's numa group */
 		rcu_assign_pointer(tg->numa_group, NULL);
+		del_timer_sync(&ng->cling_timer);
 		ng->evacuate = true;
 		put_numa_group(ng);
 	} else {
 		unsigned int size = sizeof(struct numa_group) +
-				    4*nr_node_ids*sizeof(unsigned long);
+				    7*nr_node_ids*sizeof(unsigned long);
+		unsigned int offset = NR_NUMA_HINT_FAULT_TYPES * nr_node_ids;

 		ng = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
 		if (!ng)
 			return -ENOMEM;

 		refcount_set(&ng->refcount, 1);
+		ng->busiest_nid = NUMA_NO_NODE;
+		ng->do_cling = true;
+		timer_setup(&ng->cling_timer, cling_timer_func, 0);
 		spin_lock_init(&ng->lock);
-		ng->faults_cpu = ng->faults + NR_NUMA_HINT_FAULT_TYPES *
-						nr_node_ids;
+		ng->faults_cpu = ng->faults + offset;
+		ng->migrate_stat = ng->faults_cpu + offset;
+		add_timer(&ng->cling_timer);
 		/* now make tasks see and join */
 		rcu_assign_pointer(tg->numa_group, ng);
 	}
@@ -2435,6 +2689,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 			return;

 		refcount_set(&grp->refcount, 1);
+		grp->busiest_nid = NUMA_NO_NODE;
 		grp->active_nodes = 1;
 		grp->max_faults_cpu = 0;
 		spin_lock_init(&grp->lock);
@@ -2921,6 +3176,11 @@ static inline void update_scan_period(struct task_struct *p, int new_cpu)
 {
 }

+static inline bool task_numa_cling(struct task_struct *p, int snid, int dnid)
+{
+	return false;
+}
+
 #endif /* CONFIG_NUMA_BALANCING */

 static void
@@ -6674,8 +6934,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			new_cpu = prev_cpu;
 		}

-		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
-			      cpumask_test_cpu(cpu, p->cpus_ptr);
+		want_affine = !wake_wide(p) &&
+			      !wake_cap(p, cpu, prev_cpu) &&
+			      cpumask_test_cpu(cpu, p->cpus_ptr) &&
+			      !task_numa_cling(p, cpu_to_node(prev_cpu),
+						cpu_to_node(cpu));
 	}

 	rcu_read_lock();
@@ -6707,12 +6970,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
 	} else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
 		/* Fast path */
-
 		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);

 		if (want_affine)
 			current->recent_used_cpu = cpu;
 	}
+
 	rcu_read_unlock();

 	return new_cpu;
@@ -7384,7 +7647,8 @@ static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)

 	/* Migrating away from the preferred node is always bad. */
 	if (src_nid == p->numa_preferred_nid) {
-		if (env->src_rq->nr_running > env->src_rq->nr_preferred_running)
+		if (task_numa_cling(p, src_nid, dst_nid) ||
+		    env->src_rq->nr_running > env->src_rq->nr_preferred_running)
 			return 1;
 		else
 			return -1;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 078950d9605b..0a889dd1c7ed 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -417,6 +417,15 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "numa_balancing_cling_degree",
+		.data		= &sysctl_numa_balancing_cling_degree,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &max_numa_balancing_cling_degree,
+	},
 	{
 		.procname	= "numa_balancing",
 		.data		= NULL, /* filled in by handler */