From patchwork Mon Apr 22 02:11:24 2019
Subject: [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic
From: 王贇 (Michael Wang) <yun.wang@linux.alibaba.com>
To: Peter Zijlstra, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com, Ingo Molnar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Date: Mon, 22 Apr 2019 10:11:24 +0800

This patch introduces a NUMA locality statistic, which tries to reflect the
NUMA balancing efficiency of each memory cgroup.

After 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', a new output
line headed with 'locality' appears, in the format:

  locality 0~9% 10%~19% 20%~29% 30%~39% 40%~49% 50%~59% 60%~69% 70%~79% 80%~89% 90%~100%

Each interval stands for the percentage of local page accesses observed on a
task's last NUMA balancing pass, which we call the NUMA balancing locality.
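As the changelog goes on to explain, each field on that line is a tick
counter. Below is a minimal userspace sketch (not part of the patch) of how a
monitoring tool could consume the new line; the cgroup path is a placeholder
and the 30% cut-off is an arbitrary example:

#include <stdio.h>
#include <string.h>

#define NR_NL_INTERVAL 10	/* ten locality buckets, as in the patch */

int main(void)
{
	const char *path =
		"/sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat";
	unsigned long long b[NR_NL_INTERVAL] = {0};
	unsigned long long sum = 0, low = 0;
	char line[1024];
	FILE *fp = fopen(path, "r");
	int i;

	if (!fp) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), fp)) {
		if (strncmp(line, "locality ", 9))
			continue;
		sscanf(line + 9,
		       "%llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
		       &b[0], &b[1], &b[2], &b[3], &b[4],
		       &b[5], &b[6], &b[7], &b[8], &b[9]);
	}
	fclose(fp);

	for (i = 0; i < NR_NL_INTERVAL; i++) {
		sum += b[i];
		if (i < 3)	/* 0~9%, 10%~19%, 20%~29% buckets */
			low += b[i];
	}
	if (sum)
		printf("ticks below 30%% locality: %.1f%%\n",
		       100.0 * low / sum);
	return 0;
}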
And the number means inside the cgroup, how many ticks we hit tasks with such locality are running, for example: locality 7260278 54860 90493 209327 295801 462784 558897 667242 2786324 7399308 the 7260278 means that this cgroup have some tasks with 0~9% locality executed 7260278 ticks. By monitoring the increment, we can check if the workload of a particular cgroup is doing well with numa, when most of the tasks are running with locality 0~9%, then something is wrong with your numa policy. Signed-off-by: Michael Wang --- include/linux/memcontrol.h | 38 +++++++++++++++++++++++++++++++++++ include/linux/sched.h | 8 +++++++- kernel/sched/debug.c | 7 +++++++ kernel/sched/fair.c | 8 ++++++++ mm/huge_memory.c | 4 +--- mm/memcontrol.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++ mm/memory.c | 5 ++--- 7 files changed, 113 insertions(+), 7 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 534267947664..bb62e6294484 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -179,6 +179,27 @@ enum memcg_kmem_state { KMEM_ONLINE, }; +#ifdef CONFIG_NUMA_BALANCING + +enum memcg_numa_locality_interval { + PERCENT_0_9, + PERCENT_10_19, + PERCENT_20_29, + PERCENT_30_39, + PERCENT_40_49, + PERCENT_50_59, + PERCENT_60_69, + PERCENT_70_79, + PERCENT_80_89, + PERCENT_90_100, + NR_NL_INTERVAL, +}; + +struct memcg_stat_numa { + u64 locality[NR_NL_INTERVAL]; +}; + +#endif #if defined(CONFIG_SMP) struct memcg_padding { char x[0]; @@ -311,6 +332,10 @@ struct mem_cgroup { struct list_head event_list; spinlock_t event_list_lock; +#ifdef CONFIG_NUMA_BALANCING + struct memcg_stat_numa __percpu *stat_numa; +#endif + struct mem_cgroup_per_node *nodeinfo[0]; /* WARNING: nodeinfo must be the last member here */ }; @@ -818,6 +843,14 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, void mem_cgroup_split_huge_fixup(struct page *head); #endif +#ifdef CONFIG_NUMA_BALANCING +extern void memcg_stat_numa_update(struct task_struct *p); +#else +static inline void memcg_stat_numa_update(struct task_struct *p) +{ +} +#endif + #else /* CONFIG_MEMCG */ #define MEM_CGROUP_ID_SHIFT 0 @@ -1156,6 +1189,11 @@ static inline void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { } + +static inline void memcg_stat_numa_update(struct task_struct *p) +{ +} + #endif /* CONFIG_MEMCG */ /* idx can be of type enum memcg_stat_item or node_stat_item */ diff --git a/include/linux/sched.h b/include/linux/sched.h index 1a3c28d997d4..0b01262d110d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1049,8 +1049,14 @@ struct task_struct { * scan window were remote/local or failed to migrate. 
The task scan * period is adapted based on the locality of the faults with different * weights depending on whether they were shared or private faults + * + * 0 -- remote faults + * 1 -- local faults + * 2 -- page migration failure + * 3 -- remote page accessing after page migration + * 4 -- local page accessing after page migration */ - unsigned long numa_faults_locality[3]; + unsigned long numa_faults_locality[5]; unsigned long numa_pages_migrated; #endif /* CONFIG_NUMA_BALANCING */ diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 8039d62ae36e..2898f5fa4fba 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -873,6 +873,13 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m) SEQ_printf(m, "current_node=%d, numa_group_id=%d\n", task_node(p), task_numa_group_id(p)); show_numa_stats(p, m); + SEQ_printf(m, "faults_locality local=%lu remote=%lu failed=%lu ", + p->numa_faults_locality[1], + p->numa_faults_locality[0], + p->numa_faults_locality[2]); + SEQ_printf(m, "lhit=%lu rhit=%lu\n", + p->numa_faults_locality[4], + p->numa_faults_locality[3]); mpol_put(pol); #endif } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index fdab7eb6f351..ba5a67139d57 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -23,6 +23,7 @@ #include "sched.h" #include +#include /* * Targeted preemption latency for CPU-bound tasks: @@ -2387,6 +2388,11 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality)); } + p->numa_faults_locality[mem_node == numa_node_id() ? 4 : 3] += pages; + + if (mem_node == NUMA_NO_NODE) + return; + /* * First accesses are treated as private, otherwise consider accesses * to be private if the accessing pid has not changed @@ -2604,6 +2610,8 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr) if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work) return; + memcg_stat_numa_update(curr); + /* * Using runtime rather than walltime has the dual advantage that * we (mostly) drive the selection from busy threads and that the diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 404acdcd0455..2614ce725a63 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1621,9 +1621,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) if (anon_vma) page_unlock_anon_vma_read(anon_vma); - if (page_nid != NUMA_NO_NODE) - task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, - flags); + task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, flags); return 0; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c532f8685aa3..b810d4e9c906 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -66,6 +66,7 @@ #include #include #include +#include #include "internal.h" #include #include @@ -3396,10 +3397,50 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v) seq_putc(m, '\n'); } +#ifdef CONFIG_NUMA_BALANCING + seq_puts(m, "locality"); + for (nr = 0; nr < NR_NL_INTERVAL; nr++) { + int cpu; + u64 sum = 0; + + for_each_possible_cpu(cpu) + sum += per_cpu(memcg->stat_numa->locality[nr], cpu); + + seq_printf(m, " %llu", sum); + } + seq_putc(m, '\n'); +#endif + return 0; } #endif /* CONFIG_NUMA */ +#ifdef CONFIG_NUMA_BALANCING + +void memcg_stat_numa_update(struct task_struct *p) +{ + struct mem_cgroup *memcg; + unsigned long remote = p->numa_faults_locality[3]; + unsigned long local = p->numa_faults_locality[4]; + unsigned long idx = -1; + + if (mem_cgroup_disabled()) + return; + + if (remote || local) { + idx = (local * 10) 
/ (remote + local); + if (idx >= NR_NL_INTERVAL) + idx = NR_NL_INTERVAL - 1; + } + + rcu_read_lock(); + memcg = mem_cgroup_from_task(p); + if (idx != -1) + this_cpu_inc(memcg->stat_numa->locality[idx]); + rcu_read_unlock(); +} +#endif + /* Universal VM events cgroup1 shows, original sort order */ static const unsigned int memcg1_events[] = { PGPGIN, @@ -4435,6 +4476,9 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) for_each_node(node) free_mem_cgroup_per_node_info(memcg, node); +#ifdef CONFIG_NUMA_BALANCING + free_percpu(memcg->stat_numa); +#endif free_percpu(memcg->vmstats_percpu); kfree(memcg); } @@ -4468,6 +4512,12 @@ static struct mem_cgroup *mem_cgroup_alloc(void) if (!memcg->vmstats_percpu) goto fail; +#ifdef CONFIG_NUMA_BALANCING + memcg->stat_numa = alloc_percpu(struct memcg_stat_numa); + if (!memcg->stat_numa) + goto fail; +#endif + for_each_node(node) if (alloc_mem_cgroup_per_node_info(memcg, node)) goto fail; diff --git a/mm/memory.c b/mm/memory.c index c0391a9f18b8..fb0c1d940d36 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3609,7 +3609,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) struct vm_area_struct *vma = vmf->vma; struct page *page = NULL; int page_nid = NUMA_NO_NODE; - int last_cpupid; + int last_cpupid = 0; int target_nid; bool migrated = false; pte_t pte, old_pte; @@ -3689,8 +3689,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) flags |= TNF_MIGRATE_FAIL; out: - if (page_nid != NUMA_NO_NODE) - task_numa_fault(last_cpupid, page_nid, 1, flags); + task_numa_fault(last_cpupid, page_nid, 1, flags); return 0; } From patchwork Mon Apr 22 02:12:20 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: =?utf-8?b?546L6LSH?= X-Patchwork-Id: 10910565 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 06D8417E0 for ; Mon, 22 Apr 2019 02:12:27 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id EC19928686 for ; Mon, 22 Apr 2019 02:12:26 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id DE50C286B3; Mon, 22 Apr 2019 02:12:26 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-1.9 required=2.0 tests=BAYES_00,FROM_EXCESS_BASE64, MAILING_LIST_MULTI,RCVD_IN_DNSWL_NONE,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6C8CE28686 for ; Mon, 22 Apr 2019 02:12:26 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 9F83F6B0007; Sun, 21 Apr 2019 22:12:25 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 980736B0008; Sun, 21 Apr 2019 22:12:25 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 849D26B000A; Sun, 21 Apr 2019 22:12:25 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from mail-pf1-f197.google.com (mail-pf1-f197.google.com [209.85.210.197]) by kanga.kvack.org (Postfix) with ESMTP id 476B56B0007 for ; Sun, 21 Apr 2019 22:12:25 -0400 (EDT) Received: by mail-pf1-f197.google.com with SMTP id l74so6704919pfb.23 for ; Sun, 21 Apr 2019 19:12:25 -0700 
Subject: [RFC PATCH 2/5] numa: append per-node execution info in memory.numa_stat
From: 王贇 (Michael Wang) <yun.wang@linux.alibaba.com>
To: Peter Zijlstra, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com, Ingo Molnar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Date: Mon, 22 Apr 2019 10:12:20 +0800

This patch introduces per-node execution information, to help judge the NUMA
efficiency.

After 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', a new output
line headed with 'exectime' appears, like:

  exectime 24399843 27865444

which means the tasks of this cgroup executed 24399843 ticks on node 0 and
27865444 ticks on node 1.

Combined with the per-node memory info, we can estimate the NUMA efficiency.
For example, when memory.numa_stat shows:

  total=4613257 N0=6849 N1=3928327
  ...
  exectime 24399843 27865444

there could be unmovable or page-cache pages on N1 which NUMA balancing does
not trace, so a good locality score alone could be misleading; binding the
workload to the CPUs of N1 is then worth a try, in order to achieve the
maximum performance bonus.
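For illustration only, the tiny userspace program below turns the numbers
quoted above into per-node execution and memory percentages; it is not part
of the patch and simply hard-codes the example values:

#include <stdio.h>

int main(void)
{
	/* values quoted in the changelog above */
	unsigned long long exectime[2] = { 24399843ULL, 27865444ULL };
	unsigned long long pages[2]    = { 6849ULL, 3928327ULL };
	unsigned long long esum = exectime[0] + exectime[1];
	unsigned long long psum = pages[0] + pages[1];
	int nid;

	for (nid = 0; nid < 2; nid++)
		printf("node %d: %5.1f%% of ticks, %5.1f%% of pages\n", nid,
		       100.0 * exectime[nid] / esum,
		       100.0 * pages[nid] / psum);
	/*
	 * Prints roughly 46.7% / 0.2% for node 0 and 53.3% / 99.8% for
	 * node 1: execution is split across both nodes while the pages
	 * sit on N1, which is why the changelog suggests binding the
	 * workload to N1's CPUs.
	 */
	return 0;
}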
Signed-off-by: Michael Wang
---
 include/linux/memcontrol.h |  1 +
 mm/memcontrol.c            | 13 +++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bb62e6294484..e784d6252d5e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -197,6 +197,7 @@ enum memcg_numa_locality_interval {
 
 struct memcg_stat_numa {
 	u64 locality[NR_NL_INTERVAL];
+	u64 exectime;
 };
 
 #endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b810d4e9c906..91bcd71fc38a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3409,6 +3409,18 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 		seq_printf(m, " %llu", sum);
 	}
 	seq_putc(m, '\n');
+
+	seq_puts(m, "exectime");
+	for_each_online_node(nr) {
+		int cpu;
+		u64 sum = 0;
+
+		for_each_cpu(cpu, cpumask_of_node(nr))
+			sum += per_cpu(memcg->stat_numa->exectime, cpu);
+
+		seq_printf(m, " %llu", sum);
+	}
+	seq_putc(m, '\n');
 #endif
 
 	return 0;
@@ -3437,6 +3449,7 @@ void memcg_stat_numa_update(struct task_struct *p)
 	memcg = mem_cgroup_from_task(p);
 	if (idx != -1)
 		this_cpu_inc(memcg->stat_numa->locality[idx]);
+	this_cpu_inc(memcg->stat_numa->exectime);
 	rcu_read_unlock();
 }
 #endif

From patchwork Mon Apr 22 02:13:36 2019
Subject: [RFC PATCH 3/5] numa: introduce per-cgroup preferred numa node
From: 王贇 (Michael Wang) <yun.wang@linux.alibaba.com>
To: Peter Zijlstra, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com, Ingo Molnar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Date: Mon, 22 Apr 2019 10:13:36 +0800

This patch adds a new entry, 'numa_preferred', to each memory cgroup, by which
we can override the memory policy of the tasks inside a particular cgroup.
Combined with NUMA balancing, we can now migrate the workload of a cgroup to
the specified NUMA node in a gentle way.

Load balancing and NUMA preference work against each other when placing tasks
on CPUs, which leads to the situation that tasks still spread out even though
one node alone is capable of holding the whole workload. To acquire the NUMA
benefit in this situation, load balancing should respect the preference
decision as long as the balancing itself is not broken.

This patch forbids workloads from leaving the memcg's preferred node, but only
while a preferred node is configured; if load balancing cannot find other
tasks to move and keeps failing, we give up and allow the migration to happen.
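A hedged usage sketch, not part of the patch: assuming the new cftype shows up
as 'memory.numa_preferred' under the cgroup v1 memory controller, a preferred
node could be configured from userspace roughly as follows, with -1 standing
for NUMA_NO_NODE to clear the preference:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	/* example path; adjust CGROUP_PATH to the target cgroup */
	const char *pref =
		"/sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_preferred";
	int nid = argc > 1 ? atoi(argv[1]) : -1;	/* -1 == NUMA_NO_NODE */
	char buf[64] = "";
	FILE *fp = fopen(pref, "w");

	if (!fp) {
		perror("fopen");
		return 1;
	}
	fprintf(fp, "%d\n", nid);	/* e.g. "1" to prefer node 1 */
	fclose(fp);

	fp = fopen(pref, "r");
	if (fp && fgets(buf, sizeof(buf), fp))
		printf("numa_preferred is now: %s", buf);
	if (fp)
		fclose(fp);
	return 0;
}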
Signed-off-by: Michael Wang --- include/linux/memcontrol.h | 34 +++++++++++++++++++ include/linux/sched.h | 1 + kernel/sched/debug.c | 1 + kernel/sched/fair.c | 33 +++++++++++++++++++ mm/huge_memory.c | 3 ++ mm/memcontrol.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++ mm/memory.c | 4 +++ mm/mempolicy.c | 4 +++ 8 files changed, 162 insertions(+) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index e784d6252d5e..0fd5eeb27c4f 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -335,6 +335,8 @@ struct mem_cgroup { #ifdef CONFIG_NUMA_BALANCING struct memcg_stat_numa __percpu *stat_numa; + s64 numa_preferred; + struct mutex numa_mutex; #endif struct mem_cgroup_per_node *nodeinfo[0]; @@ -846,10 +848,26 @@ void mem_cgroup_split_huge_fixup(struct page *head); #ifdef CONFIG_NUMA_BALANCING extern void memcg_stat_numa_update(struct task_struct *p); +extern int memcg_migrate_prep(int target_nid, int page_nid); +extern int memcg_preferred_nid(struct task_struct *p, gfp_t gfp); +extern struct page *alloc_page_numa_preferred(gfp_t gfp, unsigned int order); #else static inline void memcg_stat_numa_update(struct task_struct *p) { } +static inline int memcg_migrate_prep(int target_nid, int page_nid) +{ + return target_nid; +} +static inline int memcg_preferred_nid(struct task_struct *p, gfp_t gfp) +{ + return -1; +} +static inline struct page *alloc_page_numa_preferred(gfp_t gfp, + unsigned int order) +{ + return NULL; +} #endif #else /* CONFIG_MEMCG */ @@ -1195,6 +1213,22 @@ static inline void memcg_stat_numa_update(struct task_struct *p) { } +static inline int memcg_migrate_prep(int target_nid, int page_nid) +{ + return target_nid; +} + +static inline int memcg_preferred_nid(struct task_struct *p, gfp_t gfp) +{ + return -1; +} + +static inline struct page *alloc_page_numa_preferred(gfp_t gfp, + unsigned int order) +{ + return NULL; +} + #endif /* CONFIG_MEMCG */ /* idx can be of type enum memcg_stat_item or node_stat_item */ diff --git a/include/linux/sched.h b/include/linux/sched.h index 0b01262d110d..9f931db1d31f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -422,6 +422,7 @@ struct sched_statistics { u64 nr_migrations_cold; u64 nr_failed_migrations_affine; u64 nr_failed_migrations_running; + u64 nr_failed_migrations_memcg; u64 nr_failed_migrations_hot; u64 nr_forced_migrations; diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 2898f5fa4fba..32f5fd66f0fe 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -934,6 +934,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns, P_SCHEDSTAT(se.statistics.nr_migrations_cold); P_SCHEDSTAT(se.statistics.nr_failed_migrations_affine); P_SCHEDSTAT(se.statistics.nr_failed_migrations_running); + P_SCHEDSTAT(se.statistics.nr_failed_migrations_memcg); P_SCHEDSTAT(se.statistics.nr_failed_migrations_hot); P_SCHEDSTAT(se.statistics.nr_forced_migrations); P_SCHEDSTAT(se.statistics.nr_wakeups); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ba5a67139d57..5d0758e78b96 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6701,6 +6701,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag); } else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? 
*/ /* Fast path */ + int pnid = memcg_preferred_nid(p, 0); + + if (pnid != NUMA_NO_NODE && pnid != cpu_to_node(new_cpu)) + new_cpu = prev_cpu; new_cpu = select_idle_sibling(p, prev_cpu, new_cpu); @@ -7404,12 +7408,36 @@ static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env) return dst_weight < src_weight; } +static inline bool memcg_migrate_allow(struct task_struct *p, + struct lb_env *env) +{ + int src_nid, dst_nid, pnid; + + /* failed too much could imply balancing broken, now be a good boy */ + if (env->sd->nr_balance_failed > env->sd->cache_nice_tries) + return true; + + src_nid = cpu_to_node(env->src_cpu); + dst_nid = cpu_to_node(env->dst_cpu); + + pnid = memcg_preferred_nid(p, 0); + if (pnid != dst_nid && pnid == src_nid) + return false; + + return true; +} #else static inline int migrate_degrades_locality(struct task_struct *p, struct lb_env *env) { return -1; } + +static inline bool memcg_migrate_allow(struct task_struct *p, + struct lb_env *env) +{ + return true; +} #endif /* @@ -7470,6 +7498,11 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) return 0; } + if (!memcg_migrate_allow(p, env)) { + schedstat_inc(p->se.statistics.nr_failed_migrations_memcg); + return 0; + } + /* * Aggressive migration if: * 1) destination numa is preferred diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 2614ce725a63..c01e1bb22477 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1523,6 +1523,9 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) */ page_locked = trylock_page(page); target_nid = mpol_misplaced(page, vma, haddr); + + target_nid = memcg_migrate_prep(target_nid, page_nid); + if (target_nid == NUMA_NO_NODE) { /* If the page was locked, there are no parallel migrations */ if (page_locked) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 91bcd71fc38a..f1cb1e726430 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3452,6 +3452,79 @@ void memcg_stat_numa_update(struct task_struct *p) this_cpu_inc(memcg->stat_numa->exectime); rcu_read_unlock(); } + +static s64 memcg_numa_preferred_read_s64(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(css); + + return memcg->numa_preferred; +} + +static int memcg_numa_preferred_write_s64(struct cgroup_subsys_state *css, + struct cftype *cft, s64 val) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(css); + + if (val != NUMA_NO_NODE && !node_isset(val, node_possible_map)) + return -EINVAL; + + mutex_lock(&memcg->numa_mutex); + memcg->numa_preferred = val; + mutex_unlock(&memcg->numa_mutex); + + return 0; +} + +int memcg_preferred_nid(struct task_struct *p, gfp_t gfp) +{ + int preferred_nid = NUMA_NO_NODE; + + if (!mem_cgroup_disabled() && + !in_interrupt() && + !(gfp & __GFP_THISNODE)) { + struct mem_cgroup *memcg; + + rcu_read_lock(); + memcg = mem_cgroup_from_task(p); + if (memcg) + preferred_nid = memcg->numa_preferred; + rcu_read_unlock(); + } + + return preferred_nid; +} + +int memcg_migrate_prep(int target_nid, int page_nid) +{ + bool ret = false; + unsigned int cookie; + int preferred_nid = memcg_preferred_nid(current, 0); + + if (preferred_nid == NUMA_NO_NODE) + return target_nid; + + do { + cookie = read_mems_allowed_begin(); + ret = node_isset(preferred_nid, current->mems_allowed); + } while (read_mems_allowed_retry(cookie)); + + if (ret) + return page_nid == preferred_nid ? 
NUMA_NO_NODE : preferred_nid; + + return target_nid; +} + +struct page *alloc_page_numa_preferred(gfp_t gfp, unsigned int order) +{ + int pnid = memcg_preferred_nid(current, gfp); + + if (pnid == NUMA_NO_NODE || !node_isset(pnid, current->mems_allowed)) + return NULL; + + return __alloc_pages_node(pnid, gfp, order); +} + #endif /* Universal VM events cgroup1 shows, original sort order */ @@ -4309,6 +4382,13 @@ static struct cftype mem_cgroup_legacy_files[] = { .name = "numa_stat", .seq_show = memcg_numa_stat_show, }, +#endif +#ifdef CONFIG_NUMA_BALANCING + { + .name = "numa_preferred", + .read_s64 = memcg_numa_preferred_read_s64, + .write_s64 = memcg_numa_preferred_write_s64, + }, #endif { .name = "kmem.limit_in_bytes", @@ -4529,6 +4609,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void) memcg->stat_numa = alloc_percpu(struct memcg_stat_numa); if (!memcg->stat_numa) goto fail; + mutex_init(&memcg->numa_mutex); + memcg->numa_preferred = NUMA_NO_NODE; #endif for_each_node(node) diff --git a/mm/memory.c b/mm/memory.c index fb0c1d940d36..98d988ca717c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -70,6 +70,7 @@ #include #include #include +#include #include #include @@ -3675,6 +3676,9 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) target_nid = numa_migrate_prep(page, vma, vmf->address, page_nid, &flags); pte_unmap_unlock(vmf->pte, vmf->ptl); + + target_nid = memcg_migrate_prep(target_nid, page_nid); + if (target_nid == NUMA_NO_NODE) { put_page(page); goto out; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index af171ccb56a2..6513504373b4 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2031,6 +2031,10 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, pol = get_vma_policy(vma, addr); + page = alloc_page_numa_preferred(gfp, order); + if (page) + goto out; + if (pol->mode == MPOL_INTERLEAVE) { unsigned nid; From patchwork Mon Apr 22 02:14:48 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: =?utf-8?b?546L6LSH?= X-Patchwork-Id: 10910569 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1539E922 for ; Mon, 22 Apr 2019 02:14:56 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 055E1200CB for ; Mon, 22 Apr 2019 02:14:56 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id EDA9127CF3; Mon, 22 Apr 2019 02:14:55 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-1.9 required=2.0 tests=BAYES_00,FROM_EXCESS_BASE64, MAILING_LIST_MULTI,RCVD_IN_DNSWL_NONE,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 51B9A200CB for ; Mon, 22 Apr 2019 02:14:55 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 785666B000A; Sun, 21 Apr 2019 22:14:54 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 7335D6B000C; Sun, 21 Apr 2019 22:14:54 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 626216B000D; Sun, 21 Apr 2019 22:14:54 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: 
Subject: [RFC PATCH 4/5] numa: introduce numa balancer infrastructure
From: 王贇 (Michael Wang) <yun.wang@linux.alibaba.com>
To: Peter Zijlstra, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com, Ingo Molnar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Date: Mon, 22 Apr 2019 10:14:48 +0800

Now that we have ways to estimate and adjust the NUMA preferred node for each
memcg, the next problem is how to use them.

Usually one binds workloads with cpuset.cpus, combined with cpuset.mems or,
better, a memory policy, to achieve the NUMA bonus. However, in complicated
scenarios such as mixed workload types or cpu-share based isolation, this kind
of manual administration quickly becomes unmanageable; what we need is a way
to gain the NUMA bonus automatically, maybe not the maximum, but as much as
possible.

This patch introduces a basic API for a kernel module to do such NUMA
adjustment; the numa balancer module that follows will use it to try to gain
as much NUMA bonus as possible, automatically.
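Before the API summary below, here is a minimal, illustrative sketch (not part
of this patch) of a client module registering the init/exit hooks; it only
uses the calls introduced by this patch, and the numa balancer in patch 5/5 is
expected to plug in the same way:

#include <linux/module.h>
#include <linux/memcontrol.h>
#include <linux/numa.h>

/*
 * Called for every memcg that comes online (and for the ones that
 * already exist when the callback is registered).
 */
static void demo_memcg_init(struct mem_cgroup *memcg)
{
	pr_info("numa demo: memcg %p online\n", memcg);
}

/* Called before a memcg goes offline; undo anything we configured. */
static void demo_memcg_exit(struct mem_cgroup *memcg)
{
	config_numa_preferred(memcg, NUMA_NO_NODE);
}

static struct memcg_callback demo_cb = {
	.init = demo_memcg_init,
	.exit = demo_memcg_exit,
};

static int __init demo_init(void)
{
	return register_memcg_callback(&demo_cb);
}

static void __exit demo_exit(void)
{
	unregister_memcg_callback(&demo_cb);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");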
API including: * numa preferred control * memcg callback hook * memcg per-node page number acquire Signed-off-by: Michael Wang --- include/linux/memcontrol.h | 26 ++++++++++++ mm/memcontrol.c | 101 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 127 insertions(+) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0fd5eeb27c4f..7456b862d5a9 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -200,6 +200,11 @@ struct memcg_stat_numa { u64 exectime; }; +struct memcg_callback { + void (*init)(struct mem_cgroup *memcg); + void (*exit)(struct mem_cgroup *memcg); +}; + #endif #if defined(CONFIG_SMP) struct memcg_padding { @@ -337,6 +342,8 @@ struct mem_cgroup { struct memcg_stat_numa __percpu *stat_numa; s64 numa_preferred; struct mutex numa_mutex; + void *numa_private; + struct list_head numa_list; #endif struct mem_cgroup_per_node *nodeinfo[0]; @@ -851,6 +858,10 @@ extern void memcg_stat_numa_update(struct task_struct *p); extern int memcg_migrate_prep(int target_nid, int page_nid); extern int memcg_preferred_nid(struct task_struct *p, gfp_t gfp); extern struct page *alloc_page_numa_preferred(gfp_t gfp, unsigned int order); +extern int register_memcg_callback(void *cb); +extern int unregister_memcg_callback(void *cb); +extern void config_numa_preferred(struct mem_cgroup *memcg, int nid); +extern u64 memcg_numa_pages(struct mem_cgroup *memcg, int nid, u32 mask); #else static inline void memcg_stat_numa_update(struct task_struct *p) { @@ -868,6 +879,21 @@ static inline struct page *alloc_page_numa_preferred(gfp_t gfp, { return NULL; } +static inline int register_memcg_callback(void *cb) +{ + return -EINVAL; +} +static inline int unregister_memcg_callback(void *cb) +{ + return -EINVAL; +} +static inline void config_numa_preferred(struct mem_cgroup *memcg, int nid) +{ +} +static inline u64 memcg_numa_pages(struct mem_cgroup *memcg, int nid, u32 mask) +{ + return 0; +} #endif #else /* CONFIG_MEMCG */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f1cb1e726430..dc232ecc904f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3525,6 +3525,102 @@ struct page *alloc_page_numa_preferred(gfp_t gfp, unsigned int order) return __alloc_pages_node(pnid, gfp, order); } +static struct memcg_callback *memcg_cb; + +static LIST_HEAD(memcg_cb_list); +static DEFINE_MUTEX(memcg_cb_mutex); + +int register_memcg_callback(void *cb) +{ + int ret = 0; + + mutex_lock(&memcg_cb_mutex); + if (memcg_cb || !cb) { + ret = -EINVAL; + goto out; + } + + memcg_cb = (struct memcg_callback *)cb; + if (memcg_cb->init) { + struct mem_cgroup *memcg; + + list_for_each_entry(memcg, &memcg_cb_list, numa_list) + memcg_cb->init(memcg); + } + +out: + mutex_unlock(&memcg_cb_mutex); + return ret; +} +EXPORT_SYMBOL(register_memcg_callback); + +int unregister_memcg_callback(void *cb) +{ + int ret = 0; + + mutex_lock(&memcg_cb_mutex); + if (!memcg_cb || memcg_cb != cb) { + ret = -EINVAL; + goto out; + } + + if (memcg_cb->exit) { + struct mem_cgroup *memcg; + + list_for_each_entry(memcg, &memcg_cb_list, numa_list) + memcg_cb->exit(memcg); + } + memcg_cb = NULL; + +out: + mutex_unlock(&memcg_cb_mutex); + return ret; +} +EXPORT_SYMBOL(unregister_memcg_callback); + +void config_numa_preferred(struct mem_cgroup *memcg, int nid) +{ + mutex_lock(&memcg->numa_mutex); + memcg->numa_preferred = nid; + mutex_unlock(&memcg->numa_mutex); +} +EXPORT_SYMBOL(config_numa_preferred); + +u64 memcg_numa_pages(struct mem_cgroup *memcg, int nid, u32 mask) +{ + if (nid == NUMA_NO_NODE) + return 
mem_cgroup_nr_lru_pages(memcg, mask); + else + return mem_cgroup_node_nr_lru_pages(memcg, nid, mask); +} +EXPORT_SYMBOL(memcg_numa_pages); + +static void memcg_online_callback(struct mem_cgroup *memcg) +{ + mutex_lock(&memcg_cb_mutex); + list_add_tail(&memcg->numa_list, &memcg_cb_list); + if (memcg_cb && memcg_cb->init) + memcg_cb->init(memcg); + mutex_unlock(&memcg_cb_mutex); +} + +static void memcg_offline_callback(struct mem_cgroup *memcg) +{ + mutex_lock(&memcg_cb_mutex); + if (memcg_cb && memcg_cb->exit) + memcg_cb->exit(memcg); + list_del_init(&memcg->numa_list); + mutex_unlock(&memcg_cb_mutex); +} + +#else + +static void memcg_online_callback(struct mem_cgroup *memcg) +{} + +static void memcg_offline_callback(struct mem_cgroup *memcg) +{} + #endif /* Universal VM events cgroup1 shows, original sort order */ @@ -4719,6 +4815,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css) /* Online state pins memcg ID, memcg ID pins CSS */ refcount_set(&memcg->id.ref, 1); css_get(css); + + memcg_online_callback(memcg); + return 0; } @@ -4727,6 +4826,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css) struct mem_cgroup *memcg = mem_cgroup_from_css(css); struct mem_cgroup_event *event, *tmp; + memcg_offline_callback(memcg); + /* * Unregister events and notify userspace. * Notify userspace about cgroup removing only after rmdir of cgroup From patchwork Mon Apr 22 02:21:17 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: =?utf-8?b?546L6LSH?= X-Patchwork-Id: 10910571 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 019CA1575 for ; Mon, 22 Apr 2019 02:21:29 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id DCF99286C4 for ; Mon, 22 Apr 2019 02:21:29 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id D089C286DA; Mon, 22 Apr 2019 02:21:29 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-1.4 required=2.0 tests=BAYES_00,FROM_EXCESS_BASE64, FUZZY_AMBIEN,MAILING_LIST_MULTI,RCVD_IN_DNSWL_NONE,UNPARSEABLE_RELAY autolearn=no version=3.3.1 Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5212C286C4 for ; Mon, 22 Apr 2019 02:21:28 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 4C2466B0003; Sun, 21 Apr 2019 22:21:27 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 4497C6B0006; Sun, 21 Apr 2019 22:21:27 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 3107F6B0007; Sun, 21 Apr 2019 22:21:27 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from mail-io1-f69.google.com (mail-io1-f69.google.com [209.85.166.69]) by kanga.kvack.org (Postfix) with ESMTP id 07D776B0003 for ; Sun, 21 Apr 2019 22:21:27 -0400 (EDT) Received: by mail-io1-f69.google.com with SMTP id e126so9550651ioa.8 for ; Sun, 21 Apr 2019 19:21:27 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-original-authentication-results:x-gm-message-state:subject:from 
Subject: [RFC PATCH 5/5] numa: numa balancer
From: 王贇 (Michael Wang) <yun.wang@linux.alibaba.com>
To: Peter Zijlstra, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com, Ingo Molnar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Date: Mon, 22 Apr 2019 10:21:17 +0800

The numa balancer is a module which tries to automatically adjust the NUMA
balancing settings of each memory cgroup, to gain as much NUMA bonus as
possible. For each memory cgroup, the work is processed in two stages:

In stage 1 we check the cgroup's exectime and memory topology to see whether
there is a candidate node worth settling down on; once we find one, we move
on to stage 2.

In stage 2 we try to settle the cgroup down as far as possible by preferring
the candidate node; if the node is no longer suitable, or locality keeps
turning down, we reset everything and a new round begins.

Decisions are made in find_candidate_nid(), should_prefer() and keep_prefer(),
which pick a candidate node, check whether we are allowed to prefer it, and
decide whether to keep the preference.

Tested on a box with 96 CPUs using sysbench-mysql-oltp_read_write: 4 mysqld
instances were created and attached to 4 cgroups, then 4 sysbench instances
were attached to the corresponding cgroups to test MySQL with the
oltp_read_write script. Average events per second:

                                origin     balancer
  4 instances each 12 threads   5241.08    5375.59    +2.50%
  4 instances each 24 threads   7497.29    7820.73    +4.13%
  4 instances each 36 threads   8985.44    9317.04    +3.55%
  4 instances each 48 threads   9716.50    9982.60    +2.66%

Other benchmarks like dbench, pgbench and perf bench numa were also tested,
with different parameters and numbers of instances/threads; most cases show a
bonus, some show an acceptable regression, and some show no change.
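For illustration, a small userspace sketch of the stage-1 candidate decision
described above; it is not the module's actual find_candidate_nid(), just a
simplified model of the idea. The threshold mirrors the candidate_wmark
module parameter's default of 60, and the inputs reuse the example numbers
from patch 2/5:

#include <stdio.h>

#define NR_NODES	2
#define CANDIDATE_WMARK	60	/* percent, the module parameter's default */

/*
 * Simplified model: a node becomes the candidate once it holds more
 * than CANDIDATE_WMARK percent of the cgroup's execution time or of
 * its pages.
 */
static int find_candidate(const unsigned long long *exectime,
			  const unsigned long long *pages)
{
	unsigned long long esum = 0, psum = 0;
	int nid;

	for (nid = 0; nid < NR_NODES; nid++) {
		esum += exectime[nid];
		psum += pages[nid];
	}
	for (nid = 0; nid < NR_NODES; nid++) {
		if (esum && exectime[nid] * 100 / esum > CANDIDATE_WMARK)
			return nid;
		if (psum && pages[nid] * 100 / psum > CANDIDATE_WMARK)
			return nid;
	}
	return -1;	/* no candidate yet, stay in stage 1 */
}

int main(void)
{
	/* example numbers from patch 2/5 */
	unsigned long long exectime[NR_NODES] = { 24399843ULL, 27865444ULL };
	unsigned long long pages[NR_NODES]    = { 6849ULL, 3928327ULL };

	printf("candidate node: %d\n", find_candidate(exectime, pages));
	return 0;
}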
TODO:
  * improve the logic to address the regression cases
  * find a way, maybe, to handle the page cache left on remote nodes
  * find more scenarios which could gain benefit

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 drivers/Makefile             |   1 +
 drivers/numa/Makefile        |   1 +
 drivers/numa/numa_balancer.c | 715 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 717 insertions(+)
 create mode 100644 drivers/numa/Makefile
 create mode 100644 drivers/numa/numa_balancer.c

diff --git a/drivers/Makefile b/drivers/Makefile
index c61cde554340..f07936b03870 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -187,3 +187,4 @@ obj-$(CONFIG_UNISYS_VISORBUS)	+= visorbus/
 obj-$(CONFIG_SIOX)		+= siox/
 obj-$(CONFIG_GNSS)		+= gnss/
 obj-$(CONFIG_INTERCONNECT)	+= interconnect/
+obj-$(CONFIG_NUMA_BALANCING)	+= numa/
diff --git a/drivers/numa/Makefile b/drivers/numa/Makefile
new file mode 100644
index 000000000000..acf8a4083333
--- /dev/null
+++ b/drivers/numa/Makefile
@@ -0,0 +1 @@
+obj-m += numa_balancer.o
diff --git a/drivers/numa/numa_balancer.c b/drivers/numa/numa_balancer.c
new file mode 100644
index 000000000000..25bbe08c82a2
--- /dev/null
+++ b/drivers/numa/numa_balancer.c
@@ -0,0 +1,715 @@
+/*
+ * NUMA Balancer
+ *
+ * Copyright (C) 2019 Alibaba Group Holding Limited.
+ * Author: Michael Wang <yun.wang@linux.alibaba.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/workqueue.h>
+#include <linux/memcontrol.h>
+#include <linux/mm.h>
+#include <linux/kernel_stat.h>
+#include <linux/topology.h>
+
+static unsigned int debug_level;
+module_param(debug_level, uint, 0644);
+MODULE_PARM_DESC(debug_level, "1 to print decisions, 2 to print both decisions and node info");
+
+static int prefer_level = 10;
+module_param(prefer_level, int, 0644);
+MODULE_PARM_DESC(prefer_level, "stop numa prefer after this many consecutive downturns, 0 means no prefer");
+
+static unsigned int locality_level = PERCENT_70_79;
+module_param(locality_level, uint, 0644);
+MODULE_PARM_DESC(locality_level, "consider locality as good when above this interval");
+
+static unsigned long period_max = (600 * HZ);
+module_param(period_max, ulong, 0644);
+MODULE_PARM_DESC(period_max, "maximum period between each stage");
+
+static unsigned long period_min = (5 * HZ);
+module_param(period_min, ulong, 0644);
+MODULE_PARM_DESC(period_min, "minimum period between each stage");
+
+static unsigned int cpu_high_wmark = 100;
+module_param(cpu_high_wmark, uint, 0644);
+MODULE_PARM_DESC(cpu_high_wmark, "respect the execution percent rather than memory percent when above this cpu usage");
+
+static unsigned int cpu_low_wmark = 10;
+module_param(cpu_low_wmark, uint, 0644);
+MODULE_PARM_DESC(cpu_low_wmark, "consider cgroup as active when above this cpu usage");
+
+static unsigned int free_low_wmark = 10;
+module_param(free_low_wmark, uint, 0644);
+MODULE_PARM_DESC(free_low_wmark, "consider node as exhausted when below this free percent");
+
+static unsigned int candidate_wmark = 60;
+module_param(candidate_wmark, uint, 0644);
+MODULE_PARM_DESC(candidate_wmark, "consider node as candidate when above this execution time or memory percent");
+
+static unsigned int settled_wmark = 90;
+module_param(settled_wmark, uint, 0644);
+MODULE_PARM_DESC(settled_wmark, "consider cgroup settled down on a node when above this execution time and memory percent, or locality");
+
+/*
+ * STAGE_1 -- no preferred node
+ *
+ * STAGE_2 -- preferred node set
+ *
+ * Check handlers for details.
+ */ +enum { + STAGE_1, + STAGE_2, + NR_STAGES, +}; + +struct node_info { + u64 anon; + u64 pages; + u64 exectime; + u64 exectime_history; + + u64 ticks; + u64 ticks_history; + u64 idle; + u64 idle_history; + + u64 total_pages; + u64 free_pages; + + unsigned int exectime_percent; + unsigned int last_exectime_percent; + unsigned int anon_percent; + unsigned int pages_percent; + unsigned int free_percent; + unsigned int idle_percent; + unsigned int cpu_usage; +}; + +struct numa_balancer { + struct delayed_work dwork; + struct mem_cgroup *memcg; + struct node_info *ni; + + unsigned long period; + unsigned long jstamp; + + u64 locality_good; + u64 locality_sum; + u64 anon_sum; + u64 pages_sum; + u64 exectime_sum; + u64 free_pages_sum; + + unsigned int stage; + unsigned int cpu_usage_sum; + unsigned int locality_score; + unsigned int downturn; + + int anon_max_nid; + int pages_max_nid; + int exectime_max_nid; + int candidate_nid; +}; + +static struct workqueue_struct *numa_balancer_wq; + +/* + * Kernel increase the locality counter when hit memcg's task running + * on each tick, classified according to the percentage of local page + * access. + * + * This can representing the NUMA benefit, higher the locality lower + * the memory access latency, thus we calculate a score here to tell + * how well the memcg is playing with NUMA. + * + * The score is simplly the percentage of ticks above locality_level, + * which usually from 0 to 100, -1 means no ticks. + * + * For example, score 90 with locality_level 7 means there are 90 + * percentage of the ticks hit memcg's tasks running above 79% local + * page access on numa page fault. + */ +static inline void update_locality_score(struct numa_balancer *nb) +{ + int i, cpu; + u64 good, sum, tmp; + unsigned int last_locality_score = nb->locality_score; + struct memcg_stat_numa *stat = nb->memcg->stat_numa; + + nb->locality_score = -1; + + for (good = sum = i = 0; i < NR_NL_INTERVAL; i++) { + for_each_possible_cpu(cpu) { + u64 val = per_cpu(stat->locality[i], cpu); + + good += i > locality_level ? 
val : 0; + sum += val; + } + } + + tmp = nb->locality_good; + nb->locality_good = good; + good -= tmp; + + tmp = nb->locality_sum; + nb->locality_sum = sum; + sum -= tmp; + + if (sum) + nb->locality_score = (good * 100) / sum; + + if (nb->locality_score == -1 || + nb->locality_score > settled_wmark || + nb->locality_score > last_locality_score) + nb->downturn = 0; + else + nb->downturn++; +} + +static inline void update_numa_info(struct numa_balancer *nb) +{ + int nid; + unsigned long period_in_jiffies = jiffies - nb->jstamp; + struct memcg_stat_numa *stat = nb->memcg->stat_numa; + + if (period_in_jiffies <= 0) + return; + + nb->anon_sum = nb->pages_sum = nb->exectime_sum = 0; + nb->anon_max_nid = nb->pages_max_nid = nb->exectime_max_nid = 0; + + nb->free_pages_sum = 0; + + for_each_online_node(nid) { + int cpu, zid; + u64 idle_curr, ticks_curr, exectime_curr; + struct node_info *nip = &nb->ni[nid]; + + nip->total_pages = nip->free_pages = 0; + for (zid = 0; zid <= ZONE_MOVABLE; zid++) { + pg_data_t *pgdat = NODE_DATA(nid); + struct zone *z = &pgdat->node_zones[zid]; + + nip->total_pages += zone_managed_pages(z); + nip->free_pages += zone_page_state(z, NR_FREE_PAGES); + } + + idle_curr = ticks_curr = exectime_curr = 0; + for_each_cpu(cpu, cpumask_of_node(nid)) { + u64 *cstat = kcpustat_cpu(cpu).cpustat; + + /* not accurate but fine */ + idle_curr += cstat[CPUTIME_IDLE]; + ticks_curr += + cstat[CPUTIME_USER] + cstat[CPUTIME_NICE] + + cstat[CPUTIME_SYSTEM] + cstat[CPUTIME_IDLE] + + cstat[CPUTIME_IOWAIT] + cstat[CPUTIME_IRQ] + + cstat[CPUTIME_SOFTIRQ] + cstat[CPUTIME_STEAL]; + + exectime_curr += per_cpu(stat->exectime, cpu); + } + + nip->ticks = ticks_curr - nip->ticks_history; + nip->ticks_history = ticks_curr; + + nip->idle = idle_curr - nip->idle_history; + nip->idle_history = idle_curr; + + nip->idle_percent = nip->idle * 100 / nip->ticks; + + nip->exectime = exectime_curr - nip->exectime_history; + nip->exectime_history = exectime_curr; + + nip->anon = memcg_numa_pages(nb->memcg, nid, LRU_ALL_ANON); + nip->pages = memcg_numa_pages(nb->memcg, nid, LRU_ALL); + + if (nip->anon > nb->ni[nb->anon_max_nid].anon) + nb->anon_max_nid = nid; + + if (nip->pages > nb->ni[nb->pages_max_nid].pages) + nb->pages_max_nid = nid; + + if (nip->exectime > nb->ni[nb->exectime_max_nid].exectime) + nb->exectime_max_nid = nid; + + nb->anon_sum += nip->anon; + nb->pages_sum += nip->pages; + nb->exectime_sum += nip->exectime; + nb->free_pages_sum += nip->free_pages; + } + + for_each_online_node(nid) { + struct node_info *nip = &nb->ni[nid]; + + nip->last_exectime_percent = nip->exectime_percent; + nip->exectime_percent = nb->exectime_sum ? + nip->exectime * 100 / nb->exectime_sum : 0; + + nip->anon_percent = nb->anon_sum ? + nip->anon * 100 / nb->anon_sum : 0; + + nip->pages_percent = nb->pages_sum ? + nip->pages * 100 / nb->pages_sum : 0; + + nip->free_percent = nip->total_pages ? + nip->free_pages * 100 / nip->total_pages : 0; + + nip->cpu_usage = nip->exectime * 100 / period_in_jiffies; + } + + nb->cpu_usage_sum = nb->exectime_sum * 100 / period_in_jiffies; + nb->jstamp = jiffies; +} + +/* + * We consider a node as candidate when settle down is more easier, + * which means page and task migration should as less as possible. + * + * However, usually it's impossible to find an ideal candidate since + * kernel have no idea about the cgroup numa affinity, thus we need + * to pick out the most likely winner and play gambling. 
+ */ +static inline int find_candidate_nid(struct numa_balancer *nb) +{ + int cnid = -1; + int enid = nb->exectime_max_nid; + int pnid = nb->pages_max_nid; + int anid = nb->anon_max_nid; + struct node_info *nip; + + /* + * settled execution percent could imply the only available + * node for running, respect this firstly. + */ + nip = &nb->ni[enid]; + if (nb->cpu_usage_sum > cpu_high_wmark && + nip->exectime_percent > settled_wmark) { + cnid = enid; + goto out; + } + + /* + * Migrate page cost a lot, if the node is available for + * running and most of the pages reside there, just pick it. + */ + nip = &nb->ni[pnid]; + if (nip->exectime_percent && + nip->pages_percent > candidate_wmark) { + cnid = pnid; + goto out; + } + + /* + * Now pick the node when most of the execution time and + * anonymous pages already there. + */ + nip = &nb->ni[anid]; + if (nip->exectime_percent > candidate_wmark && + nip->anon_percent > candidate_wmark) { + cnid = anid; + goto out; + } + + /* + * No strong hint so we reach here, respect the load balancing + * and play gambling. + */ + nip = &nb->ni[enid]; + if (nb->cpu_usage_sum > cpu_high_wmark && + nip->exectime_percent > candidate_wmark) { + cnid = enid; + goto out; + } + +out: + nb->candidate_nid = cnid; + return cnid; +} + +static inline unsigned long clip_period(unsigned long period) +{ + if (period < period_min) + return period_min; + if (period > period_max) + return period_max; + return period; +} + +static inline void increase_period(struct numa_balancer *nb) +{ + nb->period = clip_period(nb->period * 2); +} + +static inline void decrease_period(struct numa_balancer *nb) +{ + nb->period = clip_period(nb->period / 2); +} + +static inline bool is_zombie(struct numa_balancer *nb) +{ + return (nb->cpu_usage_sum < cpu_low_wmark); +} + +static inline bool is_settled(struct numa_balancer *nb, int nid) +{ + return (nb->ni[nid].exectime_percent > settled_wmark && + nb->ni[nid].pages_percent > settled_wmark); +} + +static inline void +__memcg_printk(struct mem_cgroup *memcg, const char *fmt, ...) +{ + struct va_format vaf; + va_list args; + const char *name = memcg->css.cgroup->kn->name; + + if (!debug_level) + return; + + if (*name == '\0') + name = "root"; + + va_start(args, fmt); + vaf.fmt = fmt; + vaf.va = &args; + pr_notice("%s: [%s] %pV", + KBUILD_MODNAME, name, &vaf); + va_end(args); +} + +static inline void +__nb_printk(struct numa_balancer *nb, const char *fmt, ...) +{ + int nid; + struct va_format vaf; + va_list args; + const char *name = nb->memcg->css.cgroup->kn->name; + + if (!debug_level) + return; + + if (*name == '\0') + name = "root"; + + va_start(args, fmt); + vaf.fmt = fmt; + vaf.va = &args; + pr_notice("%s: [%s][stage %d] cpu %d%% %pV", + KBUILD_MODNAME, name, nb->stage, nb->cpu_usage_sum, &vaf); + va_end(args); + + if (debug_level < 2) + return; + + for_each_online_node(nid) { + struct node_info *nip = &nb->ni[nid]; + + pr_notice("%s: [%s][stage %d]\tnid %d exectime %llu[%d%%] anon %llu[%d%%] pages %llu[%d%%] idle [%d%%] free [%d%%]\n", + KBUILD_MODNAME, name, nb->stage, + nid, nip->exectime, nip->exectime_percent, + nip->anon, nip->anon_percent, + nip->pages, nip->pages_percent, nip->idle_percent, + nip->free_percent); + } +} + +#define nb_printk(fmt...) __nb_printk(nb, fmt) +#define memcg_printk(fmt...) 
__memcg_printk(memcg, fmt) + +static inline void reset_stage(struct numa_balancer *nb) +{ + nb->stage = STAGE_1; + nb->period = period_min; + nb->candidate_nid = NUMA_NO_NODE; + nb->locality_score = -1; + nb->downturn = 0; + + config_numa_preferred(nb->memcg, -1); +} + +/* + * In most of the cases, we need to give kernel the hint of memcg + * preference in order to settle down on a particular node, the benefit + * is obviously while the risk too. + * + * Prefer behaviour could cause global influence and become a trigger + * for other memcg to make their own decision, ideally different memcg + * workloads will change their resources then settle down on different + * nodes, make resource balanced again and gain maximum numa benefit. + */ +static inline bool should_prefer(struct numa_balancer *nb, int cnid) +{ + struct node_info *cnip = &nb->ni[cnid]; + u64 cpu_left, cpu_to_move, mem_left, mem_to_move; + + if (nb->downturn >= prefer_level || + cnip->free_percent < free_low_wmark || + cnip->idle_percent < free_low_wmark) + return false; + + /* + * We don't want to cause starving on a particular node, + * while there are race conditions and it's impossible to + * predict the resource requirement in future, so risk can't + * be avoided. + * + * Fortunately kernel won't respect numa prefer anymore if + * things going to get worse :-P + */ + cpu_left = cpumask_weight(cpumask_of_node(cnid)) * 100 * + (cnip->idle_percent - free_low_wmark); + cpu_to_move = nb->cpu_usage_sum - cnip->cpu_usage; + if (cpu_left < cpu_to_move) + return false; + + mem_left = cnip->total_pages * + (cnip->free_percent - free_low_wmark); + mem_to_move = nb->pages_sum - cnip->pages; + if (mem_left < mem_to_move) + return false; + + return true; +} + +static void STAGE_1_handler(struct numa_balancer *nb) +{ + int cnid; + struct node_info *cnip; + + if (is_zombie(nb)) { + reset_stage(nb); + increase_period(nb); + nb_printk("zombie, silent for %lu seconds\n", nb->period / HZ); + return; + } + + update_locality_score(nb); + + cnid = find_candidate_nid(nb); + if (cnid == NUMA_NO_NODE) { + increase_period(nb); + nb_printk("no candidate locality %d%%, silent for %lu seconds\n", + nb->locality_score, nb->period / HZ); + return; + } + + cnip = &nb->ni[cnid]; + if (is_settled(nb, cnid)) { + increase_period(nb); + nb_printk("settle down node %d exectime %d%% pages %d%% locality %d%%, silent for %lu seconds\n", + cnid, cnip->exectime_percent, + cnip->pages_percent, nb->locality_score, + nb->period / HZ); + return; + } + + if (!should_prefer(nb, cnid)) { + increase_period(nb); + nb_printk("discard node %d exectime %d%% pages %d%% locality %d%% free %d%%, silent for %lu seconds\n", + cnid, cnip->exectime_percent, + cnip->pages_percent, nb->locality_score, + cnip->free_percent, nb->period / HZ); + return; + } + + nb_printk("prefer node %d exectime %d%% pages %d%% locality %d%% free %d%%, goto next stage\n", + cnid, cnip->exectime_percent, cnip->pages_percent, + nb->locality_score, cnip->free_percent); + + config_numa_preferred(nb->memcg, cnid); + + nb->stage++; + nb->period = period_min; +} + +/* + * A tough decision here, as soon as we giveup prefer the node, + * kernel will lost the hint on memcg CPU preference, in good case + * tasks will still running on the right node since numa balancing + * preferred, but no more guarantees. 
+ */ +static inline bool keep_prefer(struct numa_balancer *nb, int cnid) +{ + struct node_info *cnip = &nb->ni[cnid]; + + if (nb->downturn >= prefer_level) + return false; + + /* stop prefer a harsh node */ + if (cnip->free_percent < free_low_wmark || + cnip->idle_percent < free_low_wmark) + return false; + + if (nb->locality_score > settled_wmark || + cnip->exectime_percent > settled_wmark) + return true; + + if (cnip->exectime_percent > cnip->last_exectime_percent) + return true; + + /* + * kernel will make sure the balancing won't be broken, which + * means some task won't stay on the preferred node when + * balancing failed too much, imply that we should stop the + * prefer behaviour to avoid the possible cpu starving on + * the preferred node. + * + * Or maybe the current preferred node just haven't got enough + * available cpus for memcg anymore. + */ + if (cnip->exectime_percent < candidate_wmark || + nb->exectime_max_nid != cnid) + return false; + + return true; +} + +static void STAGE_2_handler(struct numa_balancer *nb) +{ + int cnid; + struct node_info *cnip; + + if (is_zombie(nb)) { + nb_printk("zombie, reset stage\n"); + reset_stage(nb); + return; + } + + cnid = nb->candidate_nid; + cnip = &nb->ni[cnid]; + + update_locality_score(nb); + + if (keep_prefer(nb, cnid)) { + if (is_settled(nb, cnid)) + increase_period(nb); + else + decrease_period(nb); + + nb_printk("tangled node %d exectime %d%% pages %d%% locality %d%% free %d%%, silent for %lu seconds\n", + cnid, cnip->exectime_percent, + cnip->pages_percent, nb->locality_score, + cnip->free_percent, nb->period / HZ); + return; + } + + nb_printk("giveup node %d exectime %d%% pages %d%% locality %d%% downturn %d free %d%%, reset stage\n", + cnid, cnip->exectime_percent, cnip->pages_percent, + nb->locality_score, nb->downturn, cnip->free_percent); + + reset_stage(nb); +} + +static void (*stage_handler[NR_STAGES])(struct numa_balancer *nb) = { + &STAGE_1_handler, + &STAGE_2_handler, +}; + +static void numa_balancer_workfn(struct work_struct *work) +{ + struct delayed_work *dwork = to_delayed_work(work); + struct numa_balancer *nb = + container_of(dwork, struct numa_balancer, dwork); + + update_numa_info(nb); + (stage_handler[nb->stage])(nb); + cond_resched(); + + queue_delayed_work(numa_balancer_wq, &nb->dwork, nb->period); +} + +static void memcg_init_handler(struct mem_cgroup *memcg) +{ + struct numa_balancer *nb = memcg->numa_private; + + if (!nb) { + nb = kzalloc(sizeof(struct numa_balancer), GFP_KERNEL); + if (!nb) { + pr_err("allocate balancer private failed\n"); + return; + } + + nb->ni = kcalloc(nr_online_nodes, sizeof(*nb->ni), GFP_KERNEL); + if (!nb->ni) { + pr_err("allocate balancer node info failed\n"); + kfree(nb); + return; + } + + nb->memcg = memcg; + memcg->numa_private = nb; + + INIT_DELAYED_WORK(&nb->dwork, numa_balancer_workfn); + } + + reset_stage(nb); + update_numa_info(nb); + update_locality_score(nb); + + queue_delayed_work(numa_balancer_wq, &nb->dwork, nb->period); + memcg_printk("NUMA Balancer On\n"); +} + +static void memcg_exit_handler(struct mem_cgroup *memcg) +{ + struct numa_balancer *nb = memcg->numa_private; + + if (nb) { + cancel_delayed_work_sync(&nb->dwork); + + kfree(nb->ni); + kfree(nb); + memcg->numa_private = NULL; + } + + config_numa_preferred(memcg, -1); + memcg_printk("NUMA Balancer Off\n"); +} + +struct memcg_callback cb = { + .init = memcg_init_handler, + .exit = memcg_exit_handler, +}; + +static int __init numa_balancer_init(void) +{ + if (nr_online_nodes < 2) { + pr_err("Single node arch 
doesn't need numa balancer\n");
+		return -EINVAL;
+	}
+
+	numa_balancer_wq = create_workqueue("numa_balancer");
+	if (!numa_balancer_wq) {
+		pr_err("Create workqueue failed\n");
+		return -ENOMEM;
+	}
+
+	if (register_memcg_callback(&cb) != 0) {
+		pr_err("Register memcg callback failed\n");
+		return -EINVAL;
+	}
+
+	pr_notice(KBUILD_MODNAME ": Initialization Done\n");
+	return 0;
+}
+
+static void __exit numa_balancer_exit(void)
+{
+	unregister_memcg_callback(&cb);
+	destroy_workqueue(numa_balancer_wq);
+
+	pr_notice(KBUILD_MODNAME ": Exit\n");
+}
+
+module_init(numa_balancer_init);
+module_exit(numa_balancer_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Michael Wang <yun.wang@linux.alibaba.com>");
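For reference, the score that update_locality_score() derives from the
per-cpu locality buckets can be reproduced with a small stand-alone
user-space program. The bucket counts below are invented for illustration,
and the delta handling against the previous round is left out:

	#include <stdio.h>

	/* Ten locality buckets: 0-9%, 10-19%, ..., 90-100% local page access. */
	#define NR_NL_INTERVAL	10

	int main(void)
	{
		/* invented tick counts per bucket, i.e. the delta since the last round */
		unsigned long long ticks[NR_NL_INTERVAL] = { 5, 0, 0, 5, 0, 0, 0, 10, 30, 50 };
		unsigned int locality_level = 7;	/* PERCENT_70_79: only buckets above it count */
		unsigned long long good = 0, sum = 0;
		int i;

		for (i = 0; i < NR_NL_INTERVAL; i++) {
			if (i > locality_level)
				good += ticks[i];
			sum += ticks[i];
		}

		if (sum)
			printf("locality score: %llu%%\n", good * 100 / sum);	/* prints 80% */

		return 0;
	}

A score of 80 here means 80% of the sampled ticks found the memcg's tasks
running with more than 79% local page access, which is the kind of value
the stage handlers compare against settled_wmark.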