From patchwork Wed Jul 3 03:28:10 2019
X-Patchwork-Submitter: 王贇 <yun.wang@linux.alibaba.com>
X-Patchwork-Id: 11028869
Subject: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality statistic
From: 王贇 <yun.wang@linux.alibaba.com>
To: Peter Zijlstra, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com, Ingo Molnar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, mcgrof@kernel.org, keescook@chromium.org, linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org
Message-ID: <3ac9b43a-cc80-01be-0079-df008a71ce4b@linux.alibaba.com>
In-Reply-To: <60b59306-5e36-e587-9145-e90657daec41@linux.alibaba.com>
Date: Wed, 3 Jul 2019 11:28:10 +0800

This patch introduces a NUMA locality statistic that tries to reflect the NUMA balancing efficiency of each memory cgroup.
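To make the mapping concrete, here is a minimal userspace-style sketch of how a local/remote fault ratio falls into the eight intervals described below; it mirrors the interval mapping used by memcg_stat_numa_update() in this patch, and is for illustration only (not part of the patch):

    /*
     * Illustrative only: map local/remote fault counts to one of the
     * eight locality intervals used by memory.numa_stat below.
     */
    #include <stdio.h>

    enum { PERCENT_0_29, PERCENT_30_39, PERCENT_40_49, PERCENT_50_59,
           PERCENT_60_69, PERCENT_70_79, PERCENT_80_89, PERCENT_90_100,
           NR_NL_INTERVAL };

    static int locality_interval(unsigned long local, unsigned long remote)
    {
        unsigned long idx;

        if (!local && !remote)
            return -1;      /* no samples, nothing to account */

        idx = (local * 10) / (local + remote);  /* 0..10 */
        if (idx < 3)
            return PERCENT_0_29;                /* 0%~29% share one slot */
        idx -= 2;
        return idx >= NR_NL_INTERVAL ? NR_NL_INTERVAL - 1 : idx;
    }

    int main(void)
    {
        printf("%d %d %d\n",
               locality_interval(1, 9),     /* 10%  -> PERCENT_0_29   */
               locality_interval(5, 5),     /* 50%  -> PERCENT_50_59  */
               locality_interval(10, 0));   /* 100% -> PERCENT_90_100 */
        return 0;
    }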
By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', we now see a new output line headed with 'locality', in the format:

locality 0%~29% 30%~39% 40%~49% 50%~59% 60%~69% 70%~79% 80%~89% 90%~100%

Each interval stands for the percentage of local page accesses a task saw on its last round of NUMA balancing, which we call the NUMA balancing locality. The number under each interval is the time, in milliseconds, that tasks of the cgroup have been running while their locality fell into that interval, for example:

locality 15393 21259 13023 44461 21247 17012 28496 145402

The first number means that tasks of this cgroup ran for 15393 ms while their locality was within 0%~29%. By monitoring the increments we can check whether the workload of a particular cgroup is doing well with NUMA: if most of the running time accumulates in the 0%~29% interval, something is wrong with your NUMA policy.

Signed-off-by: Michael Wang
--- include/linux/memcontrol.h | 36 +++++++++++++++++++++++++++++++ include/linux/sched.h | 8 ++++++- kernel/sched/debug.c | 7 ++++++ kernel/sched/fair.c | 9 ++++++++ mm/memcontrol.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 112 insertions(+), 1 deletion(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 2cbce1fe7780..0a30d14c9f43 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -174,6 +174,25 @@ enum memcg_kmem_state { KMEM_ONLINE, }; +#ifdef CONFIG_NUMA_BALANCING + +enum memcg_numa_locality_interval { + PERCENT_0_29, + PERCENT_30_39, + PERCENT_40_49, + PERCENT_50_59, + PERCENT_60_69, + PERCENT_70_79, + PERCENT_80_89, + PERCENT_90_100, + NR_NL_INTERVAL, +}; + +struct memcg_stat_numa { + u64 locality[NR_NL_INTERVAL]; +}; + +#endif #if defined(CONFIG_SMP) struct memcg_padding { char x[0]; @@ -313,6 +332,10 @@ struct mem_cgroup { struct list_head event_list; spinlock_t event_list_lock; +#ifdef CONFIG_NUMA_BALANCING + struct memcg_stat_numa __percpu *stat_numa; +#endif + struct mem_cgroup_per_node *nodeinfo[0]; /* WARNING: nodeinfo must be the last member here */ }; @@ -795,6 +818,14 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, void mem_cgroup_split_huge_fixup(struct page *head); #endif +#ifdef CONFIG_NUMA_BALANCING +extern void memcg_stat_numa_update(struct task_struct *p); +#else +static inline void memcg_stat_numa_update(struct task_struct *p) +{ +} +#endif + #else /* CONFIG_MEMCG */ #define MEM_CGROUP_ID_SHIFT 0 @@ -1131,6 +1162,11 @@ static inline void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { } + +static inline void memcg_stat_numa_update(struct task_struct *p) +{ +} + #endif /* CONFIG_MEMCG */ /* idx can be of type enum memcg_stat_item or node_stat_item */ diff --git a/include/linux/sched.h b/include/linux/sched.h index 907808f1acc5..eb26098de6ea 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1117,8 +1117,14 @@ struct task_struct { * scan window were remote/local or failed to migrate.
The task scan * period is adapted based on the locality of the faults with different * weights depending on whether they were shared or private faults + * + * 0 -- remote faults + * 1 -- local faults + * 2 -- page migration failure + * 3 -- remote page accessing + * 4 -- local page accessing */ - unsigned long numa_faults_locality[3]; + unsigned long numa_faults_locality[5]; unsigned long numa_pages_migrated; #endif /* CONFIG_NUMA_BALANCING */ diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index f7e4579e746c..473e6b7a1b8d 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -849,6 +849,13 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m) SEQ_printf(m, "current_node=%d, numa_group_id=%d\n", task_node(p), task_numa_group_id(p)); show_numa_stats(p, m); + SEQ_printf(m, "faults_locality local=%lu remote=%lu failed=%lu ", + p->numa_faults_locality[1], + p->numa_faults_locality[0], + p->numa_faults_locality[2]); + SEQ_printf(m, "lhit=%lu rhit=%lu\n", + p->numa_faults_locality[4], + p->numa_faults_locality[3]); mpol_put(pol); #endif } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 036be95a87e9..b32304817eeb 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -23,6 +23,7 @@ #include "sched.h" #include +#include /* * Targeted preemption latency for CPU-bound tasks: @@ -2449,6 +2450,12 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages; p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages; p->numa_faults_locality[local] += pages; + /* + * We want to have the real local/remote page access statistic + * here, so use 'mem_node' which is the real residential node of + * page after migrate_misplaced_page(). 
+ */ + p->numa_faults_locality[3 + !!(mem_node == numa_node_id())] += pages; } static void reset_ptenuma_scan(struct task_struct *p) @@ -2625,6 +2632,8 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr) if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work) return; + memcg_stat_numa_update(curr); + /* * Using runtime rather than walltime has the dual advantage that * we (mostly) drive the selection from busy threads and that the diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b3f67a6b6527..2edf3f5ac4b9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -58,6 +58,7 @@ #include #include #include +#include #include "internal.h" #include #include @@ -3562,10 +3563,53 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v) seq_putc(m, '\n'); } +#ifdef CONFIG_NUMA_BALANCING + seq_puts(m, "locality"); + for (nr = 0; nr < NR_NL_INTERVAL; nr++) { + int cpu; + u64 sum = 0; + + for_each_possible_cpu(cpu) + sum += per_cpu(memcg->stat_numa->locality[nr], cpu); + + seq_printf(m, " %u", jiffies_to_msecs(sum)); + } + seq_putc(m, '\n'); +#endif + return 0; } #endif /* CONFIG_NUMA */ +#ifdef CONFIG_NUMA_BALANCING + +void memcg_stat_numa_update(struct task_struct *p) +{ + struct mem_cgroup *memcg; + unsigned long remote = p->numa_faults_locality[3]; + unsigned long local = p->numa_faults_locality[4]; + unsigned long idx = -1; + + if (mem_cgroup_disabled()) + return; + + if (remote || local) { + idx = (local * 10) / (remote + local); + /* 0~29% in one slot for cache align; idx is unsigned */ + if (idx < 3) + idx = PERCENT_0_29; + else if ((idx -= 2) >= NR_NL_INTERVAL) + idx = NR_NL_INTERVAL - 1; + } + + rcu_read_lock(); + memcg = mem_cgroup_from_task(p); + if (idx != -1) + this_cpu_inc(memcg->stat_numa->locality[idx]); + rcu_read_unlock(); +} +#endif + static const unsigned int memcg1_stats[] = { MEMCG_CACHE, MEMCG_RSS, @@ -4641,6 +4685,9 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) for_each_node(node) free_mem_cgroup_per_node_info(memcg, node); +#ifdef CONFIG_NUMA_BALANCING + free_percpu(memcg->stat_numa); +#endif free_percpu(memcg->vmstats_percpu); free_percpu(memcg->vmstats_local); kfree(memcg); @@ -4679,6 +4726,12 @@ static struct mem_cgroup *mem_cgroup_alloc(void) if (!memcg->vmstats_percpu) goto fail; +#ifdef CONFIG_NUMA_BALANCING + memcg->stat_numa = alloc_percpu(struct memcg_stat_numa); + if (!memcg->stat_numa) + goto fail; +#endif + for_each_node(node) if (alloc_mem_cgroup_per_node_info(memcg, node)) goto fail;
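As a usage sketch for the interface added by this patch (not part of the patch itself), the program below samples the 'locality' line of memory.numa_stat twice and prints how much run time landed in each interval during the window; the CGROUP_PATH placeholder and the 10-second window are assumptions to adapt:

    /*
     * Usage sketch: follow the per-interval increments of the new
     * 'locality' line in memory.numa_stat. Path and window are examples.
     */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define NR_NL_INTERVAL 8

    static const char *path =
        "/sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat";

    static int read_locality(unsigned long long v[NR_NL_INTERVAL])
    {
        char line[1024];
        FILE *f = fopen(path, "r");

        if (!f)
            return -1;
        while (fgets(line, sizeof(line), f)) {
            if (!strncmp(line, "locality ", 9)) {
                sscanf(line + 9,
                       "%llu %llu %llu %llu %llu %llu %llu %llu",
                       &v[0], &v[1], &v[2], &v[3],
                       &v[4], &v[5], &v[6], &v[7]);
                fclose(f);
                return 0;
            }
        }
        fclose(f);
        return -1;
    }

    int main(void)
    {
        static const char *name[NR_NL_INTERVAL] = {
            "0%~29%", "30%~39%", "40%~49%", "50%~59%",
            "60%~69%", "70%~79%", "80%~89%", "90%~100%",
        };
        unsigned long long a[NR_NL_INTERVAL], b[NR_NL_INTERVAL];
        int i;

        if (read_locality(a))
            return 1;
        sleep(10);              /* sampling window */
        if (read_locality(b))
            return 1;

        for (i = 0; i < NR_NL_INTERVAL; i++)
            printf("%-8s %llu ms\n", name[i], b[i] - a[i]);
        return 0;
    }

If most of the delta accumulates in the first column, the cgroup's tasks are spending their time with poor locality, which is exactly the signal the commit message above suggests watching for.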
From patchwork Wed Jul 3 03:29:15 2019
X-Patchwork-Submitter: 王贇 <yun.wang@linux.alibaba.com>
X-Patchwork-Id: 11028877
Subject: [PATCH 2/4] numa: append per-node execution info in memory.numa_stat
From: 王贇 <yun.wang@linux.alibaba.com>
To: Peter Zijlstra, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com, Ingo Molnar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, mcgrof@kernel.org, keescook@chromium.org, linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org
Message-ID: <825ebaf0-9f71-bbe1-f054-7fa585d61af1@linux.alibaba.com>
In-Reply-To: <60b59306-5e36-e587-9145-e90657daec41@linux.alibaba.com>
Date: Wed, 3 Jul 2019 11:29:15 +0800

This patch introduces per-node execution information, to reflect NUMA efficiency.

By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', we now see a new output line headed with 'exectime', like:

exectime 311900 407166

which means the tasks of this cgroup executed for 311900 milliseconds on node 0 and for 407166 milliseconds on node 1.

Combined with the memory node info, we can estimate the NUMA efficiency. For example, if the node memory info is:

total=206892 N0=21933 N1=185171

and, as the counters increase, the distribution stays this way while locality is poor, this implies that NUMA balancing cannot migrate the memory on node 1 that is being accessed by tasks on node 0, or that those tasks cannot migrate to node 1 for some reason; you may then consider binding the cgroup to the CPUs of node 1.
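For illustration only, using the example numbers above, here is a small sketch of how the per-node execution share and memory share could be compared to spot the mismatch just described; the figures and the comparison are illustrative, not part of the patch:

    /*
     * Illustration of the exectime vs. memory mismatch described above,
     * using the example numbers from this commit message.
     */
    #include <stdio.h>

    int main(void)
    {
        /* exectime line: ms executed on node 0 and node 1 */
        double exec[2] = { 311900.0, 407166.0 };
        /* memory.numa_stat node counts: pages on node 0 and node 1 */
        double mem[2] = { 21933.0, 185171.0 };
        double exec_sum = exec[0] + exec[1], mem_sum = mem[0] + mem[1];
        int nid;

        for (nid = 0; nid < 2; nid++)
            printf("node %d: %.0f%% of run time, %.0f%% of memory\n",
                   nid,
                   100.0 * exec[nid] / exec_sum,
                   100.0 * mem[nid] / mem_sum);
        /*
         * Prints roughly "node 0: 43% / 11%" and "node 1: 57% / 89%":
         * a gap like this, together with poor locality, is the signal
         * to consider binding the cgroup to node 1's CPUs.
         */
        return 0;
    }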
Signed-off-by: Michael Wang
--- include/linux/memcontrol.h | 1 + mm/memcontrol.c | 13 +++++++++++++ 2 files changed, 14 insertions(+) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0a30d14c9f43..deeca9db17d8 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -190,6 +190,7 @@ enum memcg_numa_locality_interval { struct memcg_stat_numa { u64 locality[NR_NL_INTERVAL]; + u64 exectime; }; #endif diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2edf3f5ac4b9..d5f48365770f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3575,6 +3575,18 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v) seq_printf(m, " %u", jiffies_to_msecs(sum)); } seq_putc(m, '\n'); + + seq_puts(m, "exectime"); + for_each_online_node(nr) { + int cpu; + u64 sum = 0; + + for_each_cpu(cpu, cpumask_of_node(nr)) + sum += per_cpu(memcg->stat_numa->exectime, cpu); + + seq_printf(m, " %llu", jiffies_to_msecs(sum)); + } + seq_putc(m, '\n'); #endif return 0; @@ -3606,6 +3618,7 @@ void memcg_stat_numa_update(struct task_struct *p) memcg = mem_cgroup_from_task(p); if (idx != -1) this_cpu_inc(memcg->stat_numa->locality[idx]); + this_cpu_inc(memcg->stat_numa->exectime); rcu_read_unlock(); } #endif
From patchwork Wed Jul 3 03:32:32 2019
X-Patchwork-Submitter: 王贇 <yun.wang@linux.alibaba.com>
X-Patchwork-Id: 11028885
Subject: [PATCH 3/4] numa: introduce numa group per task group
From: 王贇 <yun.wang@linux.alibaba.com>
To: Peter Zijlstra, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com, Ingo Molnar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, mcgrof@kernel.org, keescook@chromium.org, linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org
Message-ID: <93cf9333-2f9a-ca1e-a4a6-54fc388d1673@linux.alibaba.com>
In-Reply-To: <60b59306-5e36-e587-9145-e90657daec41@linux.alibaba.com>
Date: Wed, 3 Jul 2019 11:32:32 +0800

By tracing NUMA page faults, we recognize tasks sharing the same pages and try to pack them together into a single NUMA group. However, when two tasks share lots of page-cache pages but few anonymous pages, they have no chance to join the same group, since NUMA balancing does not trace page-cache pages. Tracing page-cache pages would cost too much, but we can use hints from userland instead, and the cpu cgroup is a good one.

This patch introduces a new entry, 'numa_group', for the cpu cgroup. By echoing a non-zero value into the entry, we can force all the tasks of the cgroup to join one NUMA group serving the whole task group. In this way tasks are more likely to settle down on the same node, sharing closer CPU caches and gaining NUMA benefit for both file and anonymous pages.

Besides, when multiple cgroups have the numa group enabled, they can exchange task locations via NUMA migration, and in this way each of them can settle down on a single node without breaking load balance.
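As a hypothetical usage sketch (the cgroup path and all values are illustrative; the output format follows show_tg_numa_group() added below), enabling the group and reading the entry back could look like:

    echo 1 > /sys/fs/cgroup/cpu/CGROUP_PATH/cpu.numa_group
    cat /sys/fs/cgroup/cpu/CGROUP_PATH/cpu.numa_group
    id 6786 nr_tasks 32 active_nodes 1
    node 0 mem_private 210935 mem_shared 1480 cpu_private 207026 cpu_shared 1213
    node 1 mem_private 2071 mem_shared 89 cpu_private 1786 cpu_shared 64

Echoing 0 back into the entry evacuates the tasks from the group again; a cgroup that never enabled the feature shows 'disabled'.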
Signed-off-by: Michael Wang --- kernel/sched/core.c | 37 +++++++++++ kernel/sched/fair.c | 175 ++++++++++++++++++++++++++++++++++++++++++++++++++- kernel/sched/sched.h | 14 +++++ 3 files changed, 225 insertions(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index fa43ce3962e7..148c231a4309 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6790,6 +6790,8 @@ void sched_offline_group(struct task_group *tg) { unsigned long flags; + update_tg_numa_group(tg, false); + /* End participation in shares distribution: */ unregister_fair_sched_group(tg); @@ -7277,6 +7279,34 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css, } #endif /* CONFIG_RT_GROUP_SCHED */ +#ifdef CONFIG_NUMA_BALANCING +static DEFINE_MUTEX(numa_mutex); + +static int cpu_numa_group_show(struct seq_file *sf, void *v) +{ + struct task_group *tg = css_tg(seq_css(sf)); + + mutex_lock(&numa_mutex); + show_tg_numa_group(tg, sf); + mutex_unlock(&numa_mutex); + + return 0; +} + +static int cpu_numa_group_write_s64(struct cgroup_subsys_state *css, + struct cftype *cft, s64 numa_group) +{ + int ret; + struct task_group *tg = css_tg(css); + + mutex_lock(&numa_mutex); + ret = update_tg_numa_group(tg, numa_group); + mutex_unlock(&numa_mutex); + + return ret; +} +#endif /* CONFIG_NUMA_BALANCING */ + static struct cftype cpu_legacy_files[] = { #ifdef CONFIG_FAIR_GROUP_SCHED { @@ -7312,6 +7342,13 @@ static struct cftype cpu_legacy_files[] = { .read_u64 = cpu_rt_period_read_uint, .write_u64 = cpu_rt_period_write_uint, }, +#endif +#ifdef CONFIG_NUMA_BALANCING + { + .name = "numa_group", + .write_s64 = cpu_numa_group_write_s64, + .seq_show = cpu_numa_group_show, + }, #endif { } /* Terminate */ }; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b32304817eeb..6cf9c9c61258 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1074,6 +1074,7 @@ struct numa_group { int nr_tasks; pid_t gid; int active_nodes; + bool evacuate; struct rcu_head rcu; unsigned long total_faults; @@ -2247,6 +2248,176 @@ static inline void put_numa_group(struct numa_group *grp) kfree_rcu(grp, rcu); } +void show_tg_numa_group(struct task_group *tg, struct seq_file *sf) +{ + int nid; + struct numa_group *ng = tg->numa_group; + + if (!ng) { + seq_puts(sf, "disabled\n"); + return; + } + + seq_printf(sf, "id %d nr_tasks %d active_nodes %d\n", + ng->gid, ng->nr_tasks, ng->active_nodes); + + for_each_online_node(nid) { + int f_idx = task_faults_idx(NUMA_MEM, nid, 0); + int pf_idx = task_faults_idx(NUMA_MEM, nid, 1); + + seq_printf(sf, "node %d ", nid); + + seq_printf(sf, "mem_private %lu mem_shared %lu ", + ng->faults[f_idx], ng->faults[pf_idx]); + + seq_printf(sf, "cpu_private %lu cpu_shared %lu\n", + ng->faults_cpu[f_idx], ng->faults_cpu[pf_idx]); + } +} + +int update_tg_numa_group(struct task_group *tg, bool numa_group) +{ + struct numa_group *ng = tg->numa_group; + + /* if no change then do nothing */ + if ((ng != NULL) == numa_group) + return 0; + + if (ng) { + /* put and evacuate tg's numa group */ + rcu_assign_pointer(tg->numa_group, NULL); + ng->evacuate = true; + put_numa_group(ng); + } else { + unsigned int size = sizeof(struct numa_group) + + 4*nr_node_ids*sizeof(unsigned long); + + ng = kzalloc(size, GFP_KERNEL | __GFP_NOWARN); + if (!ng) + return -ENOMEM; + + refcount_set(&ng->refcount, 1); + spin_lock_init(&ng->lock); + ng->faults_cpu = ng->faults + NR_NUMA_HINT_FAULT_TYPES * + nr_node_ids; + /* now make tasks see and join */ + rcu_assign_pointer(tg->numa_group, ng); + } + + return 0; +} + 
+static bool tg_numa_group(struct task_struct *p) +{ + int i; + struct task_group *tg; + struct numa_group *grp, *my_grp; + + rcu_read_lock(); + + tg = task_group(p); + if (!tg) + goto no_join; + + grp = rcu_dereference(tg->numa_group); + my_grp = rcu_dereference(p->numa_group); + + if (!grp) + goto no_join; + + if (grp == my_grp) { + if (!grp->evacuate) + goto joined; + + /* + * Evacuate task from tg's numa group + */ + rcu_read_unlock(); + + spin_lock_irq(&grp->lock); + + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) + grp->faults[i] -= p->numa_faults[i]; + + grp->total_faults -= p->total_numa_faults; + grp->nr_tasks--; + + spin_unlock_irq(&grp->lock); + + rcu_assign_pointer(p->numa_group, NULL); + + put_numa_group(grp); + + return false; + } + + if (!get_numa_group(grp)) + goto no_join; + + rcu_read_unlock(); + + /* + * Just join tg's numa group + */ + if (!my_grp) { + spin_lock_irq(&grp->lock); + + if (refcount_read(&grp->refcount) == 2) { + grp->gid = p->pid; + grp->active_nodes = 1; + grp->max_faults_cpu = 0; + } + + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) + grp->faults[i] += p->numa_faults[i]; + + grp->total_faults += p->total_numa_faults; + grp->nr_tasks++; + + spin_unlock_irq(&grp->lock); + rcu_assign_pointer(p->numa_group, grp); + + return true; + } + + /* + * Switch from the task's numa group to the tg's + */ + double_lock_irq(&my_grp->lock, &grp->lock); + + if (refcount_read(&grp->refcount) == 2) { + grp->gid = p->pid; + grp->active_nodes = 1; + grp->max_faults_cpu = 0; + } + + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) { + my_grp->faults[i] -= p->numa_faults[i]; + grp->faults[i] += p->numa_faults[i]; + } + + my_grp->total_faults -= p->total_numa_faults; + grp->total_faults += p->total_numa_faults; + + my_grp->nr_tasks--; + grp->nr_tasks++; + + spin_unlock(&my_grp->lock); + spin_unlock_irq(&grp->lock); + + rcu_assign_pointer(p->numa_group, grp); + + put_numa_group(my_grp); + return true; + +joined: + rcu_read_unlock(); + return true; +no_join: + rcu_read_unlock(); + return false; +} + static void task_numa_group(struct task_struct *p, int cpupid, int flags, int *priv) { @@ -2417,7 +2588,9 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) priv = 1; } else { priv = cpupid_match_pid(p, last_cpupid); - if (!priv && !(flags & TNF_NO_GROUP)) + if (tg_numa_group(p)) + priv = (flags & TNF_SHARED) ? 
0 : priv; + else if (!priv && !(flags & TNF_NO_GROUP)) task_numa_group(p, last_cpupid, flags, &priv); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 802b1f3405f2..b5bc4d804e2d 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -393,6 +393,10 @@ struct task_group { #endif struct cfs_bandwidth cfs_bandwidth; + +#ifdef CONFIG_NUMA_BALANCING + void *numa_group; +#endif }; #ifdef CONFIG_FAIR_GROUP_SCHED @@ -1286,11 +1290,21 @@ extern int migrate_task_to(struct task_struct *p, int cpu); extern int migrate_swap(struct task_struct *p, struct task_struct *t, int cpu, int scpu); extern void init_numa_balancing(unsigned long clone_flags, struct task_struct *p); +extern void show_tg_numa_group(struct task_group *tg, struct seq_file *sf); +extern int update_tg_numa_group(struct task_group *tg, bool numa_group); #else static inline void init_numa_balancing(unsigned long clone_flags, struct task_struct *p) { } +static inline void +show_tg_numa_group(struct task_group *tg, struct seq_file *sf) +{ +} +static inline int update_tg_numa_group(struct task_group *tg, bool numa_group) +{ + return 0; +} #endif /* CONFIG_NUMA_BALANCING */ #ifdef CONFIG_SMP
From patchwork Wed Jul 3 03:34:16 2019
X-Patchwork-Submitter: 王贇 <yun.wang@linux.alibaba.com>
X-Patchwork-Id: 11028889
Subject: [PATCH 4/4] numa: introduce numa cling feature
From: 王贇 <yun.wang@linux.alibaba.com>
To: Peter Zijlstra, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com, Ingo Molnar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, mcgrof@kernel.org, keescook@chromium.org, linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org
Message-ID: <9a440936-1e5d-d3bb-c795-ef6f9839a021@linux.alibaba.com>
In-Reply-To: <60b59306-5e36-e587-9145-e90657daec41@linux.alibaba.com>
Date: Wed, 3 Jul 2019 11:34:16 +0800

Although we pay a lot of effort to settle a task down on a particular node, there are still chances for it to leave its preferred node, namely by wakeup, NUMA swap migration or load balancing. When cpu cgroups are used in a shared way, all the workloads see all the CPUs, which can be really bad, especially when there are many fast wakeups. Even though we can now put the tasks into NUMA groups, they will not really stay on the same node. For example, with numa groups ng_A, ng_B, ng_C and ng_D, the likely result is:

CPU Usage:
  Node 0            Node 1
  ng_A(600%)        ng_A(400%)
  ng_B(400%)        ng_B(600%)
  ng_C(400%)        ng_C(600%)
  ng_D(600%)        ng_D(400%)

Memory Ratio:
  Node 0            Node 1
  ng_A(60%)         ng_A(40%)
  ng_B(40%)         ng_B(60%)
  ng_C(40%)         ng_C(60%)
  ng_D(60%)         ng_D(40%)

Locality is not too bad here, but it is far from the best situation; we want a numa group to settle down thoroughly on one particular node, with everything balanced.

Thus we introduce numa cling, which tries to prevent tasks from leaving their preferred node on the wakeup fast path. This helps workloads settle down thoroughly on a single node, but when multiple numa groups try to settle down on the same node, unbalancing can happen.
For example, with numa groups ng_A, ng_B, ng_C and ng_D, it may end up like:

CPU Usage:
  Node 0            Node 1
  ng_A(1000%)       ng_B(1000%)
  ng_C(400%)        ng_C(600%)
  ng_D(400%)        ng_D(600%)

Memory Ratio:
  Node 0            Node 1
  ng_A(100%)        ng_B(100%)
  ng_C(10%)         ng_C(90%)
  ng_D(10%)         ng_D(90%)

This happens because, once ng_C and ng_D have most of their memory on node 1, a task_x of ng_C staying on node 0 will do a NUMA swap migration with a task_y of ng_D staying on node 1, as long as load stays balanced. The result is that task_x ends up on node 1 and task_y on node 0, while both of them prefer node 1. Now when other tasks of ng_D on node 1 wake up task_y, task_y will very likely go back to node 1, and since numa cling is enabled it will keep staying on node 1 even though load is unbalanced. This can happen frequently, more and more tasks will come to prefer node 1, and the node becomes busy.

So the key point is to stop doing numa cling when load starts to become unbalanced. We achieve this by monitoring the migration failure ratio: in the scenario above, too many tasks prefer node 1 and keep migrating to it, the resulting unbalance leads to migration failures, and when the failure ratio rises above the specified degree we pause the cling and try to resettle the workloads on a better node by stopping tasks from preferring the busy node. This finally gives us:

CPU Usage:
  Node 0            Node 1
  ng_A(1000%)       ng_B(1000%)
  ng_C(1000%)       ng_D(1000%)

Memory Ratio:
  Node 0            Node 1
  ng_A(100%)        ng_B(100%)
  ng_C(100%)        ng_D(100%)

Now we achieve the best locality and the maximum hot-cache benefit.

Tested on a 2-node box with 96 CPUs, running sysbench-mysql-oltp_read_write: X mysqld instances were created and attached to X cgroups, then X sysbench instances were created and attached to the corresponding cgroups to test mysql with the oltp_read_write script for 20 minutes. Average eps (events per second):

                                origin      ng + cling
  4 instances each 24 threads   7545.28     7790.49     +3.25%
  4 instances each 48 threads   9359.36     9832.30     +5.05%
  4 instances each 72 threads   9602.88     10196.95    +6.19%
  8 instances each 24 threads   4478.82     4508.82     +0.67%
  8 instances each 48 threads   5514.90     5689.93     +3.17%
  8 instances each 72 threads   5582.19     5741.33     +2.85%

Also tested with perf-bench-numa, dbench, sysbench-memory and pgbench; tiny improvements were observed.
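To make the pause condition concrete, here is a small sketch (illustration only, not part of the patch) of the "busy node" test described above; it mirrors the failure-ratio arithmetic of update_failure_ratio()/busy_node() below, while omitting the single-busiest-node restriction, and the sample counters are made up:

    /*
     * Illustration of the "pause numa cling" condition: a node counts
     * as busy once the scaled migration-failure ratio exceeds the
     * numa_balancing_cling_degree sysctl (default 20).
     */
    #include <stdio.h>

    #define NUMA_MIGRATE_WEIGHT	1024

    static unsigned int cling_degree = 20;  /* numa_balancing_cling_degree */

    static int node_busy(unsigned long failure_scaled, unsigned long total_scaled)
    {
        unsigned long ratio;

        /* too few failure samples yet, never call the node busy */
        if (failure_scaled < NUMA_MIGRATE_WEIGHT)
            return 0;

        ratio = failure_scaled * 100 / (total_scaled + 1);
        return ratio > cling_degree;
    }

    int main(void)
    {
        /* e.g. 12 failed out of 40 scaled migration attempts */
        unsigned long failed = 12 * NUMA_MIGRATE_WEIGHT;
        unsigned long total = 40 * NUMA_MIGRATE_WEIGHT;

        /* ~30% failure ratio > 20% degree, so prints busy=1 */
        printf("busy=%d\n", node_busy(failed, total));
        return 0;
    }

In the patch itself the counters are decayed by a periodic timer, so the ratio tracks recent behaviour rather than the whole history.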
Signed-off-by: Michael Wang --- include/linux/sched/sysctl.h | 3 + kernel/sched/fair.c | 283 +++++++++++++++++++++++++++++++++++++++++-- kernel/sysctl.c | 9 ++ 3 files changed, 283 insertions(+), 12 deletions(-) diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index d4f6215ee03f..6eef34331dd2 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -38,6 +38,9 @@ extern unsigned int sysctl_numa_balancing_scan_period_min; extern unsigned int sysctl_numa_balancing_scan_period_max; extern unsigned int sysctl_numa_balancing_scan_size; +extern unsigned int sysctl_numa_balancing_cling_degree; +extern unsigned int max_numa_balancing_cling_degree; + #ifdef CONFIG_SCHED_DEBUG extern __read_mostly unsigned int sysctl_sched_migration_cost; extern __read_mostly unsigned int sysctl_sched_nr_migrate; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6cf9c9c61258..a4a48cdd2bbd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1067,6 +1067,20 @@ unsigned int sysctl_numa_balancing_scan_size = 256; /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */ unsigned int sysctl_numa_balancing_scan_delay = 1000; +/* + * The numa group serving task group will enable numa cling, a feature + * which try to prevent task leaving preferred node on wakeup. + * + * This help settle down the workloads thorouly and quickly on node, + * while introduce the risk of load unbalancing. + * + * In order to detect the risk in advance and pause the feature, we + * rely on numa migration failure stats, and when failure ratio above + * cling degree, we pause the numa cling until resettle done. + */ +unsigned int sysctl_numa_balancing_cling_degree = 20; +unsigned int max_numa_balancing_cling_degree = 100; + struct numa_group { refcount_t refcount; @@ -1074,11 +1088,15 @@ struct numa_group { int nr_tasks; pid_t gid; int active_nodes; + int busiest_nid; bool evacuate; + bool do_cling; + struct timer_list cling_timer; struct rcu_head rcu; unsigned long total_faults; unsigned long max_faults_cpu; + unsigned long *migrate_stat; /* * Faults_cpu is used to decide whether memory should move * towards the CPU. As a consequence, these stats are weighted @@ -1088,6 +1106,8 @@ struct numa_group { unsigned long faults[0]; }; +static inline bool busy_node(struct numa_group *ng, int nid); + static inline unsigned long group_faults_priv(struct numa_group *ng); static inline unsigned long group_faults_shared(struct numa_group *ng); @@ -1132,8 +1152,14 @@ static unsigned int task_scan_start(struct task_struct *p) unsigned long smin = task_scan_min(p); unsigned long period = smin; - /* Scale the maximum scan period with the amount of shared memory. */ - if (p->numa_group) { + /* + * Scale the maximum scan period with the amount of shared memory. + * + * Not for the numa group serving task group, it's tasks are not + * gathered for sharing memory, and we need to detect migration + * failure in time. + */ + if (p->numa_group && !p->numa_group->do_cling) { struct numa_group *ng = p->numa_group; unsigned long shared = group_faults_shared(ng); unsigned long private = group_faults_priv(ng); @@ -1154,8 +1180,14 @@ static unsigned int task_scan_max(struct task_struct *p) /* Watch for min being lower than max due to floor calculations */ smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p); - /* Scale the maximum scan period with the amount of shared memory. 
*/ - if (p->numa_group) { + /* + * Scale the maximum scan period with the amount of shared memory. + * + * Not for the numa group serving task group, it's tasks are not + * gathered for sharing memory, and we need to detect migration + * failure in time. + */ + if (p->numa_group && !p->numa_group->do_cling) { struct numa_group *ng = p->numa_group; unsigned long shared = group_faults_shared(ng); unsigned long private = group_faults_priv(ng); @@ -1475,6 +1507,19 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page, ACTIVE_NODE_FRACTION) return true; + /* + * Make sure pages do not stay on a busy node when numa cling + * enabled, otherwise they could lead into more numa migration + * to the busy node. + */ + if (ng->do_cling) { + if (busy_node(ng, dst_nid)) + return false; + + if (busy_node(ng, src_nid)) + return true; + } + /* * Distribute memory according to CPU & memory use on each node, * with 3/4 hysteresis to avoid unnecessary memory migrations: @@ -1874,9 +1919,190 @@ static int task_numa_migrate(struct task_struct *p) return ret; } +/* + * We scale the migration stat count to 1024, divide the maximum numa + * balancing scan period by 10 and make that the period of cling timer, + * this help to decay one count to 0 after one maximum scan period passed. + */ +#define NUMA_MIGRATE_SCALE 10 +#define NUMA_MIGRATE_WEIGHT 1024 + +enum numa_migrate_stats { + FAILURE_SCALED, + TOTAL_SCALED, + FAILURE_RATIO, +}; + +static inline int mstat_idx(int nid, enum numa_migrate_stats s) +{ + return (nid + s * nr_node_ids); +} + +static inline unsigned long +mstat_failure_scaled(struct numa_group *ng, int nid) +{ + return ng->migrate_stat[mstat_idx(nid, FAILURE_SCALED)]; +} + +static inline unsigned long +mstat_total_scaled(struct numa_group *ng, int nid) +{ + return ng->migrate_stat[mstat_idx(nid, TOTAL_SCALED)]; +} + +static inline unsigned long +mstat_failure_ratio(struct numa_group *ng, int nid) +{ + return ng->migrate_stat[mstat_idx(nid, FAILURE_RATIO)]; +} + +/* + * A node is busy when the numa migration toward it failed too much, + * this imply the load already unbalancing for too much numa cling on + * that node. + */ +static inline bool busy_node(struct numa_group *ng, int nid) +{ + int degree = sysctl_numa_balancing_cling_degree; + + if (mstat_failure_scaled(ng, nid) < NUMA_MIGRATE_WEIGHT) + return false; + + /* + * Allow only one busy node in one numa group, to prevent + * ping-pong migration case between nodes. + */ + if (ng->busiest_nid != nid) + return false; + + return mstat_failure_ratio(ng, nid) > degree; +} + +/* + * Return true if the task should cling to snid, when it preferred snid + * rather than dnid and snid is not busy. + */ +static inline bool +task_numa_cling(struct task_struct *p, int snid, int dnid) +{ + bool ret = false; + int pnid = p->numa_preferred_nid; + struct numa_group *ng; + + rcu_read_lock(); + + ng = p->numa_group; + + /* Do cling only when the feature enabled and not in pause */ + if (!ng || !ng->do_cling) + goto out; + + if (pnid == NUMA_NO_NODE || + dnid == pnid || + snid != pnid) + goto out; + + /* Never allow cling to a busy node */ + if (busy_node(ng, snid)) + goto out; + + ret = true; +out: + rcu_read_unlock(); + return ret; +} + +/* + * Prevent more tasks from prefer the busy node to easy the unbalancing, + * also give the second candidate a chance. 
+ */ +static inline bool group_pause_prefer(struct numa_group *ng, int nid) +{ + if (!ng || !ng->do_cling) + return false; + + return busy_node(ng, nid); +} + +static inline void update_failure_ratio(struct numa_group *ng, int nid) +{ + int f_idx = mstat_idx(nid, FAILURE_SCALED); + int t_idx = mstat_idx(nid, TOTAL_SCALED); + int fp_idx = mstat_idx(nid, FAILURE_RATIO); + + ng->migrate_stat[fp_idx] = + ng->migrate_stat[f_idx] * 100 / (ng->migrate_stat[t_idx] + 1); +} + +static void cling_timer_func(struct timer_list *t) +{ + int nid; + unsigned int degree; + unsigned long period, max_failure; + struct numa_group *ng = from_timer(ng, t, cling_timer); + + degree = sysctl_numa_balancing_cling_degree; + period = msecs_to_jiffies(sysctl_numa_balancing_scan_period_max); + period /= NUMA_MIGRATE_SCALE; + + spin_lock_irq(&ng->lock); + + max_failure = 0; + for_each_online_node(nid) { + int f_idx = mstat_idx(nid, FAILURE_SCALED); + int t_idx = mstat_idx(nid, TOTAL_SCALED); + + ng->migrate_stat[f_idx] /= 2; + ng->migrate_stat[t_idx] /= 2; + + update_failure_ratio(ng, nid); + + if (ng->migrate_stat[f_idx] > max_failure) { + ng->busiest_nid = nid; + max_failure = ng->migrate_stat[f_idx]; + } + } + + spin_unlock_irq(&ng->lock); + + mod_timer(&ng->cling_timer, jiffies + period); +} + +static inline void +update_migrate_stat(struct task_struct *p, int nid, bool failed) +{ + int idx; + struct numa_group *ng = p->numa_group; + + if (!ng || !ng->do_cling) + return; + + spin_lock_irq(&ng->lock); + + if (failed) { + idx = mstat_idx(nid, FAILURE_SCALED); + ng->migrate_stat[idx] += NUMA_MIGRATE_WEIGHT; + } + + idx = mstat_idx(nid, TOTAL_SCALED); + ng->migrate_stat[idx] += NUMA_MIGRATE_WEIGHT; + update_failure_ratio(ng, nid); + + spin_unlock_irq(&ng->lock); + + /* + * On failed task may prefer source node instead, this + * cause ping-pong migration when numa cling enabled, + * so let's reset the preferred node to none. + */ + if (failed) + sched_setnuma(p, NUMA_NO_NODE); +} + /* Attempt to migrate a task to a CPU on the preferred node. 
*/ static void numa_migrate_preferred(struct task_struct *p) { + bool failed, target; unsigned long interval = HZ; /* This task has no NUMA fault statistics yet */ @@ -1891,8 +2117,12 @@ static void numa_migrate_preferred(struct task_struct *p) if (task_node(p) == p->numa_preferred_nid) return; + target = p->numa_preferred_nid; + /* Otherwise, try migrate to a CPU on the preferred node */ - task_numa_migrate(p); + failed = (task_numa_migrate(p) != 0); + + update_migrate_stat(p, target, failed); } /* @@ -2216,7 +2446,8 @@ static void task_numa_placement(struct task_struct *p) max_faults = faults; max_nid = nid; } - } else if (group_faults > max_faults) { + } else if (group_faults > max_faults && + !group_pause_prefer(p->numa_group, nid)) { max_faults = group_faults; max_nid = nid; } @@ -2258,8 +2489,10 @@ void show_tg_numa_group(struct task_group *tg, struct seq_file *sf) return; } - seq_printf(sf, "id %d nr_tasks %d active_nodes %d\n", - ng->gid, ng->nr_tasks, ng->active_nodes); + spin_lock_irq(&ng->lock); + + seq_printf(sf, "id %d nr_tasks %d active_nodes %d busiest_nid %d\n", + ng->gid, ng->nr_tasks, ng->active_nodes, ng->busiest_nid); for_each_online_node(nid) { int f_idx = task_faults_idx(NUMA_MEM, nid, 0); @@ -2270,9 +2503,16 @@ void show_tg_numa_group(struct task_group *tg, struct seq_file *sf) seq_printf(sf, "mem_private %lu mem_shared %lu ", ng->faults[f_idx], ng->faults[pf_idx]); - seq_printf(sf, "cpu_private %lu cpu_shared %lu\n", + seq_printf(sf, "cpu_private %lu cpu_shared %lu ", ng->faults_cpu[f_idx], ng->faults_cpu[pf_idx]); + + seq_printf(sf, "migrate_stat %lu %lu %lu\n", + mstat_failure_scaled(ng, nid), + mstat_total_scaled(ng, nid), + mstat_failure_ratio(ng, nid)); } + + spin_unlock_irq(&ng->lock); } int update_tg_numa_group(struct task_group *tg, bool numa_group) @@ -2286,20 +2526,26 @@ int update_tg_numa_group(struct task_group *tg, bool numa_group) if (ng) { /* put and evacuate tg's numa group */ rcu_assign_pointer(tg->numa_group, NULL); + del_timer_sync(&ng->cling_timer); ng->evacuate = true; put_numa_group(ng); } else { unsigned int size = sizeof(struct numa_group) + - 4*nr_node_ids*sizeof(unsigned long); + 7*nr_node_ids*sizeof(unsigned long); + unsigned int offset = NR_NUMA_HINT_FAULT_TYPES * nr_node_ids; ng = kzalloc(size, GFP_KERNEL | __GFP_NOWARN); if (!ng) return -ENOMEM; refcount_set(&ng->refcount, 1); + ng->busiest_nid = NUMA_NO_NODE; + ng->do_cling = true; + timer_setup(&ng->cling_timer, cling_timer_func, 0); spin_lock_init(&ng->lock); - ng->faults_cpu = ng->faults + NR_NUMA_HINT_FAULT_TYPES * - nr_node_ids; + ng->faults_cpu = ng->faults + offset; + ng->migrate_stat = ng->faults_cpu + offset; + add_timer(&ng->cling_timer); /* now make tasks see and join */ rcu_assign_pointer(tg->numa_group, ng); } @@ -2436,6 +2682,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, return; refcount_set(&grp->refcount, 1); + grp->busiest_nid = NUMA_NO_NODE; grp->active_nodes = 1; grp->max_faults_cpu = 0; spin_lock_init(&grp->lock); @@ -2879,6 +3126,11 @@ static inline void update_scan_period(struct task_struct *p, int new_cpu) { } +static inline bool task_numa_cling(struct task_struct *p, int snid, int dnid) +{ + return false; +} + #endif /* CONFIG_NUMA_BALANCING */ static void @@ -6195,6 +6447,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) if ((unsigned)i < nr_cpumask_bits) return i; + /* + * Failed to find an idle cpu, wake affine may want to pull but + * try stay on prev-cpu when the task cling to it. 
+ */ + if (task_numa_cling(p, cpu_to_node(prev), cpu_to_node(target))) + return prev; + return target; } diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 078950d9605b..0a889dd1c7ed 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -417,6 +417,15 @@ static struct ctl_table kern_table[] = { .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ONE, }, + { + .procname = "numa_balancing_cling_degree", + .data = &sysctl_numa_balancing_cling_degree, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = &max_numa_balancing_cling_degree, + }, { .procname = "numa_balancing", .data = NULL, /* filled in by handler */