From patchwork Tue Jul 16 03:39:29 2019
X-Patchwork-Submitter: 王贇 (Michael Wang)
X-Patchwork-Id: 11045189
Subject: [PATCH v2 1/4] numa: introduce per-cgroup numa balancing locality statistic
From: 王贇 (Michael Wang) <yun.wang@linux.alibaba.com>
To: Peter Zijlstra, Ingo Molnar, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, mcgrof@kernel.org, keescook@chromium.org, linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org, Michal Koutný, Hillf Danton
Message-ID: <120ffcaa-0281-5d30-c0c1-9464d93e935f@linux.alibaba.com>
Date: Tue, 16 Jul 2019 11:39:29 +0800

This patch introduces a NUMA locality statistic, which tries to reflect
the per-cgroup NUMA balancing efficiency.

During NUMA balancing, we trace the local page access ratio of tasks,
which we call the locality.

By running 'cat /sys/fs/cgroup/cpu/CGROUP_PATH/cpu.numa_stat', we see
an output line starting with 'locality', like:

  locality 15393 21259 13023 44461 21247 17012 28496 145402

The locality is divided into 8 regions, each number standing for the
milliseconds during which we saw a task running with a locality inside
that region. For example, here tasks with locality around 0~12% ran for
15393 ms, and tasks with locality around 88~100% ran for 145402 ms.

By monitoring the increments, we can check whether the workloads of a
particular cgroup are doing well with NUMA: when most of the tasks are
running in the low-locality regions, something is wrong with your NUMA
policy.
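Not part of the patch itself — a userspace sketch of how the 'locality' line above could be consumed, assuming the 8-bucket layout and cumulative-millisecond semantics described in the changelog. The sample values are made up (the second sample just extends the example numbers), and the 75% threshold is an arbitrary illustration:

```python
# Sketch: interpreting two successive samples of the 'locality' line from
# cpu.numa_stat. Bucket i covers locality [i*12.5%, (i+1)*12.5%); each
# value is a cumulative millisecond counter, so deltas give recent runtime.

def locality_deltas(prev, curr):
    """Per-bucket milliseconds spent since the previous sample."""
    return [c - p for p, c in zip(prev, curr)]

def high_locality_share(deltas, first_good_bucket=6):
    """Fraction of recent runtime spent at >=75% locality (buckets 6..7)."""
    total = sum(deltas)
    return sum(deltas[first_good_bucket:]) / total if total else 0.0

# Hypothetical samples taken some interval apart:
prev = [15393, 21259, 13023, 44461, 21247, 17012, 28496, 145402]
curr = [15393, 21259, 13023, 44961, 21247, 17012, 29496, 155402]
deltas = locality_deltas(prev, curr)  # [0, 0, 0, 500, 0, 0, 1000, 10000]
print(round(high_locality_share(deltas), 3))
```

A share close to 1.0 means the cgroup's tasks mostly access local memory; a share dominated by the low buckets is the "something is wrong with your NUMA policy" case the changelog describes.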
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
Since v1:
  * move implementation from memory cgroup into cpu group
  * introduce new entry 'numa_stat' to present locality
  * locality now accounting in hierarchical way
  * locality now accounted into 8 regions equally

 include/linux/sched.h |  8 +++++++-
 kernel/sched/core.c   | 40 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/debug.c  |  7 +++++++
 kernel/sched/fair.c   | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h  | 29 +++++++++++++++++++++++++++++
 5 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 907808f1acc5..eb26098de6ea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1117,8 +1117,14 @@ struct task_struct {
	 * scan window were remote/local or failed to migrate. The task scan
	 * period is adapted based on the locality of the faults with different
	 * weights depending on whether they were shared or private faults
+	 *
+	 * 0 -- remote faults
+	 * 1 -- local faults
+	 * 2 -- page migration failure
+	 * 3 -- remote page accessing
+	 * 4 -- local page accessing
	 */
-	unsigned long			numa_faults_locality[3];
+	unsigned long			numa_faults_locality[5];

	unsigned long			numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fa43ce3962e7..71a8d3ed8495 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6367,6 +6367,10 @@ static struct kmem_cache *task_group_cache __read_mostly;
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
 DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);

+#ifdef CONFIG_NUMA_BALANCING
+DECLARE_PER_CPU(struct numa_stat, root_numa_stat);
+#endif
+
 void __init sched_init(void)
 {
	unsigned long alloc_size = 0, ptr;
@@ -6416,6 +6420,10 @@ void __init sched_init(void)
	init_defrootdomain();
 #endif

+#ifdef CONFIG_NUMA_BALANCING
+	root_task_group.numa_stat = &root_numa_stat;
+#endif
+
 #ifdef CONFIG_RT_GROUP_SCHED
	init_rt_bandwidth(&root_task_group.rt_bandwidth,
			global_rt_period(), global_rt_runtime());
@@ -6727,6 +6735,7 @@ static DEFINE_SPINLOCK(task_group_lock);

 static void sched_free_group(struct task_group *tg)
 {
+	free_tg_numa_stat(tg);
	free_fair_sched_group(tg);
	free_rt_sched_group(tg);
	autogroup_free(tg);
@@ -6742,6 +6751,9 @@ struct task_group *sched_create_group(struct task_group *parent)
	if (!tg)
		return ERR_PTR(-ENOMEM);

+	if (!alloc_tg_numa_stat(tg))
+		goto err;
+
	if (!alloc_fair_sched_group(tg, parent))
		goto err;

@@ -7277,6 +7289,28 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */

+#ifdef CONFIG_NUMA_BALANCING
+static int cpu_numa_stat_show(struct seq_file *sf, void *v)
+{
+	int nr;
+	struct task_group *tg = css_tg(seq_css(sf));
+
+	seq_puts(sf, "locality");
+	for (nr = 0; nr < NR_NL_INTERVAL; nr++) {
+		int cpu;
+		u64 sum = 0;
+
+		for_each_possible_cpu(cpu)
+			sum += per_cpu(tg->numa_stat->locality[nr], cpu);
+
+		seq_printf(sf, " %u", jiffies_to_msecs(sum));
+	}
+	seq_putc(sf, '\n');
+
+	return 0;
+}
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
	{
@@ -7312,6 +7346,12 @@ static struct cftype cpu_legacy_files[] = {
		.read_u64 = cpu_rt_period_read_uint,
		.write_u64 = cpu_rt_period_write_uint,
	},
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	{
+		.name = "numa_stat",
+		.seq_show = cpu_numa_stat_show,
+	},
 #endif
	{ }	/* Terminate */
 };
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f7e4579e746c..a22b2a62aee2 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -848,6 +848,13 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m)
	P(total_numa_faults);
	SEQ_printf(m, "current_node=%d, numa_group_id=%d\n",
			task_node(p), task_numa_group_id(p));
+	SEQ_printf(m, "faults_locality local=%lu remote=%lu failed=%lu ",
+			p->numa_faults_locality[1],
+			p->numa_faults_locality[0],
+			p->numa_faults_locality[2]);
+	SEQ_printf(m, "lhit=%lu rhit=%lu\n",
+			p->numa_faults_locality[4],
+			p->numa_faults_locality[3]);
	show_numa_stats(p, m);
	mpol_put(pol);
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 036be95a87e9..cd716355d70e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2449,6 +2449,12 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
	p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;
	p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages;
	p->numa_faults_locality[local] += pages;
+	/*
+	 * We want to have the real local/remote page access statistic
+	 * here, so use 'mem_node' which is the real residential node of
+	 * page after migrate_misplaced_page().
+	 */
+	p->numa_faults_locality[3 + !!(mem_node == numa_node_id())] += pages;
 }

 static void reset_ptenuma_scan(struct task_struct *p)
@@ -2611,6 +2617,47 @@ void task_numa_work(struct callback_head *work)
	}
 }

+DEFINE_PER_CPU(struct numa_stat, root_numa_stat);
+
+int alloc_tg_numa_stat(struct task_group *tg)
+{
+	tg->numa_stat = alloc_percpu(struct numa_stat);
+	if (!tg->numa_stat)
+		return 0;
+
+	return 1;
+}
+
+void free_tg_numa_stat(struct task_group *tg)
+{
+	free_percpu(tg->numa_stat);
+}
+
+static void update_tg_numa_stat(struct task_struct *p)
+{
+	struct task_group *tg;
+	unsigned long remote = p->numa_faults_locality[3];
+	unsigned long local = p->numa_faults_locality[4];
+	int idx = -1;
+
+	/* Tobe scaled? */
+	if (remote || local)
+		idx = NR_NL_INTERVAL * local / (remote + local + 1);
+
+	rcu_read_lock();
+
+	tg = task_group(p);
+	while (tg) {
+		/* skip account when there are no faults records */
+		if (idx != -1)
+			this_cpu_inc(tg->numa_stat->locality[idx]);
+
+		tg = tg->parent;
+	}
+
+	rcu_read_unlock();
+}
+
 /*
  * Drive the periodic memory faults..
  */
@@ -2625,6 +2672,8 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
		return;

+	update_tg_numa_stat(curr);
+
	/*
	 * Using runtime rather than walltime has the dual advantage that
	 * we (mostly) drive the selection from busy threads and that the
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 802b1f3405f2..685a9e670880 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -353,6 +353,17 @@ struct cfs_bandwidth {
 #endif
 };

+#ifdef CONFIG_NUMA_BALANCING
+
+/* NUMA Locality Interval, 8 bucket for cache align */
+#define NR_NL_INTERVAL	8
+
+struct numa_stat {
+	u64 locality[NR_NL_INTERVAL];
+};
+
+#endif
+
 /* Task group related information */
 struct task_group {
	struct cgroup_subsys_state css;
@@ -393,8 +404,26 @@ struct task_group {
 #endif

	struct cfs_bandwidth	cfs_bandwidth;
+
+#ifdef CONFIG_NUMA_BALANCING
+	struct numa_stat __percpu *numa_stat;
+#endif
 };

+#ifdef CONFIG_NUMA_BALANCING
+int alloc_tg_numa_stat(struct task_group *tg);
+void free_tg_numa_stat(struct task_group *tg);
+#else
+static int alloc_tg_numa_stat(struct task_group *tg)
+{
+	return 1;
+}
+
+static void free_tg_numa_stat(struct task_group *tg)
+{
+}
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #define ROOT_TASK_GROUP_LOAD	NICE_0_LOAD

From patchwork Tue Jul 16 03:40:35 2019
X-Patchwork-Submitter: 王贇 (Michael Wang)
X-Patchwork-Id: 11045193
Subject: [PATCH v2 2/4] numa: append per-node execution time in cpu.numa_stat
From: 王贇 (Michael Wang) <yun.wang@linux.alibaba.com>
To: Peter Zijlstra, Ingo Molnar, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, mcgrof@kernel.org, keescook@chromium.org, linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org, Michal Koutný, Hillf Danton
Message-ID: <6973a1bf-88f2-b54e-726d-8b7d95d80197@linux.alibaba.com>
Date: Tue, 16 Jul 2019 11:40:35 +0800

This patch introduces per-node execution time information, to help
estimate NUMA efficiency.

By running 'cat /sys/fs/cgroup/cpu/CGROUP_PATH/cpu.numa_stat', we now
see a new output line starting with 'exectime', like:

  exectime 311900 407166

which means the tasks of this cgroup executed for 311900 milliseconds
on node 0, and for 407166 milliseconds on node 1.

Combined with the per-node memory info from the memory cgroup, we can
estimate the NUMA efficiency. For example, suppose memory.numa_stat
shows:

  total=206892 N0=21933 N1=185171

By monitoring the increments, if the topology stays this way while the
locality is poor, this implies that NUMA balancing cannot migrate the
memory from node 1 to node 0, where tasks on node 0 are accessing it,
or that the tasks cannot migrate to node 1 for some reason; you may
then consider binding the workload to the CPUs of node 1.
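Not part of the patch — a userspace sketch of the comparison the changelog describes, combining the 'exectime' line with memory.numa_stat counters. The numbers are the changelog's own examples; the "mismatch" score (total variation distance between the two per-node distributions) is an illustrative metric of my choosing, not something the kernel reports:

```python
# Sketch: estimating the gap between where a cgroup runs and where its
# memory lives. A large gap hints that binding the workload to the
# memory-heavy node may help, as suggested in the changelog.

def shares(values):
    """Normalize a list of per-node counters into fractions."""
    total = sum(values)
    return [v / total for v in values]

exectime_ms = [311900, 407166]   # 'exectime' line from cpu.numa_stat
mem_pages = [21933, 185171]      # N0/N1 from memory.numa_stat

exec_share = shares(exectime_ms)
mem_share = shares(mem_pages)

# Total variation distance: 0.0 = run exactly where memory is,
# 1.0 = run entirely on the wrong node(s).
mismatch = sum(abs(e - m) for e, m in zip(exec_share, mem_share)) / 2
print(round(mismatch, 2))
```

Here execution is split roughly 43/57 across the nodes while ~89% of the memory sits on node 1, so the score flags a substantial mismatch.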
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
Since v1:
  * move implementation from memory cgroup into cpu group
  * exectime now accounting in hierarchical way
  * change member name into jiffies

 kernel/sched/core.c  | 12 ++++++++++++
 kernel/sched/fair.c  |  2 ++
 kernel/sched/sched.h |  1 +
 3 files changed, 15 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 71a8d3ed8495..f8aa73aa879b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7307,6 +7307,18 @@ static int cpu_numa_stat_show(struct seq_file *sf, void *v)
	}
	seq_putc(sf, '\n');

+	seq_puts(sf, "exectime");
+	for_each_online_node(nr) {
+		int cpu;
+		u64 sum = 0;
+
+		for_each_cpu(cpu, cpumask_of_node(nr))
+			sum += per_cpu(tg->numa_stat->jiffies, cpu);
+
+		seq_printf(sf, " %u", jiffies_to_msecs(sum));
+	}
+	seq_putc(sf, '\n');
+
	return 0;
 }
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd716355d70e..2c362266af76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2652,6 +2652,8 @@ static void update_tg_numa_stat(struct task_struct *p)
		if (idx != -1)
			this_cpu_inc(tg->numa_stat->locality[idx]);

+		this_cpu_inc(tg->numa_stat->jiffies);
+
		tg = tg->parent;
	}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 685a9e670880..456f83f7f595 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -360,6 +360,7 @@ struct cfs_bandwidth {

 struct numa_stat {
	u64 locality[NR_NL_INTERVAL];
+	u64 jiffies;
 };

 #endif

From patchwork Tue Jul 16 03:41:07 2019
X-Patchwork-Submitter: 王贇 (Michael Wang)
X-Patchwork-Id: 11045197
Subject: [PATCH v2 3/4] numa: introduce numa group per task group
From: 王贇 (Michael Wang) <yun.wang@linux.alibaba.com>
To: Peter Zijlstra, Ingo Molnar, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, mcgrof@kernel.org, keescook@chromium.org, linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org, Michal Koutný, Hillf Danton
Date: Tue, 16 Jul 2019 11:41:07 +0800

By tracing NUMA page faults, we recognize tasks sharing the same pages
and try to pack them together into a single numa group.

However, when two tasks share a lot of page-cache pages but not many
anonymous pages, they have no chance to join the same group, since NUMA
balancing does not trace page-cache pages.

Tracing page-cache pages would cost too much, but we can take hints
from userland, and the cpu cgroup is a good candidate.

This patch introduces a new entry 'numa_group' for the cpu cgroup. By
echoing a non-zero value into the entry, we can now force all the tasks
of this cgroup to join the same numa group serving the task group. This
way the tasks are more likely to settle down on the same node, sharing
closer CPU caches and gaining benefit from NUMA for both file and
anonymous pages.

Besides, when multiple cgroups have enabled a numa group, they will be
able to exchange task locations via NUMA migration; this way they can
each achieve single-node settle-down without breaking load balance.
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
Since v1:
  * just rebase, no logical changes

 kernel/sched/core.c  |  33 +++++++++
 kernel/sched/fair.c  | 175 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  11 ++++
 3 files changed, 218 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f8aa73aa879b..9f100c48d6e4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6802,6 +6802,8 @@ void sched_offline_group(struct task_group *tg)
 {
	unsigned long flags;

+	update_tg_numa_group(tg, false);
+
	/* End participation in shares distribution: */
	unregister_fair_sched_group(tg);

@@ -7321,6 +7323,32 @@ static int cpu_numa_stat_show(struct seq_file *sf, void *v)

	return 0;
 }
+
+static DEFINE_MUTEX(numa_mutex);
+
+static int cpu_numa_group_show(struct seq_file *sf, void *v)
+{
+	struct task_group *tg = css_tg(seq_css(sf));
+
+	mutex_lock(&numa_mutex);
+	show_tg_numa_group(tg, sf);
+	mutex_unlock(&numa_mutex);
+
+	return 0;
+}
+
+static int cpu_numa_group_write_s64(struct cgroup_subsys_state *css,
+			struct cftype *cft, s64 numa_group)
+{
+	int ret;
+	struct task_group *tg = css_tg(css);
+
+	mutex_lock(&numa_mutex);
+	ret = update_tg_numa_group(tg, numa_group);
+	mutex_unlock(&numa_mutex);
+
+	return ret;
+}
 #endif

 static struct cftype cpu_legacy_files[] = {
@@ -7364,6 +7392,11 @@ static struct cftype cpu_legacy_files[] = {
		.name = "numa_stat",
		.seq_show = cpu_numa_stat_show,
	},
+	{
+		.name = "numa_group",
+		.write_s64 = cpu_numa_group_write_s64,
+		.seq_show = cpu_numa_group_show,
+	},
 #endif
	{ }	/* Terminate */
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2c362266af76..c28ba040a563 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1073,6 +1073,7 @@ struct numa_group {
	int nr_tasks;
	pid_t gid;
	int active_nodes;
+	bool evacuate;

	struct rcu_head rcu;
	unsigned long total_faults;
@@ -2246,6 +2247,176 @@ static inline void put_numa_group(struct numa_group *grp)
		kfree_rcu(grp, rcu);
 }

+void show_tg_numa_group(struct task_group *tg, struct seq_file *sf)
+{
+	int nid;
+	struct numa_group *ng = tg->numa_group;
+
+	if (!ng) {
+		seq_puts(sf, "disabled\n");
+		return;
+	}
+
+	seq_printf(sf, "id %d nr_tasks %d active_nodes %d\n",
+		   ng->gid, ng->nr_tasks, ng->active_nodes);
+
+	for_each_online_node(nid) {
+		int f_idx = task_faults_idx(NUMA_MEM, nid, 0);
+		int pf_idx = task_faults_idx(NUMA_MEM, nid, 1);
+
+		seq_printf(sf, "node %d ", nid);
+
+		seq_printf(sf, "mem_private %lu mem_shared %lu ",
+			   ng->faults[f_idx], ng->faults[pf_idx]);
+
+		seq_printf(sf, "cpu_private %lu cpu_shared %lu\n",
+			   ng->faults_cpu[f_idx], ng->faults_cpu[pf_idx]);
+	}
+}
+
+int update_tg_numa_group(struct task_group *tg, bool numa_group)
+{
+	struct numa_group *ng = tg->numa_group;
+
+	/* if no change then do nothing */
+	if ((ng != NULL) == numa_group)
+		return 0;
+
+	if (ng) {
+		/* put and evacuate tg's numa group */
+		rcu_assign_pointer(tg->numa_group, NULL);
+		ng->evacuate = true;
+		put_numa_group(ng);
+	} else {
+		unsigned int size = sizeof(struct numa_group) +
+				    4*nr_node_ids*sizeof(unsigned long);
+
+		ng = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+		if (!ng)
+			return -ENOMEM;
+
+		refcount_set(&ng->refcount, 1);
+		spin_lock_init(&ng->lock);
+		ng->faults_cpu = ng->faults + NR_NUMA_HINT_FAULT_TYPES *
+				 nr_node_ids;
+		/* now make tasks see and join */
+		rcu_assign_pointer(tg->numa_group, ng);
+	}
+
+	return 0;
+}
+
+static bool tg_numa_group(struct task_struct *p)
+{
+	int i;
+	struct task_group *tg;
+	struct numa_group *grp, *my_grp;
+
+	rcu_read_lock();
+
+	tg = task_group(p);
+	if (!tg)
+		goto no_join;
+
+	grp = rcu_dereference(tg->numa_group);
+	my_grp = rcu_dereference(p->numa_group);
+
+	if (!grp)
+		goto no_join;
+
+	if (grp == my_grp) {
+		if (!grp->evacuate)
+			goto joined;
+
+		/*
+		 * Evacuate task from tg's numa group
+		 */
+		rcu_read_unlock();
+
+		spin_lock_irq(&grp->lock);
+
+		for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
+			grp->faults[i] -= p->numa_faults[i];
+ grp->total_faults -= p->total_numa_faults; + grp->nr_tasks--; + + spin_unlock_irq(&grp->lock); + + rcu_assign_pointer(p->numa_group, NULL); + + put_numa_group(grp); + + return false; + } + + if (!get_numa_group(grp)) + goto no_join; + + rcu_read_unlock(); + + /* + * Just join tg's numa group + */ + if (!my_grp) { + spin_lock_irq(&grp->lock); + + if (refcount_read(&grp->refcount) == 2) { + grp->gid = p->pid; + grp->active_nodes = 1; + grp->max_faults_cpu = 0; + } + + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) + grp->faults[i] += p->numa_faults[i]; + + grp->total_faults += p->total_numa_faults; + grp->nr_tasks++; + + spin_unlock_irq(&grp->lock); + rcu_assign_pointer(p->numa_group, grp); + + return true; + } + + /* + * Switch from the task's numa group to the tg's + */ + double_lock_irq(&my_grp->lock, &grp->lock); + + if (refcount_read(&grp->refcount) == 2) { + grp->gid = p->pid; + grp->active_nodes = 1; + grp->max_faults_cpu = 0; + } + + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) { + my_grp->faults[i] -= p->numa_faults[i]; + grp->faults[i] += p->numa_faults[i]; + } + + my_grp->total_faults -= p->total_numa_faults; + grp->total_faults += p->total_numa_faults; + + my_grp->nr_tasks--; + grp->nr_tasks++; + + spin_unlock(&my_grp->lock); + spin_unlock_irq(&grp->lock); + + rcu_assign_pointer(p->numa_group, grp); + + put_numa_group(my_grp); + return true; + +joined: + rcu_read_unlock(); + return true; +no_join: + rcu_read_unlock(); + return false; +} + static void task_numa_group(struct task_struct *p, int cpupid, int flags, int *priv) { @@ -2416,7 +2587,9 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) priv = 1; } else { priv = cpupid_match_pid(p, last_cpupid); - if (!priv && !(flags & TNF_NO_GROUP)) + if (tg_numa_group(p)) + priv = (flags & TNF_SHARED) ? 
0 : priv; + else if (!priv && !(flags & TNF_NO_GROUP)) task_numa_group(p, last_cpupid, flags, &priv); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 456f83f7f595..23e4a62cd37b 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -408,6 +408,7 @@ struct task_group { #ifdef CONFIG_NUMA_BALANCING struct numa_stat __percpu *numa_stat; + void *numa_group; #endif }; @@ -1316,11 +1317,21 @@ extern int migrate_task_to(struct task_struct *p, int cpu); extern int migrate_swap(struct task_struct *p, struct task_struct *t, int cpu, int scpu); extern void init_numa_balancing(unsigned long clone_flags, struct task_struct *p); +extern void show_tg_numa_group(struct task_group *tg, struct seq_file *sf); +extern int update_tg_numa_group(struct task_group *tg, bool numa_group); #else static inline void init_numa_balancing(unsigned long clone_flags, struct task_struct *p) { } +static inline void +show_tg_numa_group(struct task_group *tg, struct seq_file *sf) +{ +} +update_tg_numa_group(struct task_group *tg, bool numa_group) +{ + return 0; +} #endif /* CONFIG_NUMA_BALANCING */ #ifdef CONFIG_SMP