From patchwork Wed Oct 4 16:18:50 2017
X-Patchwork-Submitter: Peter Zijlstra
X-Patchwork-Id: 9985135
Date: Wed, 4 Oct 2017 18:18:50 +0200
From: Peter Zijlstra
To: Matt Fleming
Cc: Rik van Riel, Eric Farman, ?????????,
    LKML, Ingo Molnar, Christian Borntraeger, KVM-ML (kvm@vger.kernel.org),
    vcaputo@pengaru.com, Matthew Rosato
Subject: Re: sysbench throughput degradation in 4.13+
Message-ID: <20171004161850.wgnu73dokpjfyfdk@hirez.programming.kicks-ass.net>
References: <95edafb1-5e9d-8461-db73-bcb002b7ebef@linux.vnet.ibm.com>
 <50a279d3-84eb-3403-f2f0-854934778037@linux.vnet.ibm.com>
 <20170922155348.zujigkn3o5eylctn@hirez.programming.kicks-ass.net>
 <754f5a9f-5332-148d-2631-918fc7a7cfe9@linux.vnet.ibm.com>
 <20170927093530.s3sgdz2vamc5ka4w@hirez.programming.kicks-ass.net>
 <20170927135820.61cd077f@cuia.usersys.redhat.com>
 <20171002225312.GA24578@codeblueprint.co.uk>
 <20171003083932.3qa7jw2spmi5n5pg@hirez.programming.kicks-ass.net>
In-Reply-To: <20171003083932.3qa7jw2spmi5n5pg@hirez.programming.kicks-ass.net>

On Tue, Oct 03, 2017 at 10:39:32AM +0200, Peter Zijlstra wrote:
> So I was waiting for Rik, who promised to run a bunch of NUMA workloads
> over the weekend.
>
> The trivial thing regresses a wee bit on the overloaded case, I've not
> yet tried to fix it.

WA_IDLE is my 'old' patch and what you all tested; WA_WEIGHT is the
addition -- based on the old scheme -- that I've tried in order to lift
the overloaded case (including hackbench).

It's not an unconditional win, but I'm tempted to enable WA_WEIGHT by
default too (I've not done NO_WA_IDLE && WA_WEIGHT runs). But let me
first write a changelog for the patch below and queue that; then we can
run more things. (A simplified sketch of how the two checks combine
follows the patch at the end of this message.)

On my IVB-EP (2 nodes, 10 cores/node, 2 threads/core):

WA_IDLE && NO_WA_WEIGHT:

 Performance counter stats for 'perf bench sched messaging -g 20 -t -l 10000' (10 runs):

       7.391856936 seconds time elapsed  ( +- 0.66% )

[ ok ] Starting network benchmark server.
TCP_SENDFILE-1  : Avg: 54524.6
TCP_SENDFILE-10 : Avg: 48185.2
TCP_SENDFILE-20 : Avg: 29031.2
TCP_SENDFILE-40 : Avg: 9819.72
TCP_SENDFILE-80 : Avg: 5355.3
TCP_STREAM-1    : Avg: 41448.3
TCP_STREAM-10   : Avg: 24123.2
TCP_STREAM-20   : Avg: 15834.5
TCP_STREAM-40   : Avg: 5583.91
TCP_STREAM-80   : Avg: 2329.66
TCP_RR-1        : Avg: 80473.5
TCP_RR-10       : Avg: 72660.5
TCP_RR-20       : Avg: 52607.1
TCP_RR-40       : Avg: 57199.2
TCP_RR-80       : Avg: 25330.3
UDP_RR-1        : Avg: 108266
UDP_RR-10       : Avg: 95480
UDP_RR-20       : Avg: 68770.8
UDP_RR-40       : Avg: 76231
UDP_RR-80       : Avg: 34578.3
UDP_STREAM-1    : Avg: 64684.3
UDP_STREAM-10   : Avg: 52701.2
UDP_STREAM-20   : Avg: 30376.4
UDP_STREAM-40   : Avg: 15685.8
UDP_STREAM-80   : Avg: 8415.13
[ ok ] Stopping network benchmark server.

[ ok ] Starting MySQL database server: mysqld.
 2: [30 secs] transactions: 64057  (2135.17 per sec.)
 5: [30 secs] transactions: 144295 (4809.68 per sec.)
10: [30 secs] transactions: 274768 (9158.59 per sec.)
20: [30 secs] transactions: 437140 (14570.70 per sec.)
40: [30 secs] transactions: 663949 (22130.56 per sec.)
80: [30 secs] transactions: 629927 (20995.56 per sec.)
[ ok ] Stopping MySQL database server: mysqld.

[ ok ] Starting PostgreSQL 9.4 database server: main.
 2: [30 secs] transactions: 50389  (1679.58 per sec.)
 5: [30 secs] transactions: 113934 (3797.69 per sec.)
10: [30 secs] transactions: 217606 (7253.22 per sec.)
20: [30 secs] transactions: 335021 (11166.75 per sec.)
40: [30 secs] transactions: 518355 (17277.28 per sec.)
80: [30 secs] transactions: 513424 (17112.44 per sec.)
[ ok ] Stopping PostgreSQL 9.4 database server: main.

Latency percentiles (usec)
        50.0000th: 2
        75.0000th: 3
        90.0000th: 3
        95.0000th: 3
        *99.0000th: 3
        99.5000th: 3
        99.9000th: 4
        min=0, max=86
avg worker transfer: 190227.78 ops/sec 743.08KB/s

rps: 1004.94 p95 (usec) 6136 p99 (usec) 6152 p95/cputime 20.45% p99/cputime 20.51%
rps: 1052.58 p95 (usec) 7208 p99 (usec) 7224 p95/cputime 24.03% p99/cputime 24.08%
rps: 1076.40 p95 (usec) 7720 p99 (usec) 7736 p95/cputime 25.73% p99/cputime 25.79%
rps: 1100.27 p95 (usec) 8208 p99 (usec) 8208 p95/cputime 27.36% p99/cputime 27.36%
rps: 1147.96 p95 (usec) 9104 p99 (usec) 9136 p95/cputime 30.35% p99/cputime 30.45%
rps: 1171.78 p95 (usec) 9552 p99 (usec) 9552 p95/cputime 31.84% p99/cputime 31.84%
rps: 1220.04 p95 (usec) 12336 p99 (usec) 12336 p95/cputime 41.12% p99/cputime 41.12%
rps: 1243.82 p95 (usec) 14960 p99 (usec) 14992 p95/cputime 49.87% p99/cputime 49.97%
rps: 1243.88 p95 (usec) 14960 p99 (usec) 14992 p95/cputime 49.87% p99/cputime 49.97%
rps: 1266.39 p95 (usec) 227584 p99 (usec) 239360 p95/cputime 758.61% p99/cputime 797.87%

Latency percentiles (usec)
        50.0000th: 62
        75.0000th: 101
        90.0000th: 108
        95.0000th: 112
        *99.0000th: 119
        99.5000th: 124
        99.9000th: 4920
        min=0, max=12987

Throughput 664.328 MB/sec  2 clients  2 procs  max_latency=0.076 ms
Throughput 1573.72 MB/sec  5 clients  5 procs  max_latency=0.102 ms
Throughput 2948.7 MB/sec  10 clients  10 procs  max_latency=0.198 ms
Throughput 4602.38 MB/sec  20 clients  20 procs  max_latency=1.712 ms
Throughput 9253.17 MB/sec  40 clients  40 procs  max_latency=2.047 ms
Throughput 8056.01 MB/sec  80 clients  80 procs  max_latency=35.819 ms

-----------------------

WA_IDLE && WA_WEIGHT:

 Performance counter stats for 'perf bench sched messaging -g 20 -t -l 10000' (10 runs):

       6.500797532 seconds time elapsed  ( +- 0.97% )

[ ok ] Starting network benchmark server.
TCP_SENDFILE-1  : Avg: 52224.3
TCP_SENDFILE-10 : Avg: 46504.3
TCP_SENDFILE-20 : Avg: 28610.3
TCP_SENDFILE-40 : Avg: 9253.12
TCP_SENDFILE-80 : Avg: 4687.4
TCP_STREAM-1    : Avg: 42254
TCP_STREAM-10   : Avg: 25847.9
TCP_STREAM-20   : Avg: 18374.4
TCP_STREAM-40   : Avg: 5599.57
TCP_STREAM-80   : Avg: 2726.41
TCP_RR-1        : Avg: 82638.8
TCP_RR-10       : Avg: 73265.1
TCP_RR-20       : Avg: 52634.5
TCP_RR-40       : Avg: 56302.3
TCP_RR-80       : Avg: 26867.9
UDP_RR-1        : Avg: 107844
UDP_RR-10       : Avg: 95245.2
UDP_RR-20       : Avg: 68673.7
UDP_RR-40       : Avg: 75419.1
UDP_RR-80       : Avg: 35639.1
UDP_STREAM-1    : Avg: 66606
UDP_STREAM-10   : Avg: 52959.5
UDP_STREAM-20   : Avg: 29704
UDP_STREAM-40   : Avg: 15266.5
UDP_STREAM-80   : Avg: 7388.97
[ ok ] Stopping network benchmark server.

[ ok ] Starting MySQL database server: mysqld.
 2: [30 secs] transactions: 64277  (2142.51 per sec.)
 5: [30 secs] transactions: 144010 (4800.19 per sec.)
10: [30 secs] transactions: 274722 (9157.05 per sec.)
20: [30 secs] transactions: 436325 (14543.55 per sec.)
40: [30 secs] transactions: 665582 (22184.82 per sec.)
80: [30 secs] transactions: 657185 (21904.18 per sec.)
[ ok ] Stopping MySQL database server: mysqld.

[ ok ] Starting PostgreSQL 9.4 database server: main.
 2: [30 secs] transactions: 51153  (1705.06 per sec.)
 5: [30 secs] transactions: 116403 (3879.93 per sec.)
10: [30 secs] transactions: 217750 (7258.06 per sec.)
20: [30 secs] transactions: 336619 (11220.00 per sec.)
40: [30 secs] transactions: 520823 (17359.78 per sec.)
80: [30 secs] transactions: 516690 (17221.16 per sec.)
[ ok ] Stopping PostgreSQL 9.4 database server: main.
Latency percentiles (usec)
        50.0000th: 3
        75.0000th: 3
        90.0000th: 3
        95.0000th: 3
        *99.0000th: 3
        99.5000th: 3
        99.9000th: 5
        min=0, max=86
avg worker transfer: 185378.92 ops/sec 724.14KB/s

rps: 1004.82 p95 (usec) 6136 p99 (usec) 6152 p95/cputime 20.45% p99/cputime 20.51%
rps: 1052.51 p95 (usec) 7208 p99 (usec) 7224 p95/cputime 24.03% p99/cputime 24.08%
rps: 1076.38 p95 (usec) 7720 p99 (usec) 7736 p95/cputime 25.73% p99/cputime 25.79%
rps: 1100.23 p95 (usec) 8208 p99 (usec) 8208 p95/cputime 27.36% p99/cputime 27.36%
rps: 1147.89 p95 (usec) 9104 p99 (usec) 9136 p95/cputime 30.35% p99/cputime 30.45%
rps: 1171.73 p95 (usec) 9520 p99 (usec) 9552 p95/cputime 31.73% p99/cputime 31.84%
rps: 1220.05 p95 (usec) 12336 p99 (usec) 12336 p95/cputime 41.12% p99/cputime 41.12%
rps: 1243.85 p95 (usec) 14960 p99 (usec) 14960 p95/cputime 49.87% p99/cputime 49.87%
rps: 1243.86 p95 (usec) 14960 p99 (usec) 14992 p95/cputime 49.87% p99/cputime 49.97%
rps: 1266.39 p95 (usec) 213760 p99 (usec) 225024 p95/cputime 712.53% p99/cputime 750.08%

Latency percentiles (usec)
        50.0000th: 66
        75.0000th: 101
        90.0000th: 107
        95.0000th: 112
        *99.0000th: 120
        99.5000th: 126
        99.9000th: 390
        min=0, max=12964

Throughput 678.413 MB/sec  2 clients  2 procs  max_latency=0.105 ms
Throughput 1589.98 MB/sec  5 clients  5 procs  max_latency=0.084 ms
Throughput 3012.51 MB/sec  10 clients  10 procs  max_latency=0.262 ms
Throughput 4555.93 MB/sec  20 clients  20 procs  max_latency=0.515 ms
Throughput 8496.23 MB/sec  40 clients  40 procs  max_latency=2.040 ms
Throughput 8601.62 MB/sec  80 clients  80 procs  max_latency=2.712 ms

---
 include/linux/sched/topology.h |   8 ---
 kernel/sched/fair.c            | 131 ++++++++++++-----------------------------
 kernel/sched/features.h        |   2 +
 3 files changed, 39 insertions(+), 102 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index d7b6dab956ec..7d065abc7a47 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -71,14 +71,6 @@ struct sched_domain_shared {
         atomic_t        ref;
         atomic_t        nr_busy_cpus;
         int             has_idle_cores;
-
-        /*
-         * Some variables from the most recent sd_lb_stats for this domain,
-         * used by wake_affine().
-         */
-        unsigned long   nr_running;
-        unsigned long   load;
-        unsigned long   capacity;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 350dbec01523..a1a6b6f52660 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5638,91 +5638,60 @@ static int wake_wide(struct task_struct *p)
         return 1;
 }
 
-struct llc_stats {
-        unsigned long   nr_running;
-        unsigned long   load;
-        unsigned long   capacity;
-        int             has_capacity;
-};
+/*
+ * The purpose of wake_affine() is to quickly determine on which CPU we can run
+ * soonest. For the purpose of speed we only consider the waking and previous
+ * CPU.
+ *
+ * wake_affine_idle() - only considers 'now', it checks if the waking CPU is
+ *                      (or will be) idle.
+ *
+ * wake_affine_weight() - considers the weight to reflect the average
+ *                        scheduling latency of the CPUs. This seems to work
+ *                        for the overloaded case.
+ */
 
-static bool get_llc_stats(struct llc_stats *stats, int cpu)
+static bool
+wake_affine_idle(struct sched_domain *sd, struct task_struct *p,
+                 int this_cpu, int prev_cpu, int sync)
 {
-        struct sched_domain_shared *sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-
-        if (!sds)
-                return false;
+        if (idle_cpu(this_cpu))
+                return true;
 
-        stats->nr_running       = READ_ONCE(sds->nr_running);
-        stats->load             = READ_ONCE(sds->load);
-        stats->capacity         = READ_ONCE(sds->capacity);
-        stats->has_capacity     = stats->nr_running < per_cpu(sd_llc_size, cpu);
+        if (sync && cpu_rq(this_cpu)->nr_running == 1)
+                return true;
 
-        return true;
+        return false;
 }
 
-/*
- * Can a task be moved from prev_cpu to this_cpu without causing a load
- * imbalance that would trigger the load balancer?
- *
- * Since we're running on 'stale' values, we might in fact create an imbalance
- * but recomputing these values is expensive, as that'd mean iteration 2 cache
- * domains worth of CPUs.
- */
 static bool
-wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
-                int this_cpu, int prev_cpu, int sync)
+wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
+                   int this_cpu, int prev_cpu, int sync)
 {
-        struct llc_stats prev_stats, this_stats;
         s64 this_eff_load, prev_eff_load;
         unsigned long task_load;
 
-        if (!get_llc_stats(&prev_stats, prev_cpu) ||
-            !get_llc_stats(&this_stats, this_cpu))
-                return false;
+        this_eff_load = target_load(this_cpu, sd->wake_idx);
+        prev_eff_load = source_load(prev_cpu, sd->wake_idx);
 
-        /*
-         * If sync wakeup then subtract the (maximum possible)
-         * effect of the currently running task from the load
-         * of the current LLC.
-         */
         if (sync) {
                 unsigned long current_load = task_h_load(current);
 
-                /* in this case load hits 0 and this LLC is considered 'idle' */
-                if (current_load > this_stats.load)
+                if (current_load > this_eff_load)
                         return true;
 
-                this_stats.load -= current_load;
+                this_eff_load -= current_load;
         }
 
-        /*
-         * The has_capacity stuff is not SMT aware, but by trying to balance
-         * the nr_running on both ends we try and fill the domain at equal
-         * rates, thereby first consuming cores before siblings.
-         */
-
-        /* if the old cache has capacity, stay there */
-        if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
-                return false;
-
-        /* if this cache has capacity, come here */
-        if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
-                return true;
-
-        /*
-         * Check to see if we can move the load without causing too much
-         * imbalance.
-         */
         task_load = task_h_load(p);
 
-        this_eff_load = 100;
-        this_eff_load *= prev_stats.capacity;
-
-        prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
-        prev_eff_load *= this_stats.capacity;
+        this_eff_load += task_load;
+        this_eff_load *= 100;
+        this_eff_load *= capacity_of(prev_cpu);
 
-        this_eff_load *= this_stats.load + task_load;
-        prev_eff_load *= prev_stats.load - task_load;
+        prev_eff_load -= task_load;
+        prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
+        prev_eff_load *= capacity_of(this_cpu);
 
         return this_eff_load <= prev_eff_load;
 }
 
@@ -5731,22 +5700,13 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
                        int prev_cpu, int sync)
 {
         int this_cpu = smp_processor_id();
-        bool affine;
+        bool affine = false;
 
-        /*
-         * Default to no affine wakeups; wake_affine() should not effect a task
-         * placement the load-balancer feels inclined to undo. The conservative
-         * option is therefore to not move tasks when they wake up.
-         */
-        affine = false;
+        if (sched_feat(WA_IDLE) && !affine)
+                affine = wake_affine_idle(sd, p, this_cpu, prev_cpu, sync);
 
-        /*
-         * If the wakeup is across cache domains, try to evaluate if movement
-         * makes sense, otherwise rely on select_idle_siblings() to do
-         * placement inside the cache domain.
-         */
-        if (!cpus_share_cache(prev_cpu, this_cpu))
-                affine = wake_affine_llc(sd, p, this_cpu, prev_cpu, sync);
+        if (sched_feat(WA_WEIGHT) && !affine)
+                affine = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
 
         schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
         if (affine) {
@@ -7895,7 +7855,6 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
  */
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
-        struct sched_domain_shared *shared = env->sd->shared;
         struct sched_domain *child = env->sd->child;
         struct sched_group *sg = env->sd->groups;
         struct sg_lb_stats *local = &sds->local_stat;
@@ -7967,22 +7926,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
                 if (env->dst_rq->rd->overload != overload)
                         env->dst_rq->rd->overload = overload;
         }
-
-        if (!shared)
-                return;
-
-        /*
-         * Since these are sums over groups they can contain some CPUs
-         * multiple times for the NUMA domains.
-         *
-         * Currently only wake_affine_llc() and find_busiest_group()
-         * uses these numbers, only the last is affected by this problem.
-         *
-         * XXX fix that.
-         */
-        WRITE_ONCE(shared->nr_running,  sds->total_running);
-        WRITE_ONCE(shared->load,        sds->total_load);
-        WRITE_ONCE(shared->capacity,    sds->total_capacity);
 }
 
 /**
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index d3fb15555291..d40d33ec935f 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -81,3 +81,5 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
+SCHED_FEAT(WA_IDLE, true)
+SCHED_FEAT(WA_WEIGHT, false)
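
To make the decision flow above concrete, here is a small user-space
sketch in plain C -- an illustration, not the kernel code: wake_affine()
now defaults to a non-affine wakeup, consults wake_affine_idle() when
WA_IDLE is set, and falls through to wake_affine_weight() only when
WA_WEIGHT is set and the idle check declined. The cpu_state struct, the
helper signatures and the runqueue counts in main() are invented for the
example. (On kernels built with CONFIG_SCHED_DEBUG, the real knobs can be
flipped at runtime by writing WA_WEIGHT or NO_WA_WEIGHT to
/sys/kernel/debug/sched_features.)

/* wa_flow.c -- illustrative sketch only; types and numbers are made up. */
#include <stdbool.h>
#include <stdio.h>

struct cpu_state {
        int nr_running;                 /* tasks on this CPU's runqueue */
};

static bool wa_idle   = true;           /* mirrors SCHED_FEAT(WA_IDLE, true)    */
static bool wa_weight = false;          /* mirrors SCHED_FEAT(WA_WEIGHT, false) */

/*
 * WA_IDLE only looks at 'now': affine if the waking CPU is idle, or if it
 * is a sync wakeup and the waker is the only runnable task (about to sleep).
 */
static bool wake_affine_idle(const struct cpu_state *this_cpu, int sync)
{
        if (this_cpu->nr_running == 0)
                return true;

        if (sync && this_cpu->nr_running == 1)
                return true;

        return false;
}

/* Load-based check, stubbed out here; see the next example for its math. */
static bool wake_affine_weight(const struct cpu_state *this_cpu,
                               const struct cpu_state *prev_cpu, int sync)
{
        (void)this_cpu; (void)prev_cpu; (void)sync;
        return false;
}

static bool wake_affine(const struct cpu_state *this_cpu,
                        const struct cpu_state *prev_cpu, int sync)
{
        bool affine = false;

        if (wa_idle && !affine)
                affine = wake_affine_idle(this_cpu, sync);

        if (wa_weight && !affine)
                affine = wake_affine_weight(this_cpu, prev_cpu, sync);

        return affine;
}

int main(void)
{
        struct cpu_state waking = { .nr_running = 1 };  /* just the waker */
        struct cpu_state prev   = { .nr_running = 4 };

        /* sync: the waker is about to sleep, so its CPU counts as idle */
        printf("sync wakeup    -> affine: %d\n", wake_affine(&waking, &prev, 1));
        printf("regular wakeup -> affine: %d\n", wake_affine(&waking, &prev, 0));
        return 0;
}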
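
And the wake_affine_weight() arithmetic worked through with made-up
numbers -- again a self-contained sketch, where the load/capacity values
and the imbalance_pct of 125 are assumptions chosen purely for
illustration (the kernel feeds in source_load()/target_load() and
capacity_of() results):

/* wa_weight.c -- the wake_affine_weight() comparison with made-up inputs. */
#include <stdbool.h>
#include <stdio.h>

/*
 * Affine iff
 *   (this_load + task_load) * 100 * prev_capacity
 *     <= (prev_load - task_load) * (100 + (imbalance_pct - 100) / 2) * this_capacity
 */
static bool wake_affine_weight(long long this_load, long long prev_load,
                               long long task_load, long long current_load,
                               long long this_capacity, long long prev_capacity,
                               int imbalance_pct, int sync)
{
        long long this_eff_load = this_load;
        long long prev_eff_load = prev_load;

        if (sync) {
                /* discount the waker: it is about to go to sleep */
                if (current_load > this_eff_load)
                        return true;
                this_eff_load -= current_load;
        }

        this_eff_load += task_load;
        this_eff_load *= 100;
        this_eff_load *= prev_capacity;

        prev_eff_load -= task_load;
        prev_eff_load *= 100 + (imbalance_pct - 100) / 2;
        prev_eff_load *= this_capacity;

        return this_eff_load <= prev_eff_load;
}

int main(void)
{
        /*
         * Waking CPU load 400, previous CPU load 900, task load 100, equal
         * capacities, imbalance_pct 125 (a common sched-domain default):
         *
         *   this_eff = (400 + 100) * 100 * 1024 = 51,200,000
         *   prev_eff = (900 - 100) * 112 * 1024 = 91,750,400
         *
         * this_eff <= prev_eff, so the wakeup is affine.
         */
        printf("affine: %d\n",
               wake_affine_weight(400, 900, 100, 0, 1024, 1024, 125, 0));
        return 0;
}

Note that only the previous CPU's side gets the (halved) imbalance_pct
weighting, so a wakeup can still be treated as affine when it leaves the
waking CPU somewhat busier than the previous one.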