From patchwork Thu Apr 6 08:16:47 2017
X-Patchwork-Submitter: Dario Faggioli
X-Patchwork-Id: 9666337
From: Dario Faggioli
To: xen-devel@lists.xenproject.org
Cc: George Dunlap, Andrew Cooper, Anshul Makkar
Date: Thu, 06 Apr 2017 10:16:47 +0200
Message-ID: <149146660655.21348.13071925386790562321.stgit@Solace.fritz.box>
In-Reply-To: <149146456487.21348.8554211499146017782.stgit@Solace.fritz.box>
References: <149146456487.21348.8554211499146017782.stgit@Solace.fritz.box>
User-Agent: StGit/0.17.1-dirty
Subject: [Xen-devel] [PATCH v2 5/7] xen: credit1: increase efficiency and scalability of load balancing.

During load balancing, we check the non-idle pCPUs to see whether they have
runnable but not running vCPUs that can be stolen by, and set to run on,
currently idle pCPUs. If a pCPU has only one running (or runnable) vCPU,
though, we don't want to steal it, so there is no point in bothering with
that pCPU at all (especially considering that bothering means trying to
take its runqueue lock!).

On large systems, when load is only slightly higher than the number of
pCPUs (i.e., there are just a few more active vCPUs than there are pCPUs),
this may mean that:
 - we go through all the pCPUs,
 - for each one, we (try to) take its runqueue lock,
 - we figure out there's actually nothing to be stolen!

To mitigate this, we introduce a counter of the runnable vCPUs on each
pCPU. In fact, unless there are at least 2 runnable vCPUs --typically, one
running, and the others in the runqueue-- it does not make sense to try
stealing anything.

Signed-off-by: Dario Faggioli
---
Cc: George Dunlap
Cc: Andrew Cooper
Cc: Anshul Makkar
---
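Purely as an editorial illustration (not part of the patch, and not Xen
APIs): below is a minimal, self-contained sketch of the mechanism the
commit message describes -- a per-pCPU runnable counter, bumped on enqueue
and dropped on dequeue, consulted (racily, on purpose) by the stealer
before it bothers with a peer's runqueue lock. All names in it are made up;
the real counterpart lives in the hunks further down.

/*
 * Standalone sketch (hypothetical names, not Xen code) of the idea the
 * patch implements: keep a per-pCPU count of runnable vCPUs, and have
 * the load balancer skip any peer whose count says there is nothing
 * worth stealing, before even trying that peer's runqueue lock.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_PCPUS 4

struct pcpu {
    unsigned int nr_runnable;  /* running vCPU + vCPUs queued here */
    bool lock_busy;            /* stand-in for the runqueue trylock */
};

static struct pcpu pcpus[NR_PCPUS];

/* Where the real patch calls inc_nr_runnable(): enqueue/wake paths. */
static void vcpu_enqueued(unsigned int cpu) { pcpus[cpu].nr_runnable++; }

/* Where the real patch calls dec_nr_runnable(): dequeue/sleep paths. */
static void vcpu_dequeued(unsigned int cpu) { pcpus[cpu].nr_runnable--; }

/* One balancing pass, from the point of view of an idle pCPU. */
static int steal_from_someone(void)
{
    for ( unsigned int peer = 0; peer < NR_PCPUS; peer++ )
    {
        /* Racy, unlocked check: the whole point of the optimization.  */
        if ( pcpus[peer].nr_runnable <= 1 )
            continue;                  /* nothing queued beyond what runs */

        if ( pcpus[peer].lock_busy )   /* trylock failed: don't spin      */
            continue;

        pcpus[peer].lock_busy = true;  /* "locked": count is now reliable */
        vcpu_dequeued(peer);           /* actually steal one vCPU         */
        pcpus[peer].lock_busy = false;
        return (int)peer;
    }
    return -1;
}

int main(void)
{
    vcpu_enqueued(1);                  /* pCPU 1: only its running vCPU   */
    vcpu_enqueued(3);                  /* pCPU 3: one running...          */
    vcpu_enqueued(3);                  /* ...plus one waiting in the runq */

    printf("stole from pCPU %d\n", steal_from_someone()); /* expected: 3 */
    return 0;
}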
Changes from v1:
 * don't count the idle vCPU as runnable. This is just cosmetic and not at
   all a logic or functional change wrt v1;
 * don't count inside of __runq_remove() or __runq_insert(), but provide
   specific counting functions, and call them when appropriate. This is
   necessary to avoid a spurious overloaded state being reported basically
   *all* the time a pCPU goes through the scheduler (due to the fact that
   the scheduler calls __runq_insert() on the current vCPU);
 * get rid of the overloaded cpumask and only use the counter. I actually
   did like the cpumask solution better, but for this purpose, it was
   overkill.
---
 xen/common/sched_credit.c | 78 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 72 insertions(+), 6 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 49caa0a..09e3192 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -172,6 +172,7 @@ struct csched_pcpu {
     struct timer ticker;
     unsigned int tick;
     unsigned int idle_bias;
+    unsigned int nr_runnable;
 };
 
 /*
@@ -262,9 +263,26 @@ static inline bool_t is_runq_idle(unsigned int cpu)
 }
 
 static inline void
+inc_nr_runnable(unsigned int cpu)
+{
+    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
+    CSCHED_PCPU(cpu)->nr_runnable++;
+
+}
+
+static inline void
+dec_nr_runnable(unsigned int cpu)
+{
+    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
+    CSCHED_PCPU(cpu)->nr_runnable--;
+    ASSERT(CSCHED_PCPU(cpu)->nr_runnable >= 0);
+}
+
+static inline void
 __runq_insert(struct csched_vcpu *svc)
 {
-    const struct list_head * const runq = RUNQ(svc->vcpu->processor);
+    unsigned int cpu = svc->vcpu->processor;
+    const struct list_head * const runq = RUNQ(cpu);
     struct list_head *iter;
 
     BUG_ON( __vcpu_on_runq(svc) );
@@ -601,6 +619,7 @@ init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
     /* Start off idling... */
     BUG_ON(!is_idle_vcpu(curr_on_cpu(cpu)));
     cpumask_set_cpu(cpu, prv->idlers);
+    spc->nr_runnable = 0;
 }
 
 static void
@@ -1042,7 +1061,10 @@ csched_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
     lock = vcpu_schedule_lock_irq(vc);
 
     if ( !__vcpu_on_runq(svc) && vcpu_runnable(vc) && !vc->is_running )
+    {
         __runq_insert(svc);
+        inc_nr_runnable(vc->processor);
+    }
 
     vcpu_schedule_unlock_irq(lock, vc);
 
@@ -1102,7 +1124,10 @@ csched_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
         cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
     }
     else if ( __vcpu_on_runq(svc) )
+    {
+        dec_nr_runnable(cpu);
         __runq_remove(svc);
+    }
 }
 
 static void
@@ -1163,6 +1188,7 @@ csched_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
 
     /* Put the VCPU on the runq and tickle CPUs */
     __runq_insert(svc);
+    inc_nr_runnable(vc->processor);
     __runq_tickle(svc);
 }
 
@@ -1664,8 +1690,10 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
             SCHED_VCPU_STAT_CRANK(speer, migrate_q);
             SCHED_STAT_CRANK(migrate_queued);
             WARN_ON(vc->is_urgent);
+            dec_nr_runnable(peer_cpu);
             __runq_remove(speer);
             vc->processor = cpu;
+            inc_nr_runnable(cpu);
             return speer;
         }
     }
@@ -1721,7 +1749,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
     peer_node = node;
     do
     {
-        /* Find out what the !idle are in this node */
+        /* Select the pCPUs in this node that have work we can steal. */
         cpumask_andnot(&workers, online, prv->idlers);
         cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
         __cpumask_clear_cpu(cpu, &workers);
@@ -1731,6 +1759,40 @@ csched_load_balance(struct csched_private *prv, int cpu,
             goto next_node;
         do
         {
+            spinlock_t *lock;
+
+            /*
+             * If there is only one runnable vCPU on peer_cpu, it means
+             * there's no one to be stolen in its runqueue, so skip it.
+             *
+             * Checking this without holding the lock is racy... But that's
+             * the whole point of this optimization!
+             *
+             * In more detail:
+             * - if we race with dec_nr_runnable(), we may try to take the
+             *   lock and call csched_runq_steal() for no reason. This is
+             *   not a functional issue, and should be infrequent enough.
+             *   And we can avoid that by re-checking nr_runnable after
+             *   having grabbed the lock, if we want;
+             * - if we race with inc_nr_runnable(), we skip a pCPU that may
+             *   have runnable vCPUs in its runqueue, but that's not a
+             *   problem because:
+             *   + if racing with csched_vcpu_insert() or csched_vcpu_wake(),
+             *     __runq_tickle() will be called afterwards, so the vCPU
+             *     won't get stuck in the runqueue for too long;
+             *   + if racing with csched_runq_steal(), it may be that a
+             *     vCPU that we could have picked up stays in a runqueue
+             *     until someone else tries to steal it again. But this is
+             *     no worse than what can happen already (without this
+             *     optimization), if the pCPU would schedule right after we
+             *     have taken the lock, and hence block on it.
+             */
+            if ( CSCHED_PCPU(peer_cpu)->nr_runnable <= 1 )
+            {
+                TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skipp'n */ 0);
+                goto next_cpu;
+            }
+
             /*
              * Get ahold of the scheduler lock for this peer CPU.
              *
@@ -1738,14 +1800,13 @@ csched_load_balance(struct csched_private *prv, int cpu,
              * could cause a deadlock if the peer CPU is also load
              * balancing and trying to lock this CPU.
              */
-            spinlock_t *lock = pcpu_schedule_trylock(peer_cpu);
+            lock = pcpu_schedule_trylock(peer_cpu);
             SCHED_STAT_CRANK(steal_trylock);
             if ( !lock )
             {
                 SCHED_STAT_CRANK(steal_trylock_failed);
                 TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skipp'n */ 0);
-                peer_cpu = cpumask_cycle(peer_cpu, &workers);
-                continue;
+                goto next_cpu;
             }
 
             TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* checked */ 1);
@@ -1762,6 +1823,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
                 return speer;
             }
 
+ next_cpu:
             peer_cpu = cpumask_cycle(peer_cpu, &workers);
 
         } while( peer_cpu != cpumask_first(&workers) );
@@ -1892,7 +1954,10 @@ csched_schedule(
     if ( vcpu_runnable(current) )
         __runq_insert(scurr);
     else
+    {
         BUG_ON( is_idle_vcpu(current) || list_empty(runq) );
+        dec_nr_runnable(cpu);
+    }
 
     snext = __runq_elem(runq->next);
     ret.migrated = 0;
@@ -2009,7 +2074,8 @@ csched_dump_pcpu(const struct scheduler *ops, int cpu)
     runq = &spc->runq;
 
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_sibling_mask, cpu));
-    printk("CPU[%02d] sort=%d, sibling=%s, ", cpu, spc->runq_sort_last, cpustr);
+    printk("CPU[%02d] nr_run=%d, sort=%d, sibling=%s, ",
+           cpu, spc->nr_runnable, spc->runq_sort_last, cpustr);
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_core_mask, cpu));
     printk("core=%s\n", cpustr);