From patchwork Fri Jun 17 23:13:29 2016
X-Patchwork-Submitter: Dario Faggioli
X-Patchwork-Id: 9185163
From: Dario Faggioli
To: xen-devel@lists.xenproject.org
Date: Sat, 18 Jun 2016 01:13:29 +0200
Message-ID: <146620520979.29766.17431818083809592415.stgit@Solace.fritz.box>
In-Reply-To: <146620492155.29766.10321123657058307698.stgit@Solace.fritz.box>
References: <146620492155.29766.10321123657058307698.stgit@Solace.fritz.box>
User-Agent: StGit/0.17.1-dirty
Cc: Anshul Makkar, George Dunlap, David Vrabel
Subject: [Xen-devel] [PATCH 18/19] xen: credit2: implement SMT support independent runq arrangement

Right now, we recommend keeping runqueues arranged per-core, so that it is
the inter-runqueue load balancing code that automatically spreads the work
in an SMT-friendly way. This means that any other runqueue arrangement one
may want to use falls short of SMT scheduling optimizations.

This commit implements SMT awareness --similar to the one we have in
Credit1-- for any possible runqueue arrangement. This turned out to be
pretty easy to do, as the logic can live entirely in runq_tickle()
(although, in order to avoid for_each_cpu loops in that function, we use a
new cpumask, which needs to be kept updated in a few other places).

In addition to disentangling SMT awareness from load balancing, this also
allows us to support the sched_smt_power_savings parameter in Credit2 as
well.
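To make the spreading-vs-consolidation choice above concrete, here is a small
standalone C sketch (illustrative only, not Xen code: plain 64-bit words stand
in for cpumask_t, and pick_candidates() is a made-up helper) of the mask
arithmetic that runq_tickle() performs with rqd->idle, rqd->smt_idle and the
vcpu's hard affinity:

/*
 * Illustration only (not Xen code): how a per-runqueue "fully idle cores"
 * mask lets the tickling logic honour sched_smt_power_savings.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t pick_candidates(uint64_t idle, uint64_t smt_idle,
                                uint64_t hard_affinity,
                                bool smt_power_savings)
{
    uint64_t mask;

    if ( smt_power_savings )
        /* Consolidate: idle threads on cores that are already partly busy. */
        mask = idle & ~smt_idle;
    else
        /* Spread: pcpus belonging to fully idle cores first. */
        mask = smt_idle;

    return mask & hard_affinity;
}

int main(void)
{
    /* 2 threads per core: core 0 = cpus 0,1 (fully idle); core 1 = cpus 2,3 (cpu 2 busy). */
    uint64_t idle     = 0x0b; /* cpus 0, 1 and 3 are idle     */
    uint64_t smt_idle = 0x03; /* only core 0 is fully idle    */
    uint64_t affinity = 0x0f; /* the vcpu can run on any pcpu */

    printf("spread:      %#llx\n",
           (unsigned long long)pick_candidates(idle, smt_idle, affinity, false));
    printf("consolidate: %#llx\n",
           (unsigned long long)pick_candidates(idle, smt_idle, affinity, true));

    return 0;
}

With two threads per core, core 0 fully idle and cpu 2 busy, spreading selects
core 0's threads (0x3), while consolidating selects only cpu 3 (0x8), i.e. the
idle sibling of the already busy core.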
Signed-off-by: Dario Faggioli
Reviewed-by: Anshul Makkar
---
Cc: George Dunlap
Cc: Anshul Makkar
Cc: David Vrabel
---
 xen/common/sched_credit2.c | 141 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 126 insertions(+), 15 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 93943fa..a8b3a85 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -351,7 +351,8 @@ struct csched2_runqueue_data {
     unsigned int max_weight;
 
     cpumask_t idle,        /* Currently idle */
-        tickled;           /* Another cpu in the queue is already targeted for this one */
+        smt_idle,          /* Fully idle cores (as in all the siblings are idle) */
+        tickled;           /* Have been asked to go through schedule */
     int load;              /* Instantaneous load: Length of queue + num non-idle threads */
     s_time_t load_last_update;  /* Last time average was updated */
     s_time_t avgload;      /* Decaying queue load */
@@ -412,6 +413,73 @@ struct csched2_dom {
 };
 
 /*
+ * Hyperthreading (SMT) support.
+ *
+ * We use a special per-runq mask (smt_idle) and update it according to the
+ * following logic:
+ *  - when _all_ the SMT siblings in a core are idle, all their corresponding
+ *    bits are set in the smt_idle mask;
+ *  - when even _just_one_ of the SMT siblings in a core is not idle, all the
+ *    bits corresponding to it and to all its siblings are clear in the
+ *    smt_idle mask.
+ *
+ * Once we have such a mask, it is easy to implement a policy that, either:
+ *  - uses fully idle cores first: it is enough to try to schedule the vcpus
+ *    on pcpus from smt_idle mask first. This is what happens if
+ *    sched_smt_power_savings was not set at boot (default), and it maximizes
+ *    true parallelism, and hence performance;
+ *  - uses already busy cores first: it is enough to try to schedule the vcpus
+ *    on pcpus that are idle, but are not in smt_idle. This is what happens if
+ *    sched_smt_power_savings is set at boot, and it allows as many cores as
+ *    possible to stay in low power states, minimizing power consumption.
+ *
+ * This logic is entirely implemented in runq_tickle(), and that is enough.
+ * In fact, in this scheduler, placement of a vcpu on one of the pcpus of a
+ * runq _always_ happens by means of tickling:
+ *  - when a vcpu wakes up, it calls csched2_vcpu_wake(), which calls
+ *    runq_tickle();
+ *  - when a migration is initiated in schedule.c, we call csched2_cpu_pick(),
+ *    csched2_vcpu_migrate() (which calls migrate()) and csched2_vcpu_wake().
+ *    csched2_cpu_pick() looks for the least loaded runq and returns just any
+ *    of its processors. Then, csched2_vcpu_migrate() just moves the vcpu to
+ *    the chosen runq, and it is again runq_tickle(), called by
+ *    csched2_vcpu_wake(), that actually decides what pcpu to use within the
+ *    chosen runq;
+ *  - when a migration is initiated in sched_credit2.c, by calling migrate()
+ *    directly, that again temporarily uses a random pcpu from the new runq,
+ *    and then calls runq_tickle(), by itself.
+ */
+
+/*
+ * If all the siblings of cpu (including cpu itself) are in idlers,
+ * set all their bits in mask.
+ *
+ * In order to properly take tickling into account, idlers needs to be
+ * set equal to something like:
+ *
+ *  rqd->idle & (~rqd->tickled)
+ *
+ * This is because cpus that have been tickled will very likely pick up some
+ * work as soon as they manage to schedule, and hence we should really consider
+ * them as busy.
+ */
+static inline
+void smt_idle_mask_set(unsigned int cpu, cpumask_t *idlers, cpumask_t *mask)
+{
+    if ( cpumask_subset(per_cpu(cpu_sibling_mask, cpu), idlers) )
+        cpumask_or(mask, mask, per_cpu(cpu_sibling_mask, cpu));
+}
+
+/*
+ * Clear the bits of all the siblings of cpu from mask.
+ */
+static inline
+void smt_idle_mask_clear(unsigned int cpu, cpumask_t *mask)
+{
+    cpumask_andnot(mask, mask, per_cpu(cpu_sibling_mask, cpu));
+}
+
+/*
  * When a hard affinity change occurs, we may not be able to check some
  * (any!) of the other runqueues, when looking for the best new processor
  * for svc (as trylock-s in csched2_cpu_pick() can fail). If that happens, we
@@ -851,9 +919,30 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
     }
 
     /*
-     * Get a mask of idle, but not tickled, processors that new is
-     * allowed to run on. If that's not empty, choose someone from there
-     * (preferrably, the one were new was running on already).
+     * First of all, consider idle cpus, checking if we can just
+     * re-use the pcpu where we were running before.
+     *
+     * If there are cores where all the siblings are idle, consider
+     * them first, honoring whatever the spreading-vs-consolidation
+     * SMT policy wants us to do.
+     */
+    if ( unlikely(sched_smt_power_savings) )
+        cpumask_andnot(&mask, &rqd->idle, &rqd->smt_idle);
+    else
+        cpumask_copy(&mask, &rqd->smt_idle);
+    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
+    i = cpumask_test_or_cycle(cpu, &mask);
+    if ( i < nr_cpu_ids )
+    {
+        SCHED_STAT_CRANK(tickled_idle_cpu);
+        ipid = i;
+        goto tickle;
+    }
+
+    /*
+     * If there are no fully idle cores, check all idlers, after
+     * having filtered out pcpus that have been tickled but haven't
+     * gone through the scheduler yet.
      */
     cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
     cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
@@ -945,6 +1034,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
                     (unsigned char *)&d);
     }
 
     __cpumask_set_cpu(ipid, &rqd->tickled);
+    //smt_idle_mask_clear(ipid, &rqd->smt_idle); XXX
     cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
 }
 
@@ -1435,13 +1525,15 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     if ( !read_trylock(&prv->lock) )
     {
-        /* We may be here because someon requested us to migrate */
+        /* We may be here because someone requested us to migrate */
        __clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
         return get_fallback_cpu(svc);
     }
 
-    /* First check to see if we're here because someone else suggested a place
-     * for us to move. */
+    /*
+     * First check to see if we're here because someone else suggested a place
+     * for us to move.
+     */
     if ( __test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) )
     {
         if ( unlikely(svc->migrate_rqd->id < 0) )
@@ -1462,7 +1554,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     min_avgload = MAX_LOAD;
 
-    /* Find the runqueue with the lowest instantaneous load */
+    /* Find the runqueue with the lowest average load. */
     for_each_cpu(i, &prv->active_queues)
     {
         struct csched2_runqueue_data *rqd;
@@ -1505,16 +1597,17 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     /* We didn't find anyone (most likely because of spinlock contention).
      */
     if ( min_rqi == -1 )
-        new_cpu = get_fallback_cpu(svc);
-    else
     {
-        cpumask_and(cpumask_scratch, vc->cpu_hard_affinity,
-                    &prv->rqd[min_rqi].active);
-        new_cpu = cpumask_any(cpumask_scratch);
-        BUG_ON(new_cpu >= nr_cpu_ids);
+        new_cpu = get_fallback_cpu(svc);
+        goto out_up;
     }
-out_up:
+
+    cpumask_and(cpumask_scratch, vc->cpu_hard_affinity,
+                &prv->rqd[min_rqi].active);
+    new_cpu = cpumask_any(cpumask_scratch);
+    BUG_ON(new_cpu >= nr_cpu_ids);
+
+ out_up:
     read_unlock(&prv->lock);
 
     if ( unlikely(tb_init_done) )
@@ -2166,7 +2259,11 @@ csched2_schedule(
 
     /* Clear "tickled" bit now that we've been scheduled */
     if ( cpumask_test_cpu(cpu, &rqd->tickled) )
+    {
         __cpumask_clear_cpu(cpu, &rqd->tickled);
+        cpumask_andnot(cpumask_scratch, &rqd->idle, &rqd->tickled);
+        smt_idle_mask_set(cpu, cpumask_scratch, &rqd->smt_idle); // XXX
+    }
 
     /* Update credits */
     burn_credits(rqd, scurr, now);
@@ -2228,7 +2325,10 @@ csched2_schedule(
 
         /* Clear the idle mask if necessary */
         if ( cpumask_test_cpu(cpu, &rqd->idle) )
+        {
             __cpumask_clear_cpu(cpu, &rqd->idle);
+            smt_idle_mask_clear(cpu, &rqd->smt_idle);
+        }
 
         snext->start_time = now;
 
@@ -2250,10 +2350,17 @@ csched2_schedule(
         if ( tasklet_work_scheduled )
         {
             if ( cpumask_test_cpu(cpu, &rqd->idle) )
+            {
                 __cpumask_clear_cpu(cpu, &rqd->idle);
+                smt_idle_mask_clear(cpu, &rqd->smt_idle);
+            }
         }
         else if ( !cpumask_test_cpu(cpu, &rqd->idle) )
+        {
             __cpumask_set_cpu(cpu, &rqd->idle);
+            cpumask_andnot(cpumask_scratch, &rqd->idle, &rqd->tickled);
+            smt_idle_mask_set(cpu, cpumask_scratch, &rqd->smt_idle);
+        }
         /* Make sure avgload gets updated periodically even
          * if there's no activity */
         update_load(ops, rqd, NULL, 0, now);
@@ -2383,6 +2490,8 @@ csched2_dump(const struct scheduler *ops)
         printk("\tidlers: %s\n", cpustr);
         cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->rqd[i].tickled);
         printk("\ttickled: %s\n", cpustr);
+        cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->rqd[i].smt_idle);
+        printk("\tfully idle cores: %s\n", cpustr);
     }
 
     printk("Domain info:\n");
@@ -2536,6 +2645,7 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
     __cpumask_set_cpu(cpu, &rqd->idle);
     __cpumask_set_cpu(cpu, &rqd->active);
     __cpumask_set_cpu(cpu, &prv->initialized);
+    __cpumask_set_cpu(cpu, &rqd->smt_idle);
 
     return rqi;
 }
@@ -2641,6 +2751,7 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
     printk(XENLOG_INFO "Removing cpu %d from runqueue %d\n", cpu, rqi);
 
     __cpumask_clear_cpu(cpu, &rqd->idle);
+    __cpumask_clear_cpu(cpu, &rqd->smt_idle);
     __cpumask_clear_cpu(cpu, &rqd->active);
 
     if ( cpumask_empty(&rqd->active) )
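As a closing note, the invariant the new smt_idle mask maintains can be shown
with another standalone sketch (again illustrative only: uint64_t replaces
cpumask_t, and sibling_mask() is a hypothetical stand-in for
per_cpu(cpu_sibling_mask, cpu) with a made-up two-threads-per-core topology).
A core's bits go into smt_idle only when all its siblings are idle and not
tickled, and they all drop out as soon as one sibling stops being idle:

/*
 * Illustration only (not Xen code) of the smt_idle maintenance rule:
 * set a core's bits when every sibling is idle and untickled, clear the
 * whole core as soon as one sibling picks up work.
 */
#include <stdint.h>
#include <stdio.h>

#define THREADS_PER_CORE 2 /* made-up, fixed topology */

/* Hypothetical stand-in for per_cpu(cpu_sibling_mask, cpu). */
static uint64_t sibling_mask(unsigned int cpu)
{
    return ((1ULL << THREADS_PER_CORE) - 1) << (cpu & ~(THREADS_PER_CORE - 1));
}

/* Mirrors smt_idle_mask_set(); idlers is expected to be idle & ~tickled. */
static void smt_idle_set(unsigned int cpu, uint64_t idlers, uint64_t *smt_idle)
{
    uint64_t sibs = sibling_mask(cpu);

    if ( (sibs & idlers) == sibs ) /* all siblings idle and not tickled */
        *smt_idle |= sibs;
}

/* Mirrors smt_idle_mask_clear(): drop the whole core. */
static void smt_idle_clear(unsigned int cpu, uint64_t *smt_idle)
{
    *smt_idle &= ~sibling_mask(cpu);
}

int main(void)
{
    uint64_t idle = 0x0f, tickled = 0x04, smt_idle = 0;

    smt_idle_set(0, idle & ~tickled, &smt_idle); /* core 0 fully idle: bits set */
    smt_idle_set(2, idle & ~tickled, &smt_idle); /* cpu 2 was tickled: skipped  */
    printf("after set:   %#llx\n", (unsigned long long)smt_idle); /* 0x3 */

    smt_idle_clear(1, &smt_idle);                /* cpu 1 starts running        */
    printf("after clear: %#llx\n", (unsigned long long)smt_idle); /* 0 */

    return 0;
}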