From patchwork Fri Apr 8 01:24:21 2016
X-Patchwork-Submitter: Dario Faggioli
X-Patchwork-Id: 8779571
From: Dario Faggioli <raistlin.df@gmail.com>
To: xen-devel@lists.xenproject.org
Cc: George Dunlap, Juergen Gross, Uma Sharma
Date: Fri, 08 Apr 2016 03:24:21 +0200
Message-ID: <20160408012420.10762.61178.stgit@Solace.fritz.box>
In-Reply-To: <20160408011204.10762.14241.stgit@Solace.fritz.box>
References: <20160408011204.10762.14241.stgit@Solace.fritz.box>
User-Agent: StGit/0.17.1-dirty
Subject: [Xen-devel] [PATCH v3 08/11] xen: sched: allow for choosing credit2
 runqueues configuration at boot
List-Id: Xen developer discussion

In fact, credit2 uses CPU topology to decide how to arrange its
internal runqueues. Before this change, only 'one runqueue per
socket' was allowed.
However, experiments have shown that, for instance, having one runqueue
per physical core improves performance, especially when hyperthreading
is available.

In general, it makes sense to allow users to pick one runqueue
arrangement at boot time, so that:
 - more experiments can be easily performed, to even better assess
   and improve performance;
 - one can select the best configuration for their specific use case
   and/or hardware.

This patch enables the above.

Note that, for correctly arranging runqueues to be per-core, just
checking cpu_to_core() on the host CPUs is not enough. In fact, cores
(and hyperthreads) on different sockets can have the same core (and
thread) IDs! We therefore need to check whether the full topology of
two CPUs matches, for them to be put in the same runqueue.

Note also that the (albeit not functional) default for credit2 has,
until now, been per-socket runqueues. This patch leaves things that
way, to avoid mixing policy and technical changes.

Finally, it would be a nice feature to be able to select a particular
runqueue arrangement when creating a Credit2 cpupool. This is left as
future work.

Signed-off-by: Dario Faggioli <raistlin.df@gmail.com>
Signed-off-by: Uma Sharma
---
Cc: George Dunlap
Cc: Uma Sharma
Cc: Juergen Gross
---
Changes from v2:
 * valid strings are now in an array, that we scan during parameter
   parsing, as suggested during review.

Changes from v1:
 * fix bug in parameter parsing, and start using strcmp() for that,
   as requested during review.
---
 docs/misc/xen-command-line.markdown | 19 ++++++++
 xen/common/sched_credit2.c          | 83 +++++++++++++++++++++++++++++++++--
 2 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index ca77e3b..0047f94 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -469,6 +469,25 @@ combination with the `low_crashinfo` command line option.
 ### credit2\_load\_window\_shift
 > `= <integer>`
 
+### credit2\_runqueue
+> `= core | socket | node | all`
+
+> Default: `socket`
+
+Specify how host CPUs are arranged in runqueues. Runqueues are kept
+balanced with respect to the load generated by the vCPUs running on
+them. Smaller runqueues (as in with `core`) means more accurate load
+balancing (for instance, it will deal better with hyperthreading),
+but also more overhead.
+
+Available alternatives, with their meaning, are:
+* `core`: one runqueue per each physical core of the host;
+* `socket`: one runqueue per each physical socket (which often,
+  but not always, matches a NUMA node) of the host;
+* `node`: one runqueue per each NUMA node of the host;
+* `all`: just one runqueue shared by all the logical pCPUs of
+  the host
+
 ### dbgp
 > `= ehci[ <device> | @pci:<bsdf> ]`
 
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index a61a45a..eeb3f54 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -81,10 +81,6 @@
  * Credits are "reset" when the next vcpu in the runqueue is less than
  * or equal to zero. At that point, everyone's credits are "clipped"
  * to a small value, and a fixed credit is added to everyone.
- *
- * The plan is for all cores that share an L2 will share the same
- * runqueue. At the moment, there is one global runqueue for all
- * cores.
  */
 
 /*
@@ -193,6 +189,63 @@ static int __read_mostly opt_overload_balance_tolerance = -3;
 integer_param("credit2_balance_over", opt_overload_balance_tolerance);
 
 /*
+ * Runqueue organization.
+ *
+ * The various cpus are to be assigned each one to a runqueue, and we
+ * want that to happen basing on topology. At the moment, it is possible
+ * to choose to arrange runqueues to be:
+ *
+ * - per-core: meaning that there will be one runqueue per each physical
+ *             core of the host. This will happen if the opt_runqueue
+ *             parameter is set to 'core';
+ *
+ * - per-node: meaning that there will be one runqueue per each physical
+ *             NUMA node of the host. This will happen if the opt_runqueue
+ *             parameter is set to 'node';
+ *
+ * - per-socket: meaning that there will be one runqueue per each physical
+ *               socket (AKA package, which often, but not always, also
+ *               matches a NUMA node) of the host; This will happen if
+ *               the opt_runqueue parameter is set to 'socket';
+ *
+ * - global: meaning that there will be only one runqueue to which all the
+ *           (logical) processors of the host belongs. This will happen if
+ *           the opt_runqueue parameter is set to 'all'.
+ *
+ * Depending on the value of opt_runqueue, therefore, cpus that are part of
+ * either the same physical core, or of the same physical socket, will be
+ * put together to form runqueues.
+ */
+#define OPT_RUNQUEUE_CORE   0
+#define OPT_RUNQUEUE_SOCKET 1
+#define OPT_RUNQUEUE_NODE   2
+#define OPT_RUNQUEUE_ALL    3
+static const char *const opt_runqueue_str[] = {
+    [OPT_RUNQUEUE_CORE] = "core",
+    [OPT_RUNQUEUE_SOCKET] = "socket",
+    [OPT_RUNQUEUE_NODE] = "node",
+    [OPT_RUNQUEUE_ALL] = "all"
+};
+static int __read_mostly opt_runqueue = OPT_RUNQUEUE_SOCKET;
+
+static void parse_credit2_runqueue(const char *s)
+{
+    unsigned int i;
+
+    for ( i = 0; i <= OPT_RUNQUEUE_ALL; i++ )
+    {
+        if ( !strcmp(s, opt_runqueue_str[i]) )
+        {
+            opt_runqueue = i;
+            return;
+        }
+    }
+
+    printk("WARNING, unrecognized value of credit2_runqueue option!\n");
+}
+custom_param("credit2_runqueue", parse_credit2_runqueue);
+
+/*
  * Per-runqueue data
  */
 struct csched2_runqueue_data {
@@ -1974,6 +2027,22 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
     cpumask_clear_cpu(rqi, &prv->active_queues);
 }
 
+static inline bool_t same_node(unsigned int cpua, unsigned int cpub)
+{
+    return cpu_to_node(cpua) == cpu_to_node(cpub);
+}
+
+static inline bool_t same_socket(unsigned int cpua, unsigned int cpub)
+{
+    return cpu_to_socket(cpua) == cpu_to_socket(cpub);
+}
+
+static inline bool_t same_core(unsigned int cpua, unsigned int cpub)
+{
+    return same_socket(cpua, cpub) &&
+           cpu_to_core(cpua) == cpu_to_core(cpub);
+}
+
 static unsigned int
 cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
 {
@@ -2006,7 +2075,10 @@ cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
         BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID ||
                cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID);
 
-        if ( cpu_to_socket(cpumask_first(&rqd->active)) == cpu_to_socket(cpu) )
+        if ( opt_runqueue == OPT_RUNQUEUE_ALL ||
+             (opt_runqueue == OPT_RUNQUEUE_CORE && same_core(peer_cpu, cpu)) ||
+             (opt_runqueue == OPT_RUNQUEUE_SOCKET && same_socket(peer_cpu, cpu)) ||
+             (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu)) )
             break;
     }
 
@@ -2170,6 +2242,7 @@ csched2_init(struct scheduler *ops)
     printk(" load_window_shift: %d\n", opt_load_window_shift);
     printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance);
     printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance);
+    printk(" runqueues arrangement: %s\n", opt_runqueue_str[opt_runqueue]);
 
     if ( opt_load_window_shift < LOADAVG_WINDOW_SHIFT_MIN )
     {