From patchwork Fri Aug 18 15:50:53 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dario Faggioli <dario.faggioli@citrix.com>
X-Patchwork-Id: 9909533
Return-Path: <xen-devel-bounces@lists.xen.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
	[172.30.200.125])
	by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id
	8B1F1602C8 for <patchwork-xen-devel@patchwork.kernel.org>;
	Fri, 18 Aug 2017 15:53:49 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 77C0628CE2
	for <patchwork-xen-devel@patchwork.kernel.org>;
	Fri, 18 Aug 2017 15:53:49 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 6C6DD28D00; Fri, 18 Aug 2017 15:53:49 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-3.6 required=2.0 tests=BAYES_00,DKIM_SIGNED,
	RCVD_IN_DNSWL_MED,RCVD_IN_SORBS_SPAM,T_DKIM_INVALID autolearn=ham
	version=3.3.1
Received: from lists.xenproject.org (lists.xenproject.org [192.237.175.120])
	(using TLSv1.2 with cipher AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 9FA7A28CE2
	for <patchwork-xen-devel@patchwork.kernel.org>;
	Fri, 18 Aug 2017 15:53:47 +0000 (UTC)
Received: from localhost ([127.0.0.1] helo=lists.xenproject.org)
	by lists.xenproject.org with esmtp (Exim 4.84_2)
	(envelope-from <xen-devel-bounces@lists.xen.org>)
	id 1dijY3-0008KC-SC; Fri, 18 Aug 2017 15:50:59 +0000
Received: from mail6.bemta6.messagelabs.com ([193.109.254.103])
	by lists.xenproject.org with esmtp (Exim 4.84_2)
	(envelope-from <raistlin.df@gmail.com>) id 1dijY2-0008Jt-Gh
	for xen-devel@lists.xenproject.org; Fri, 18 Aug 2017 15:50:58 +0000
Received: from [85.158.143.35] by server-7.bemta-6.messagelabs.com id
	78/F4-03557-1EC07995; Fri, 18 Aug 2017 15:50:57 +0000
X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFupileJIrShJLcpLzFFi42K5GNqwUfcBz/R
	Ig9aj8hbft0xmcmD0OPzhCksAYxRrZl5SfkUCa8bLpnuMBbNnMFXcPZfVwHj9PGMXIxeHkMBM
	RonO7//ZQRwWgTWsEr0nLwM5nBwSApdYJZasCoKw4yQeNS1hg7CrJXraHzCD2EICKhI3t69ig
	rB/MkpMXsUFYgsL6EkcOfqDHcK2l/jT9xqsl03AQOLNjr2sILaIgJLEvVWTmUAWMwssYJI4+b
	2bESTBIqAqsXLHL6AiDg5eAS+J1QttQcKcAt4Sbz+9ZYHY5SXxeMdEsL2iAnISKy+3gM3kFRC
	UODnzCQtIK7OApsT6XfogYWYBeYntb+cwT2AUmYWkahZC1SwkVQsYmVcxahSnFpWlFukaGukl
	FWWmZ5TkJmbm6BoamOnlphYXJ6an5iQmFesl5+duYgSGPwMQ7GC8vDHgEKMkB5OSKO/vWVMih
	fiS8lMqMxKLM+KLSnNSiw8xanBwCEw4O3c6kxRLXn5eqpIEbzX39EghwaLU9NSKtMwcYITClE
	pw8CiJ8OqBpHmLCxJzizPTIVKnGC05rlxZ94WJY8Pq9UByyoHtX5iEwOZJifPmgTQIgDRklOb
	BjYMlkUuMslLCvIxAxwrxFKQW5WaWoMq/YhTnYFQS5nUDmcKTmVcCt/UV0EFMQAcZtk4DOagk
	ESEl1cAYzyvPaFX05/ze61fuRSc5TP0uu/npwYPsP47dXKuyb6aocPmyw87tcSmPrfZebWPwZ
	32lWqGx5EXoOu2iKTcYUv22SXvFfZW+EPrnDmvnA0k2nZ+qixNPfuV34Za/HXCj0MR30Vov3g
	Xd/z+Ic3gsiJslduD/g+c1S2V2tW7T0/AUN/9/cvYmJZbijERDLeai4kQAgHQfqB0DAAA=
X-Env-Sender: raistlin.df@gmail.com
X-Msg-Ref: server-8.tower-21.messagelabs.com!1503071456!78433528!1
X-Originating-IP: [209.85.128.177]
X-SpamReason: No, hits=0.5 required=7.0 tests=BODY_RANDOM_LONG
X-StarScan-Received: 
X-StarScan-Version: 9.4.45; banners=-,-,-
X-VirusChecked: Checked
Received: (qmail 56588 invoked from network); 18 Aug 2017 15:50:56 -0000
Received: from mail-wr0-f177.google.com (HELO mail-wr0-f177.google.com)
	(209.85.128.177)
	by server-8.tower-21.messagelabs.com with AES128-GCM-SHA256 encrypted
	SMTP; 18 Aug 2017 15:50:56 -0000
Received: by mail-wr0-f177.google.com with SMTP id z91so63777071wrc.4
	for <xen-devel@lists.xenproject.org>;
	Fri, 18 Aug 2017 08:50:56 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
	h=sender:subject:from:to:cc:date:message-id:in-reply-to:references
	:user-agent:mime-version:content-transfer-encoding;
	bh=L4G1N9q1oZac97iv9QgGG0MpmSfcyrUEEwlRQqbKjLA=;
	b=hxAWv1LGDQD+vlJH1b+hQRF0uShMxSUUBXhMrejw4Va2T+HRu0oGKjxWPl8+GA+9XM
	aopzWRxd6L3TeMKigF3h7o501WG2FbM3NfJSSYCykoWvutjc7fy9nCxp3o6afbG7i9fm
	kRlauYVvpNagkGNAOPd3M633nwW5MIf8Q+t4OwWNe830giUcYYYg/eC2cYSJjqPAF/xm
	VRYR9/KIjfsKUYXgQTwaFJLaJvwIoFk8DdPq1dyvpcGWZSe1qFZY30Bg9kLZ7FKOsNvt
	kLUWuonC9Zjb7x0scgN/N2X2TpvZBWpE+FgnW6qMSh1P483csmNHYFXHCkhuA3Ebnp8+
	OEyg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20161025;
	h=x-gm-message-state:sender:subject:from:to:cc:date:message-id
	:in-reply-to:references:user-agent:mime-version
	:content-transfer-encoding;
	bh=L4G1N9q1oZac97iv9QgGG0MpmSfcyrUEEwlRQqbKjLA=;
	b=Bht4UlW5xm0zImYDPuaZeAvSnhXgpk+GyN2ZiRvUv6vj8M+FglkY85edX9VL5t0tDN
	rBu2V2ax5QpgSq8pqUT5AWxZZxkeYpyqoF7n38M0ucjB1GG3YcBp/ahLvIxKaVSkx5J3
	5A0gC30NiznEbHpjVFxKxhDlirjwDG1NeFdyKQ7HdBIGAsmfG5tAC95ySxJpGt9zN7B1
	OLt+ZCizqNe7wgDrQJlQtjlY2clVfADU/OA7VEDmckFi/PzjzHCMRklgE5/MKpMt4U/R
	zBiAIFEiRTLQ2u/f+x72xERBw2tRGWQjbaK41vnwhR1Csayy2rAHFSIuIWEfj6PeE+bE
	gk+g==
X-Gm-Message-State: AHYfb5jhnOdO1o9eFtptO+g/ipl3PuiTif/mwEdao5FXTkrwOfxJz2mq
	b4oLjYFic5eiXg==
X-Received: by 10.223.197.139 with SMTP id m11mr6676111wrg.217.1503071455893;
	Fri, 18 Aug 2017 08:50:55 -0700 (PDT)
Received: from Solace.fritz.box ([80.66.223.3])
	by smtp.gmail.com with ESMTPSA id
	u199sm4224257wmd.26.2017.08.18.08.50.54
	(version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
	Fri, 18 Aug 2017 08:50:55 -0700 (PDT)
From: Dario Faggioli <dario.faggioli@citrix.com>
To: xen-devel@lists.xenproject.org
Date: Fri, 18 Aug 2017 17:50:53 +0200
Message-ID: <150307145322.6642.8867195330176310748.stgit@Solace.fritz.box>
In-Reply-To: <150307081385.6642.6516202758428761422.stgit@Solace.fritz.box>
References: <150307081385.6642.6516202758428761422.stgit@Solace.fritz.box>
User-Agent: StGit/0.17.1-dirty
MIME-Version: 1.0
Cc: Wei Liu <wei.liu2@citrix.com>,
	George Dunlap <george.dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Ian Jackson <ian.jackson@eu.citrix.com>,
	Anshul Makkar <anshulmakkar@gmail.com>, Jan Beulich <jbeulich@suse.com>
Subject: [Xen-devel] [PATCH v2 1/4] xen: credit2: implement utilization cap
X-BeenThere: xen-devel@lists.xen.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: Xen developer discussion <xen-devel.lists.xen.org>
List-Unsubscribe: <https://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <https://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Errors-To: xen-devel-bounces@lists.xen.org
Sender: "Xen-devel" <xen-devel-bounces@lists.xen.org>
X-Virus-Scanned: ClamAV using ClamSMTP

This commit implements the Xen part of the cap mechanism for
Credit2.

A cap is how much, in terms of % of physical CPU time, a domain
can execute at most.

For instance, a domain that must not use more than 1/4 of
one physical CPU, must have a cap of 25%; one that must not
use more than 1+1/2 of physical CPU time, must be given a cap
of 150%.

Caps are per domain, so it is all a domain's vCPUs, cumulatively,
that will be forced to execute no more than the decided amount.

This is implemented by giving each domain a 'budget', and
using a (per-domain again) periodic timer. Values of budget
and 'period' are chosen so that budget/period is equal to the
cap itself.

Budget is burned by the domain's vCPUs, in a similar way to
how credits are.

When a domain runs out of budget, its vCPUs can't run any
longer. They can gain, when the budget is replenishment by
the timer, which event happens once every period.

Blocking the vCPUs because of lack of budget happens by
means of a new (_VPF_parked) pause flag, so that, e.g.,
vcpu_runnable() still works. This is similar to what is
done in sched_rtds.c, as opposed to what happens in
sched_credit.c, where vcpu_pause() and vcpu_unpause()
(which means, among other things, more overhead).

Note that, while adding new fields to csched2_vcpu and
csched2_dom, currently existing members are being moved
around, to achieve best placement inside cache lines.

Note also that xenalyze and tools/xentrace/format are being
updated too.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Anshul Makkar <anshulmakkar@gmail.com>
---
Changed from v1:
* used has_cap() instead of open coding it in burn_credits();
* removed some of the unlikely() around has_cap(), as, although cap is not on
  by default, it's up to the user to decide how many domains will have caps,
  and we can't assume much about what users will actually do;
* tried to clarify the comment about (the non deterministic nature of the)
  CPU capacity distribution between the vCPUs of a multi-vCPUs guest;
* clarified the comment about budget being replenished to nothing more than
  top capacity, i.e., about the fact that budget is *not* being accumulated
  across different period;
* fixed many style and typo issues in comments;
* added a comment about the budget distribution logic (to the vCPUs) being
  subject to be refined in subsequent commits;
* renaming:
   vcpu_try_to_get_budget() --> vcpu_grab_budget()
   vcpu_give_back_budget() --> vcpu_return_budget()
   repl_sdom_budget() --> replanish_domain_budget()
* change how replenishment logic deals with cases of overrun. In v1, we were
  always doing multiple replenishment at once, until the domain's budget was
  back into the black. Now, in cases of substantial overrun, we just do one
  replenishment, and rely on future ones to bring back the budget into
  being a positive number. This was agreed upon with George during v1's
  review.
---
 tools/xentrace/formats     |    2 
 tools/xentrace/xenalyze.c  |   10 +
 xen/common/sched_credit2.c |  521 +++++++++++++++++++++++++++++++++++++++++---
 xen/include/xen/sched.h    |    3 
 4 files changed, 492 insertions(+), 44 deletions(-)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index f39182a..d6e7e3f 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -51,7 +51,7 @@
 
 0x00022201  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tick
 0x00022202  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:runq_pos       [ dom:vcpu = 0x%(1)08x, pos = %(2)d]
-0x00022203  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:credit burn    [ dom:vcpu = 0x%(1)08x, credit = %(2)d, delta = %(3)d ]
+0x00022203  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:credit burn    [ dom:vcpu = 0x%(1)08x, credit = %(2)d, budget = %(3)d, delta = %(4)d ]
 0x00022204  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:credit_add
 0x00022205  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tickle_check   [ dom:vcpu = 0x%(1)08x, credit = %(2)d, score = %(3)d ]
 0x00022206  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tickle         [ cpu = %(1)d ]
diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index 39fc35f..79bdba7 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -7680,12 +7680,14 @@ void sched_process(struct pcpu_info *p)
             if(opt.dump_all) {
                 struct {
                     unsigned int vcpuid:16, domid:16;
-                    int credit, delta;
+                    int credit, budget, delta;
                 } *r = (typeof(r))ri->d;
 
-                printf(" %s csched2:burn_credits d%uv%u, credit = %d, delta = %d\n",
-                       ri->dump_header, r->domid, r->vcpuid,
-                       r->credit, r->delta);
+                printf(" %s csched2:burn_credits d%uv%u, credit = %d, ",
+                       ri->dump_header, r->domid, r->vcpuid, r->credit);
+                if ( r->budget != INT_MIN )
+                    printf("budget = %d, ", r->budget);
+                printf("delta = %d\n", r->delta);
             }
             break;
         case TRC_SCHED_CLASS_EVT(CSCHED2, 5): /* TICKLE_CHECK      */
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index fab7f2e..69a7679 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -92,6 +92,86 @@
  */
 
 /*
+ * Utilization cap:
+ *
+ * Setting an pCPU utilization cap for a domain means the following:
+ *
+ * - a domain can have a cap, expressed in terms of % of physical CPU time.
+ *   A domain that must not use more than 1/4 of _one_ physical CPU, will
+ *   be given a cap of 25%; a domain that must not use more than 1+1/2 of
+ *   physical CPU time, will be given a cap of 150%;
+ *
+ * - caps are per-domain (not per-vCPU). If a domain has only 1 vCPU, and
+ *   a 40% cap, that one vCPU will use 40% of one pCPU. If a somain has 4
+ *   vCPUs, and a 200% cap, the equivalent of 100% time on 2 pCPUs will be
+ *   split among the v vCPUs. How much each of the vCPUs will actually get,
+ *   during any given interval of time, is unspecified (as it depends on
+ *   various aspects: workload, system load, etc.). For instance, it is
+ *   possible that, during a given time interval, 2 vCPUs use 100% each,
+ *   and the other two use nothing; while during another time interval,
+ *   two vCPUs use 80%, one uses 10% and the other 30%; or that each use
+ *   50% (and so on and so forth).
+ *
+ * For implementing this, we use the following approach:
+ *
+ * - each domain is given a 'budget', an each domain has a timer, which
+ *   replenishes the domain's budget periodically. The budget is the amount
+ *   of time the vCPUs of the domain can use every 'period';
+ *
+ * - the period is CSCHED2_BDGT_REPL_PERIOD, and is the same for all domains
+ *   (but each domain has its own timer; so the all are periodic by the same
+ *   period, but replenishment of the budgets of the various domains, at
+ *   periods boundaries, are not synchronous);
+ *
+ * - when vCPUs run, they consume budget. When they don't run, they don't
+ *   consume budget. If there is no budget left for the domain, no vCPU of
+ *   that domain can run. If a vCPU tries to run and finds that there is no
+ *   budget, it blocks.
+ *   At whatever time a vCPU wants to run, it must check the domain's budget,
+ *   and if there is some, it can use it.
+ *
+ * - budget is replenished to the top of the capacity for the domain once
+ *   per period. Even if there was some leftover budget from previous period,
+ *   though, the budget after a replenishment will always be at most equal
+ *   to the total capacify of the domain ('tot_budget');
+ *
+ * - when a budget replenishment occurs, if there are vCPUs that had been
+ *   blocked because of lack of budget, they'll be unblocked, and they will
+ *   (potentially) be able to run again.
+ *
+ * Finally, some even more implementation related detail:
+ *
+ * - budget is stored in a domain-wide pool. vCPUs of the domain that want
+ *   to run go to such pool, and grub some. When they do so, the amount
+ *   they grabbed is _immediately_ removed from the pool. This happens in
+ *   vcpu_grab_budget();
+ *
+ * - when vCPUs stop running, if they've not consumed all the budget they
+ *   took, the leftover is put back in the pool. This happens in
+ *   vcpu_return_budget();
+ *
+ * - the above means that a vCPU can find out that there is no budget and
+ *   block, not only if the cap has actually been reached (for this period),
+ *   but also if some other vCPUs, in order to run, have grabbed a certain
+ *   quota of budget, no matter whether they've already used it all or not.
+ *   A vCPU blocking because (any form of) lack of budget is said to be
+ *   "parked", and such blocking happens in park_vcpu();
+ *
+ * - when a vCPU stops running, and puts back some budget in the domain pool,
+ *   we need to check whether there is someone which has been parked and that
+ *   can be unparked. This happens in unpark_parked_vcpus(), called from
+ *   csched2_context_saved();
+ *
+ * - of course, unparking happens also as a consequence of the domain's budget
+ *   being replenished by the periodic timer. This also occurs by means of
+ *   calling csched2_context_saved() (but from replenish_domain_budget());
+ *
+ * - parked vCPUs of a domain are kept in a (per-domain) list, called
+ *   'parked_vcpus'). Manipulation of the list and of the domain-wide budget
+ *   pool, must occur only when holding the 'budget_lock'.
+ */
+
+/*
  * Locking:
  *
  * - runqueue lock
@@ -112,18 +192,29 @@
  *     runqueue each cpu is;
  *  + serializes the operation of changing the weights of domains;
  *
+ * - Budget lock
+ *  + it is per-domain;
+ *  + protects, in domains that have an utilization cap;
+ *   * manipulation of the total budget of the domain (as it is shared
+ *     among all vCPUs of the domain),
+ *   * manipulation of the list of vCPUs that are blocked waiting for
+ *     some budget to be available.
+ *
  * - Type:
  *  + runqueue locks are 'regular' spinlocks;
  *  + the private scheduler lock can be an rwlock. In fact, data
  *    it protects is modified only during initialization, cpupool
  *    manipulation and when changing weights, and read in all
- *    other cases (e.g., during load balancing).
+ *    other cases (e.g., during load balancing);
+ *  + budget locks are 'regular' spinlocks.
  *
  * Ordering:
  *  + tylock must be used when wanting to take a runqueue lock,
  *    if we already hold another one;
  *  + if taking both a runqueue lock and the private scheduler
- *    lock is, the latter must always be taken for first.
+ *    lock is, the latter must always be taken for first;
+ *  + if taking both a runqueue lock and a budget lock, the former
+ *    must always be taken for first.
  */
 
 /*
@@ -166,6 +257,8 @@
 #define CSCHED2_CREDIT_RESET         0
 /* Max timer: Maximum time a guest can be run for. */
 #define CSCHED2_MAX_TIMER            CSCHED2_CREDIT_INIT
+/* Period of the cap replenishment timer. */
+#define CSCHED2_BDGT_REPL_PERIOD     ((opt_cap_period)*MILLISECS(1))
 
 /*
  * Flags
@@ -295,6 +388,14 @@ static int __read_mostly opt_underload_balance_tolerance = 0;
 integer_param("credit2_balance_under", opt_underload_balance_tolerance);
 static int __read_mostly opt_overload_balance_tolerance = -3;
 integer_param("credit2_balance_over", opt_overload_balance_tolerance);
+/*
+ * Domains subject to a cap receive a replenishment of their runtime budget
+ * once every opt_cap_period interval. Default is 10 ms. The amount of budget
+ * they receive depends on their cap. For instance, a domain with a 50% cap
+ * will receive 50% of 10 ms, so 5 ms.
+ */
+static unsigned int __read_mostly opt_cap_period = 10;    /* ms */
+integer_param("credit2_cap_period_ms", opt_cap_period);
 
 /*
  * Runqueue organization.
@@ -411,13 +512,15 @@ static DEFINE_PER_CPU(int, runq_map);
  * Virtual CPU
  */
 struct csched2_vcpu {
-    struct list_head rqd_elem;         /* On csched2_runqueue_data's svc list */
+    struct csched2_dom *sdom;          /* Up-pointer to domain                */
+    struct vcpu *vcpu;                 /* Up-pointer, to vcpu                 */
     struct csched2_runqueue_data *rqd; /* Up-pointer to the runqueue          */
 
     int credit;                        /* Current amount of credit            */
     unsigned int weight;               /* Weight of this vcpu                 */
     unsigned int residual;             /* Reminder of div(max_weight/weight)  */
     unsigned flags;                    /* Status flags (16 bits would be ok,  */
+    s_time_t budget;                   /* Current budget (if domains has cap) */
                                        /* but clear_bit() does not like that) */
     s_time_t start_time;               /* Time we were scheduled (for credit) */
 
@@ -426,9 +529,8 @@ struct csched2_vcpu {
     s_time_t avgload;                  /* Decaying queue load                 */
 
     struct list_head runq_elem;        /* On the runqueue (rqd->runq)         */
-    struct csched2_dom *sdom;          /* Up-pointer to domain                */
-    struct vcpu *vcpu;                 /* Up-pointer, to vcpu                 */
-
+    struct list_head parked_elem;      /* On the parked_vcpus list            */
+    struct list_head rqd_elem;         /* On csched2_runqueue_data's svc list */
     struct csched2_runqueue_data *migrate_rqd; /* Pre-determined migr. target */
     int tickled_cpu;                   /* Cpu that will pick us (-1 if none)  */
 };
@@ -437,9 +539,19 @@ struct csched2_vcpu {
  * Domain
  */
 struct csched2_dom {
-    struct list_head sdom_elem; /* On csched2_runqueue_data's sdom list       */
     struct domain *dom;         /* Up-pointer to domain                       */
+
+    spinlock_t budget_lock;     /* Serialized budget calculations             */
+    s_time_t tot_budget;        /* Total amount of budget                     */
+    s_time_t budget;            /* Currently available budget                 */
+
+    struct timer *repl_timer;   /* Timer for periodic replenishment of budget */
+    s_time_t next_repl;         /* Time at which next replenishment occurs    */
+    struct list_head parked_vcpus; /* List of CPUs waiting for budget         */
+
+    struct list_head sdom_elem; /* On csched2_runqueue_data's sdom list       */
     uint16_t weight;            /* User specified weight                      */
+    uint16_t cap;               /* User specified cap                         */
     uint16_t nr_vcpus;          /* Number of vcpus of this domain             */
 };
 
@@ -474,6 +586,12 @@ static inline struct csched2_runqueue_data *c2rqd(const struct scheduler *ops,
     return &csched2_priv(ops)->rqd[c2r(cpu)];
 }
 
+/* Does the domain of this vCPU have a cap? */
+static inline bool has_cap(const struct csched2_vcpu *svc)
+{
+    return svc->budget != STIME_MAX;
+}
+
 /*
  * Hyperthreading (SMT) support.
  *
@@ -1515,7 +1633,16 @@ static void reset_credit(const struct scheduler *ops, int cpu, s_time_t now,
          * that the credit it has spent so far get accounted.
          */
         if ( svc->vcpu == curr_on_cpu(svc_cpu) )
+        {
             burn_credits(rqd, svc, now);
+            /*
+             * And, similarly, in case it has run out of budget, as a
+             * consequence of this round of accounting, we also must inform
+             * its pCPU that it's time to park it, and pick up someone else.
+             */
+            if ( unlikely(svc->budget <= 0) )
+                tickle_cpu(svc_cpu, rqd);
+        }
 
         start_credit = svc->credit;
 
@@ -1571,27 +1698,35 @@ void burn_credits(struct csched2_runqueue_data *rqd,
 
     delta = now - svc->start_time;
 
-    if ( likely(delta > 0) )
-    {
-        SCHED_STAT_CRANK(burn_credits_t2c);
-        t2c_update(rqd, delta, svc);
-        svc->start_time = now;
-    }
-    else if ( delta < 0 )
+    if ( unlikely(delta <= 0) )
     {
-        d2printk("WARNING: %s: Time went backwards? now %"PRI_stime" start_time %"PRI_stime"\n",
-                 __func__, now, svc->start_time);
+        if ( unlikely(delta < 0) )
+            d2printk("WARNING: %s: Time went backwards? now %"PRI_stime
+                     " start_time %"PRI_stime"\n", __func__, now,
+                     svc->start_time);
+        goto out;
     }
 
+    SCHED_STAT_CRANK(burn_credits_t2c);
+    t2c_update(rqd, delta, svc);
+
+    if ( has_cap(svc) )
+        svc->budget -= delta;
+
+    svc->start_time = now;
+
+ out:
     if ( unlikely(tb_init_done) )
     {
         struct {
             unsigned vcpu:16, dom:16;
-            int credit, delta;
+            int credit, budget;
+            int delta;
         } d;
         d.dom = svc->vcpu->domain->domain_id;
         d.vcpu = svc->vcpu->vcpu_id;
         d.credit = svc->credit;
+        d.budget = has_cap(svc) ?  svc->budget : INT_MIN;
         d.delta = delta;
         __trace_var(TRC_CSCHED2_CREDIT_BURN, 1,
                     sizeof(d),
@@ -1599,6 +1734,248 @@ void burn_credits(struct csched2_runqueue_data *rqd,
     }
 }
 
+/*
+ * Budget-related code.
+ */
+
+static void park_vcpu(struct csched2_vcpu *svc)
+{
+    struct vcpu *v = svc->vcpu;
+
+    ASSERT(spin_is_locked(&svc->sdom->budget_lock));
+
+    /*
+     * It was impossible to find budget for this vCPU, so it has to be
+     * "parked". This implies it is not runnable, so we mark it as such in
+     * its pause_flags. If the vCPU is currently scheduled (which means we
+     * are here after being called from within csched_schedule()), flagging
+     * is enough, as we'll choose someone else, and then context_saved()
+     * will take care of updating the load properly.
+     *
+     * If, OTOH, the vCPU is sitting in the runqueue (which means we are here
+     * after being called from within runq_candidate()), we must go all the
+     * way down to taking it out of there, and updating the load accordingly.
+     *
+     * In both cases, we also add it to the list of parked vCPUs of the domain.
+     */
+    __set_bit(_VPF_parked, &v->pause_flags);
+    if ( vcpu_on_runq(svc) )
+    {
+        runq_remove(svc);
+        update_load(svc->sdom->dom->cpupool->sched, svc->rqd, svc, -1, NOW());
+    }
+    list_add(&svc->parked_elem, &svc->sdom->parked_vcpus);
+}
+
+static bool vcpu_grab_budget(struct csched2_vcpu *svc)
+{
+    struct csched2_dom *sdom = svc->sdom;
+    unsigned int cpu = svc->vcpu->processor;
+
+    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
+
+    if ( svc->budget > 0 )
+        return true;
+
+    /* budget_lock nests inside runqueue lock. */
+    spin_lock(&sdom->budget_lock);
+
+    /*
+     * Here, svc->budget is <= 0 (as, if it was > 0, we'd have taken the if
+     * above!). That basically means the vCPU has overrun a bit --because of
+     * various reasons-- and we want to take that into account. With the +=,
+     * we are actually subtracting the amount of budget the vCPU has
+     * overconsumed, from the total domain budget.
+     */
+    sdom->budget += svc->budget;
+
+    if ( sdom->budget > 0 )
+    {
+        /*
+         * NB: we give the whole remaining budget a domain has, to the first
+         * vCPU that comes here and asks for it. This means that, in a domain
+         * with a cap, only 1 vCPU is able to run, at any given time.
+         * /THIS IS GOING TO CHANGE/ in subsequent patches, toward something
+         * that allows much better fairness and parallelism. Proceeding in
+         * two steps, is for making things easy to understand, when looking
+         * at the signle commits.
+         */
+        svc->budget = sdom->budget;
+        sdom->budget = 0;
+    }
+    else
+    {
+        svc->budget = 0;
+        park_vcpu(svc);
+    }
+
+    spin_unlock(&sdom->budget_lock);
+
+    return svc->budget > 0;
+}
+
+static void
+vcpu_return_budget(struct csched2_vcpu *svc, struct list_head *parked)
+{
+    struct csched2_dom *sdom = svc->sdom;
+    unsigned int cpu = svc->vcpu->processor;
+
+    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
+    ASSERT(list_empty(parked));
+
+    /* budget_lock nests inside runqueue lock. */
+    spin_lock(&sdom->budget_lock);
+
+    /*
+     * The vCPU is stopping running (e.g., because it's blocking, or it has
+     * been preempted). If it hasn't consumed all the budget it got when,
+     * starting to run, put that remaining amount back in the domain's budget
+     * pool.
+     */
+    sdom->budget += svc->budget;
+    svc->budget = 0;
+
+    /*
+     * Making budget available again to the domain means that parked vCPUs
+     * may be unparked and run. They are, if any, in the domain's parked_vcpus
+     * list, so we want to go through that and unpark them (so they can try
+     * to get some budget).
+     *
+     * Touching the list requires the budget_lock, which we hold. Let's
+     * therefore put everyone in that list in another, temporary list, which
+     * then the caller will traverse, unparking the vCPUs it finds there.
+     *
+     * In fact, we can't do the actual unparking here, because that requires
+     * taking the runqueue lock of the vCPUs being unparked, and we can't
+     * take any runqueue locks while we hold a budget_lock.
+     */
+    if ( sdom->budget > 0 )
+        list_splice_init(&sdom->parked_vcpus, parked);
+
+    spin_unlock(&sdom->budget_lock);
+}
+
+static void
+unpark_parked_vcpus(const struct scheduler *ops, struct list_head *vcpus)
+{
+    struct csched2_vcpu *svc, *tmp;
+    spinlock_t *lock;
+
+    list_for_each_entry_safe(svc, tmp, vcpus, parked_elem)
+    {
+        unsigned long flags;
+        s_time_t now;
+
+        lock = vcpu_schedule_lock_irqsave(svc->vcpu, &flags);
+
+        __clear_bit(_VPF_parked, &svc->vcpu->pause_flags);
+        if ( unlikely(svc->flags & CSFLAG_scheduled) )
+        {
+            /*
+             * We end here if a budget replenishment arrived between
+             * csched2_schedule() (and, in particular, after a call to
+             * vcpu_grab_budget() that returned false), and
+             * context_saved(). By setting __CSFLAG_delayed_runq_add,
+             * we tell context_saved() to put the vCPU back in the
+             * runqueue, from where it will compete with the others
+             * for the newly replenished budget.
+             */
+            ASSERT( svc->rqd != NULL );
+            ASSERT( c2rqd(ops, svc->vcpu->processor) == svc->rqd );
+            __set_bit(__CSFLAG_delayed_runq_add, &svc->flags);
+        }
+        else if ( vcpu_runnable(svc->vcpu) )
+        {
+            /*
+             * The vCPU should go back to the runqueue, and compete for
+             * the newly replenished budget, but only if it is actually
+             * runnable (and was therefore offline only because of the
+             * lack of budget).
+             */
+            now = NOW();
+            update_load(ops, svc->rqd, svc, 1, now);
+            runq_insert(ops, svc);
+            runq_tickle(ops, svc, now);
+        }
+        list_del_init(&svc->parked_elem);
+
+        vcpu_schedule_unlock_irqrestore(lock, flags, svc->vcpu);
+    }
+}
+
+static inline void do_replenish(struct csched2_dom *sdom)
+{
+    sdom->next_repl += CSCHED2_BDGT_REPL_PERIOD;
+    sdom->budget += sdom->tot_budget;
+}
+
+static void replenish_domain_budget(void* data)
+{
+    struct csched2_dom *sdom = data;
+    unsigned long flags;
+    s_time_t now;
+    LIST_HEAD(parked);
+
+    spin_lock_irqsave(&sdom->budget_lock, flags);
+
+    now = NOW();
+
+    /*
+     * Let's do the replenishment. Note, though, that a domain may overrun,
+     * which means the budget would have gone below 0 (reasons may be system
+     * overbooking, accounting issues, etc.). It also may happen that we are
+     * handling the replenishment (much) later than we should (reasons may
+     * again be overbooking, or issues with timers).
+     *
+     * Even in cases of overrun or delay, however, we expect that in 99% of
+     * cases, doing just one replenishment will be good enough for being able
+     * to unpark the vCPUs that are waiting for some budget.
+     */
+    do_replenish(sdom);
+
+    /*
+     * And now, the special cases:
+     * 1) if we are late enough to have skipped (at least) one full period,
+     * what we must do is doing more replenishments. Note that, however,
+     * every time we add tot_budget to the budget, we also move next_repl
+     * away by CSCHED2_BDGT_REPL_PERIOD, to make sure the cap is always
+     * respected.
+     */
+    if ( unlikely(sdom->next_repl <= now) )
+    {
+        do
+            do_replenish(sdom);
+        while ( sdom->next_repl <= now );
+    }
+    /*
+     * 2) if we overrun by more than tot_budget, then budget+tot_budget is
+     * still < 0, which means that we can't unpark the vCPUs. Let's bail,
+     * and wait for future replenishments.
+     */
+    if ( unlikely(sdom->budget <= 0) )
+    {
+        spin_unlock_irqrestore(&sdom->budget_lock, flags);
+        goto out;
+    }
+
+    /* Since we do more replenishments, make sure we didn't overshot. */
+    sdom->budget = min(sdom->budget, sdom->tot_budget);
+
+    /*
+     * As above, let's prepare the temporary list, out of the domain's
+     * parked_vcpus list, now that we hold the budget_lock. Then, drop such
+     * lock, and pass the list to the unparking function.
+     */
+    list_splice_init(&sdom->parked_vcpus, &parked);
+
+    spin_unlock_irqrestore(&sdom->budget_lock, flags);
+
+    unpark_parked_vcpus(sdom->dom->cpupool->sched, &parked);
+
+ out:
+    set_timer(sdom->repl_timer, sdom->next_repl);
+}
+
 #ifndef NDEBUG
 static inline void
 csched2_vcpu_check(struct vcpu *vc)
@@ -1658,6 +2035,9 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
     }
     svc->tickled_cpu = -1;
 
+    svc->budget = STIME_MAX;
+    INIT_LIST_HEAD(&svc->parked_elem);
+
     SCHED_STAT_CRANK(vcpu_alloc);
 
     return svc;
@@ -1754,6 +2134,7 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
     struct csched2_vcpu * const svc = csched2_vcpu(vc);
     spinlock_t *lock = vcpu_schedule_lock_irq(vc);
     s_time_t now = NOW();
+    LIST_HEAD(were_parked);
 
     BUG_ON( !is_idle_vcpu(vc) && svc->rqd != c2rqd(ops, vc->processor));
     ASSERT(is_idle_vcpu(vc) || svc->rqd == c2rqd(ops, vc->processor));
@@ -1761,6 +2142,9 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
     /* This vcpu is now eligible to be put on the runqueue again */
     __clear_bit(__CSFLAG_scheduled, &svc->flags);
 
+    if ( unlikely(has_cap(svc) && svc->budget > 0) )
+        vcpu_return_budget(svc, &were_parked);
+
     /* If someone wants it on the runqueue, put it there. */
     /*
      * NB: We can get rid of CSFLAG_scheduled by checking for
@@ -1781,6 +2165,8 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
         update_load(ops, svc->rqd, svc, -1, now);
 
     vcpu_schedule_unlock_irq(lock, vc);
+
+    unpark_parked_vcpus(ops, &were_parked);
 }
 
 #define MAX_LOAD (STIME_MAX)
@@ -2483,12 +2869,25 @@ csched2_alloc_domdata(const struct scheduler *ops, struct domain *dom)
     if ( sdom == NULL )
         return NULL;
 
-    /* Initialize credit and weight */
+    sdom->repl_timer = xzalloc(struct timer);
+    if ( sdom->repl_timer == NULL )
+    {
+        xfree(sdom);
+        return NULL;
+    }
+
+    /* Initialize credit, cap and weight */
     INIT_LIST_HEAD(&sdom->sdom_elem);
     sdom->dom = dom;
     sdom->weight = CSCHED2_DEFAULT_WEIGHT;
+    sdom->cap = 0U;
     sdom->nr_vcpus = 0;
 
+    init_timer(sdom->repl_timer, replenish_domain_budget, sdom,
+               cpumask_any(cpupool_domain_cpumask(dom)));
+    spin_lock_init(&sdom->budget_lock);
+    INIT_LIST_HEAD(&sdom->parked_vcpus);
+
     write_lock_irqsave(&prv->lock, flags);
 
     list_add_tail(&sdom->sdom_elem, &csched2_priv(ops)->sdom);
@@ -2524,6 +2923,9 @@ csched2_free_domdata(const struct scheduler *ops, void *data)
 
     write_lock_irqsave(&prv->lock, flags);
 
+    kill_timer(sdom->repl_timer);
+    xfree(sdom->repl_timer);
+
     list_del_init(&sdom->sdom_elem);
 
     write_unlock_irqrestore(&prv->lock, flags);
@@ -2618,11 +3020,12 @@ csched2_runtime(const struct scheduler *ops, int cpu,
         return -1;
 
     /* General algorithm:
-     * 1) Run until snext's credit will be 0
+     * 1) Run until snext's credit will be 0.
      * 2) But if someone is waiting, run until snext's credit is equal
-     * to his
-     * 3) But never run longer than MAX_TIMER or shorter than MIN_TIMER or
-     * the ratelimit time.
+     *    to his.
+     * 3) But, if we are capped, never run more than our budget.
+     * 4) And never run longer than MAX_TIMER or shorter than MIN_TIMER or
+     *    the ratelimit time.
      */
 
     /* Calculate mintime */
@@ -2637,11 +3040,13 @@ csched2_runtime(const struct scheduler *ops, int cpu,
             min_time = ratelimit_min;
     }
 
-    /* 1) Basic time: Run until credit is 0. */
+    /* 1) Run until snext's credit will be 0. */
     rt_credit = snext->credit;
 
-    /* 2) If there's someone waiting whose credit is positive,
-     * run until your credit ~= his */
+    /*
+     * 2) If there's someone waiting whose credit is positive,
+     *    run until your credit ~= his.
+     */
     if ( ! list_empty(runq) )
     {
         struct csched2_vcpu *swait = runq_elem(runq->next);
@@ -2663,14 +3068,22 @@ csched2_runtime(const struct scheduler *ops, int cpu,
      * credit values of MIN,MAX per vcpu, since each vcpu burns credit
      * at a different rate.
      */
-    if (rt_credit > 0)
+    if ( rt_credit > 0 )
         time = c2t(rqd, rt_credit, snext);
     else
         time = 0;
 
-    /* 3) But never run longer than MAX_TIMER or less than MIN_TIMER or
-     * the rate_limit time. */
-    if ( time < min_time)
+    /*
+     * 3) But, if capped, never run more than our budget.
+     */
+    if ( has_cap(snext) )
+        time = snext->budget < time ? snext->budget : time;
+
+    /*
+     * 4) And never run longer than MAX_TIMER or less than MIN_TIMER or
+     *    the rate_limit time.
+     */
+    if ( time < min_time )
     {
         time = min_time;
         SCHED_STAT_CRANK(runtime_min_timer);
@@ -2693,7 +3106,7 @@ runq_candidate(struct csched2_runqueue_data *rqd,
                int cpu, s_time_t now,
                unsigned int *skipped)
 {
-    struct list_head *iter;
+    struct list_head *iter, *temp;
     struct csched2_vcpu *snext = NULL;
     struct csched2_private *prv = csched2_priv(per_cpu(scheduler, cpu));
     bool yield = false, soft_aff_preempt = false;
@@ -2778,7 +3191,7 @@ runq_candidate(struct csched2_runqueue_data *rqd,
         snext = csched2_vcpu(idle_vcpu[cpu]);
 
  check_runq:
-    list_for_each( iter, &rqd->runq )
+    list_for_each_safe( iter, temp, &rqd->runq )
     {
         struct csched2_vcpu * svc = list_entry(iter, struct csched2_vcpu, runq_elem);
 
@@ -2826,11 +3239,13 @@ runq_candidate(struct csched2_runqueue_data *rqd,
         }
 
         /*
-         * If the next one on the list has more credit than current
-         * (or idle, if current is not runnable), or if current is
-         * yielding, choose it.
+         * If the one in the runqueue has more credit than current (or idle,
+         * if current is not runnable), or if current is yielding, and also
+         * if the one in runqueue either is not capped, or is capped but has
+         * some budget, then choose it.
          */
-        if ( yield || svc->credit > snext->credit )
+        if ( (yield || svc->credit > snext->credit) &&
+             (!has_cap(svc) || vcpu_grab_budget(svc)) )
             snext = svc;
 
         /* In any case, if we got this far, break. */
@@ -2857,6 +3272,13 @@ runq_candidate(struct csched2_runqueue_data *rqd,
     if ( unlikely(snext->tickled_cpu != -1 && snext->tickled_cpu != cpu) )
         SCHED_STAT_CRANK(tickled_cpu_overridden);
 
+    /*
+     * If snext is from a capped domain, it must have budget (or it
+     * wouldn't have been in the runq). If it is not, it'd be STIME_MAX,
+     * which still is >= 0.
+     */
+    ASSERT(snext->budget >= 0);
+
     return snext;
 }
 
@@ -2914,10 +3336,18 @@ csched2_schedule(
                     (unsigned char *)&d);
     }
 
-    /* Update credits */
+    /* Update credits (and budget, if necessary). */
     burn_credits(rqd, scurr, now);
 
     /*
+     *  Below 0, means that we are capped and we have overrun our  budget.
+     *  Let's try to get some more but, if we fail (e.g., because of the
+     *  other running vcpus), we will be parked.
+     */
+    if ( unlikely(scurr->budget <= 0) )
+        vcpu_grab_budget(scurr);
+
+    /*
      * Select next runnable local VCPU (ie top of local runq).
      *
      * If the current vcpu is runnable, and has higher credit than
@@ -3051,6 +3481,9 @@ csched2_dump_vcpu(struct csched2_private *prv, struct csched2_vcpu *svc)
 
     printk(" credit=%" PRIi32" [w=%u]", svc->credit, svc->weight);
 
+    if ( has_cap(svc) )
+        printk(" budget=%"PRI_stime, svc->budget);
+
     printk(" load=%"PRI_stime" (~%"PRI_stime"%%)", svc->avgload,
            (svc->avgload * 100) >> prv->load_precision_shift);
 
@@ -3138,9 +3571,10 @@ csched2_dump(const struct scheduler *ops)
 
         sdom = list_entry(iter_sdom, struct csched2_dom, sdom_elem);
 
-        printk("\tDomain: %d w %d v %d\n",
+        printk("\tDomain: %d w %d c %u v %d\n",
                sdom->dom->domain_id,
                sdom->weight,
+               sdom->cap,
                sdom->nr_vcpus);
 
         for_each_vcpu( sdom->dom, v )
@@ -3360,12 +3794,14 @@ csched2_init(struct scheduler *ops)
            XENLOG_INFO " load_window_shift: %d\n"
            XENLOG_INFO " underload_balance_tolerance: %d\n"
            XENLOG_INFO " overload_balance_tolerance: %d\n"
-           XENLOG_INFO " runqueues arrangement: %s\n",
+           XENLOG_INFO " runqueues arrangement: %s\n"
+           XENLOG_INFO " cap enforcement granularity: %dms\n",
            opt_load_precision_shift,
            opt_load_window_shift,
            opt_underload_balance_tolerance,
            opt_overload_balance_tolerance,
-           opt_runqueue_str[opt_runqueue]);
+           opt_runqueue_str[opt_runqueue],
+           opt_cap_period);
 
     if ( opt_load_precision_shift < LOADAVG_PRECISION_SHIFT_MIN )
     {
@@ -3383,6 +3819,13 @@ csched2_init(struct scheduler *ops)
     printk(XENLOG_INFO "load tracking window length %llu ns\n",
            1ULL << opt_load_window_shift);
 
+    if ( CSCHED2_BDGT_REPL_PERIOD < CSCHED2_MIN_TIMER )
+    {
+        printk("WARNING: %s: opt_cap_period %d too small, resetting\n",
+               __func__, opt_cap_period);
+        opt_cap_period = 10; /* ms */
+    }
+
     /*
      * Basically no CPU information is available at this point; just
      * set up basic structures, and a callback when the CPU info is
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 5828a01..9ab8585 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -792,6 +792,9 @@ static inline struct domain *next_domain_in_cpupool(
  /* VCPU is being reset. */
 #define _VPF_in_reset        7
 #define VPF_in_reset         (1UL<<_VPF_in_reset)
+/* VCPU is parked. */
+#define _VPF_parked          8
+#define VPF_parked           (1UL<<_VPF_parked)
 
 static inline int vcpu_runnable(struct vcpu *v)
 {