From patchwork Thu Jun 8 12:08:46 2017
X-Patchwork-Submitter: Dario Faggioli
X-Patchwork-Id: 9774531
From: Dario Faggioli
To: xen-devel@lists.xenproject.org
Date: Thu, 08 Jun 2017 14:08:46 +0200
Message-ID: <149692372627.9605.8252407697848997058.stgit@Solace.fritz.box>
In-Reply-To: <149692186557.9605.11625777539060264052.stgit@Solace.fritz.box>
References: <149692186557.9605.11625777539060264052.stgit@Solace.fritz.box>
User-Agent: StGit/0.17.1-dirty
Cc: Wei Liu , George Dunlap , Andrew Cooper , Anshul Makkar , Ian Jackson , Jan Beulich
Subject: [Xen-devel] [PATCH 1/4] xen: credit2: implement utilization cap

This commit implements the Xen part of the cap mechanism for Credit2.

A cap is the maximum amount of physical CPU time, expressed as a percentage, that a domain is allowed to consume. For instance, a domain that must not use more than 1/4 of one physical CPU must have a cap of 25%; one that must not use more than 1+1/2 physical CPUs must be given a cap of 150%.

Caps are per domain, so it is all of a domain's vCPUs, cumulatively, that are forced to execute no more than the configured amount.

This is implemented by giving each domain a 'budget', and using a (again per-domain) periodic timer. The values of budget and 'period' are chosen so that budget/period is equal to the cap itself. Budget is burned by the domain's vCPUs in a way similar to how credits are. When a domain runs out of budget, its vCPUs can't run any longer; they can run again once the budget is replenished by the timer, which happens once every period.

Blocking the vCPUs because of lack of budget is done by means of a new (_VPF_parked) pause flag, so that, e.g., vcpu_runnable() still works. This is similar to what is done in sched_rtds.c, as opposed to what happens in sched_credit.c, where vcpu_pause() and vcpu_unpause() are used (which means, among other things, more overhead).
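As a concrete illustration of the budget/period relation (this snippet is not part of the patch; BUDGET_PERIOD_NS and cap_to_budget are made-up names used only for the example), the per-period budget a domain receives follows directly from its cap and the replenishment period, e.g. a 150% cap with a 10ms period gives the domain 15ms of CPU time per period:

    /* Illustrative only: derive a domain's per-period budget from its cap.
     * A cap of 100 means "one full pCPU"; with the (default) 10ms period,
     * a 150% cap yields 15ms of CPU time per period for the whole domain. */
    #include <stdint.h>

    #define BUDGET_PERIOD_NS 10000000ULL  /* 10 ms, cf. credit2_cap_period_ms */

    static uint64_t cap_to_budget(unsigned int cap_pct)
    {
        return (uint64_t)cap_pct * BUDGET_PERIOD_NS / 100;
    }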
Note that xenalyze and tools/xentrace/format are also modified, to keep them updated with one modified event. Signed-off-by: Dario Faggioli Reviewed-by: Anshul Makkar --- Cc: George Dunlap Cc: Anshul Makkar Cc: Andrew Cooper Cc: Jan Beulich Cc: Ian Jackson Cc: Wei Liu --- tools/xentrace/formats | 2 tools/xentrace/xenalyze.c | 10 + xen/common/sched_credit2.c | 470 +++++++++++++++++++++++++++++++++++++++++--- xen/include/xen/sched.h | 3 4 files changed, 445 insertions(+), 40 deletions(-) diff --git a/tools/xentrace/formats b/tools/xentrace/formats index 8b31780..142b0cf 100644 --- a/tools/xentrace/formats +++ b/tools/xentrace/formats @@ -51,7 +51,7 @@ 0x00022201 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:tick 0x00022202 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:runq_pos [ dom:vcpu = 0x%(1)08x, pos = %(2)d] -0x00022203 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:credit burn [ dom:vcpu = 0x%(1)08x, credit = %(2)d, delta = %(3)d ] +0x00022203 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:credit burn [ dom:vcpu = 0x%(1)08x, credit = %(2)d, budget = %(3)d, delta = %(4)d ] 0x00022204 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:credit_add 0x00022205 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:tickle_check [ dom:vcpu = 0x%(1)08x, credit = %(2)d ] 0x00022206 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:tickle [ cpu = %(1)d ] diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c index fa608ad..c16c02d 100644 --- a/tools/xentrace/xenalyze.c +++ b/tools/xentrace/xenalyze.c @@ -7680,12 +7680,14 @@ void sched_process(struct pcpu_info *p) if(opt.dump_all) { struct { unsigned int vcpuid:16, domid:16; - int credit, delta; + int credit, budget, delta; } *r = (typeof(r))ri->d; - printf(" %s csched2:burn_credits d%uv%u, credit = %d, delta = %d\n", - ri->dump_header, r->domid, r->vcpuid, - r->credit, r->delta); + printf(" %s csched2:burn_credits d%uv%u, credit = %d, ", + ri->dump_header, r->domid, r->vcpuid, r->credit); + if ( r->budget != INT_MIN ) + printf("budget = %d, ", r->budget); + printf("delta = %d\n", r->delta); } break; case TRC_SCHED_CLASS_EVT(CSCHED2, 5): /* TICKLE_CHECK */ diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c index 126417c..ba4bf4b 100644 --- a/xen/common/sched_credit2.c +++ b/xen/common/sched_credit2.c @@ -92,6 +92,82 @@ */ /* + * Utilization cap: + * + * Setting an pCPU utilization cap for a domain means the following: + * + * - a domain can have a cap, expressed in terms of % of physical CPU time. + * A domain that must not use more than 1/4 of _one_ physical CPU, will + * be given a cap of 25%; a domain that must not use more than 1+1/2 of + * physical CPU time, will be given a cap of 150%; + * + * - caps are per-domain (not per-vCPU). If a domain has only 1 vCPU, and + * a 40% cap, that one vCPU will use 40% of one pCPU. If a somain has 4 + * vCPUs, and a 200% cap, all its 4 vCPUs are allowed to run for (the + * equivalent of) 100% time on 2 pCPUs. How much each of the various 4 + * vCPUs will get, is unspecified (will depend on various aspects: workload, + * system load, etc.). + * + * For implementing this, we use the following approach: + * + * - each domain is given a 'budget', an each domain has a timer, which + * replenishes the domain's budget periodically. 
The budget is the amount + * of time the vCPUs of the domain can use every 'period'; + * + * - the period is CSCHED2_BDGT_REPL_PERIOD, and is the same for all domains + * (but each domain has its own timer; so the all are periodic by the same + * period, but replenishment of the budgets of the various domains, at + * periods boundaries, are not synchronous); + * + * - when vCPUs run, they consume budget. When they don't run, they don't + * consume budget. If there is no budget left for the domain, no vCPU of + * that domain can run. If a vCPU tries to run and finds that there is no + * budget, it blocks. + * Budget never expires, so at whatever time a vCPU wants to run, it can + * check the domain's budget, and if there is some, it can use it. + * + * - budget is replenished to the top of the capacity for the domain once + * per period. Even if there was some leftover budget from previous period, + * though, the budget after a replenishment will always be at most equal + * to the total capacify of the domain ('tot_budget'); + * + * - when a budget replenishment occurs, if there are vCPUs that had been + * blocked because of lack of budget, they'll be unblocked, and they will + * (potentially) be able to run again. + * + * Finally, some even more implementation related detail: + * + * - budget is stored in a domain-wide pool. vCPUs of the domain that want + * to run go to such pool, and grub some. When they do so, the amount + * they grabbed is _immediately_ removed from the pool. This happens in + * vcpu_try_to_get_budget(); + * + * - when vCPUs stop running, if they've not consumed all the budget they + * took, the leftover is put back in the pool. This happens in + * vcpu_give_budget_back(); + * + * - the above means that a vCPU can find out that there is no budget and + * block, not only if the cap has actually been reached (for this period), + * but also if some other vCPUs, in order to run, have grabbed a certain + * quota of budget, no matter whether they've already used it all or not. + * A vCPU blocking because (any form of) lack of budget is said to be + * "parked", and such blocking happens in park_vcpu(); + * + * - when a vCPU stops running, and puts back some budget in the domain pool, + * we need to check whether there is someone which has been parked and that + * can be unparked. This happens in unpark_parked_vcpus(), called from + * csched2_context_saved(); + * + * - of course, unparking happens also as a consequene of the domain's budget + * being replenished by the periodic timer. This also occurs by means of + * calling csched2_context_saved() (but from repl_sdom_budget()); + * + * - parked vCPUs of a domain are kept in a (per-domain) list, called + * 'parked_vcpus'). Manipulation of the list and of the domain-wide budget + * pool, must occur only when holding the 'budget_lock'. + */ + +/* * Locking: * * - runqueue lock @@ -112,18 +188,29 @@ * runqueue each cpu is; * + serializes the operation of changing the weights of domains; * + * - Budget lock + * + it is per-domain; + * + protects, in domains that have an utilization cap; + * * manipulation of the total budget of the domain (as it is shared + * among all vCPUs of the domain), + * * manipulation of the list of vCPUs that are blocked waiting for + * some budget to be available. + * * - Type: * + runqueue locks are 'regular' spinlocks; * + the private scheduler lock can be an rwlock. 
In fact, data * it protects is modified only during initialization, cpupool * manipulation and when changing weights, and read in all - * other cases (e.g., during load balancing). + * other cases (e.g., during load balancing); + * + budget locks are 'regular' spinlocks. * * Ordering: * + tylock must be used when wanting to take a runqueue lock, * if we already hold another one; * + if taking both a runqueue lock and the private scheduler - * lock is, the latter must always be taken for first. + * lock is, the latter must always be taken for first; + * + if taking both a runqueue lock and a budget lock, the former + * must always be taken for first. */ /* @@ -164,6 +251,8 @@ #define CSCHED2_CREDIT_RESET 0 /* Max timer: Maximum time a guest can be run for. */ #define CSCHED2_MAX_TIMER CSCHED2_CREDIT_INIT +/* Period of the cap replenishment timer. */ +#define CSCHED2_BDGT_REPL_PERIOD ((opt_cap_period)*MILLISECS(1)) /* * Flags @@ -293,6 +382,14 @@ static int __read_mostly opt_underload_balance_tolerance = 0; integer_param("credit2_balance_under", opt_underload_balance_tolerance); static int __read_mostly opt_overload_balance_tolerance = -3; integer_param("credit2_balance_over", opt_overload_balance_tolerance); +/* + * Domains subject to a cap, receive a replenishment of their runtime budget + * once every opt_cap_period interval. Default is 10 ms. The amount of budget + * they receive depends on their cap. For instance, a domain with a 50% cap + * will receive 50% of 10 ms, so 5 ms. + */ +static unsigned int __read_mostly opt_cap_period = 10; /* ms */ +integer_param("credit2_cap_period_ms", opt_cap_period); /* * Runqueue organization. @@ -408,6 +505,10 @@ struct csched2_vcpu { unsigned int residual; int credit; + + s_time_t budget; + struct list_head parked_elem; /* On the parked_vcpus list */ + s_time_t start_time; /* When we were scheduled (used for credit) */ unsigned flags; /* 16 bits doesn't seem to play well with clear_bit() */ int tickled_cpu; /* cpu tickled for picking us up (-1 if none) */ @@ -425,7 +526,15 @@ struct csched2_vcpu { struct csched2_dom { struct list_head sdom_elem; struct domain *dom; + + spinlock_t budget_lock; + struct timer repl_timer; + s_time_t next_repl; + s_time_t budget, tot_budget; + struct list_head parked_vcpus; + uint16_t weight; + uint16_t cap; uint16_t nr_vcpus; }; @@ -460,6 +569,12 @@ static inline struct csched2_runqueue_data *c2rqd(const struct scheduler *ops, return &csched2_priv(ops)->rqd[c2r(ops, cpu)]; } +/* Does the domain of this vCPU have a cap? */ +static inline bool has_cap(const struct csched2_vcpu *svc) +{ + return svc->budget != STIME_MAX; +} + /* * Hyperthreading (SMT) support. * @@ -1354,7 +1469,16 @@ static void reset_credit(const struct scheduler *ops, int cpu, s_time_t now, * that the credit it has spent so far get accounted. */ if ( svc->vcpu == curr_on_cpu(svc_cpu) ) + { burn_credits(rqd, svc, now); + /* + * And, similarly, in case it has run out of budget, as a + * consequence of this round of accounting, we also must inform + * its pCPU that it's time to park it, and pick up someone else. + */ + if ( unlikely(svc->budget <= 0) ) + tickle_cpu(svc_cpu, rqd); + } start_credit = svc->credit; @@ -1410,27 +1534,35 @@ void burn_credits(struct csched2_runqueue_data *rqd, delta = now - svc->start_time; - if ( likely(delta > 0) ) - { - SCHED_STAT_CRANK(burn_credits_t2c); - t2c_update(rqd, delta, svc); - svc->start_time = now; - } - else if ( delta < 0 ) + if ( unlikely(delta <= 0) ) { - d2printk("WARNING: %s: Time went backwards? 
now %"PRI_stime" start_time %"PRI_stime"\n", - __func__, now, svc->start_time); + if ( unlikely(delta < 0) ) + d2printk("WARNING: %s: Time went backwards? now %"PRI_stime + " start_time %"PRI_stime"\n", __func__, now, + svc->start_time); + goto out; } + SCHED_STAT_CRANK(burn_credits_t2c); + t2c_update(rqd, delta, svc); + + if ( unlikely(svc->budget != STIME_MAX) ) + svc->budget -= delta; + + svc->start_time = now; + + out: if ( unlikely(tb_init_done) ) { struct { unsigned vcpu:16, dom:16; - int credit, delta; + int credit, budget; + int delta; } d; d.dom = svc->vcpu->domain->domain_id; d.vcpu = svc->vcpu->vcpu_id; d.credit = svc->credit; + d.budget = has_cap(svc) ? svc->budget : INT_MIN; d.delta = delta; __trace_var(TRC_CSCHED2_CREDIT_BURN, 1, sizeof(d), @@ -1438,6 +1570,217 @@ void burn_credits(struct csched2_runqueue_data *rqd, } } +/* + * Budget-related code. + */ + +static void park_vcpu(struct csched2_vcpu *svc) +{ + struct vcpu *v = svc->vcpu; + + ASSERT(spin_is_locked(&svc->sdom->budget_lock)); + + /* + * It was impossible to find budget for this vCPU, so it has to be + * "parked". This implies it is not runnable, so we mark it as such in + * its pause_flags. If the vCPU is currently scheduled (which means we + * are here after being called from within csched_schedule()), flagging + * is enough, as we'll choose someone else, and then context_saved() + * will take care of updating the load properly. + * + * If, OTOH, the vCPU is sitting in the runqueue (which means we are here + * after being called from within runq_candidate()), we must go all the + * way down to taking it out of there, and updating the load accordingly. + * + * In both cases, we also add it to the list of parked vCPUs of the domain. + */ + __set_bit(_VPF_parked, &v->pause_flags); + if ( vcpu_on_runq(svc) ) + { + runq_remove(svc); + update_load(svc->sdom->dom->cpupool->sched, svc->rqd, svc, -1, NOW()); + } + list_add(&svc->parked_elem, &svc->sdom->parked_vcpus); +} + +static bool vcpu_try_to_get_budget(struct csched2_vcpu *svc) +{ + struct csched2_dom *sdom = svc->sdom; + unsigned int cpu = svc->vcpu->processor; + + ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock)); + + if ( svc->budget > 0 ) + return true; + + /* budget_lock nests inside runqueue lock. */ + spin_lock(&sdom->budget_lock); + + /* + * Here, svc->budget is <= 0 (as, if it was > 0, we'd have taken the if + * above!). That basically means the vCPU has overrun a bit --because of + * various reasons-- and we want to take that into account. With the +=, + * we are actually subtracting the amount of budget the vCPU has + * overconsumed, from the total domain budget. + */ + sdom->budget += svc->budget; + + if ( sdom->budget > 0 ) + { + svc->budget = sdom->budget; + sdom->budget = 0; + } + else + { + svc->budget = 0; + park_vcpu(svc); + } + + spin_unlock(&sdom->budget_lock); + + return svc->budget > 0; +} + +static void +vcpu_give_budget_back(struct csched2_vcpu *svc, struct list_head *parked) +{ + struct csched2_dom *sdom = svc->sdom; + unsigned int cpu = svc->vcpu->processor; + + ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock)); + ASSERT(list_empty(parked)); + + /* budget_lock nests inside runqueue lock. */ + spin_lock(&sdom->budget_lock); + + /* + * The vCPU is stopping running (e.g., because it's blocking, or it has + * been preempted). If it hasn't consumed all the budget it got when, + * starting to run, put that remaining amount back in the domain's budget + * pool. 
+ */ + sdom->budget += svc->budget; + svc->budget = 0; + + /* + * Making budget available again to the domain means that parked vCPUs + * may be unparked and run. They are, if any, in the domain's parked_vcpus + * list, so we want to go through that and unpark them (so they can try + * to get some budget). + * + * Touching the list requires the budget_lock, which we hold. Let's + * therefore put everyone in that list in another, temporary list, which + * then the caller will traverse, unparking the vCPUs it finds there. + * + * In fact, we can't do the actual unparking here, because that requires + * taking the runqueue lock of the vCPUs being unparked, and we can't + * take any runqueue locks while we hold a budget_lock. + */ + if ( sdom->budget > 0 ) + list_splice_init(&sdom->parked_vcpus, parked); + + spin_unlock(&sdom->budget_lock); +} + +static void +unpark_parked_vcpus(const struct scheduler *ops, struct list_head *vcpus) +{ + struct csched2_vcpu *svc, *tmp; + spinlock_t *lock; + + list_for_each_entry_safe(svc, tmp, vcpus, parked_elem) + { + unsigned long flags; + s_time_t now; + + lock = vcpu_schedule_lock_irqsave(svc->vcpu, &flags); + + __clear_bit(_VPF_parked, &svc->vcpu->pause_flags); + if ( unlikely(svc->flags & CSFLAG_scheduled) ) + { + /* + * We end here if a budget replenishment arrived between + * csched2_schedule() (and, in particular, after a call to + * vcpu_try_to_get_budget() that returned false), and + * context_saved(). By setting __CSFLAG_delayed_runq_add, + * we tell context_saved() to put the vCPU back in the + * runqueue, from where it will compete with the others + * for the newly replenished budget. + */ + ASSERT( svc->rqd != NULL ); + ASSERT( c2rqd(ops, svc->vcpu->processor) == svc->rqd ); + __set_bit(__CSFLAG_delayed_runq_add, &svc->flags); + } + else if ( vcpu_runnable(svc->vcpu) ) + { + /* + * The vCPU should go back to the runqueue, and compete for + * the newly replenished budget, but only if it is actually + * runnable (and was therefore offline only because of the + * lack of budget). + */ + now = NOW(); + update_load(ops, svc->rqd, svc, 1, now); + runq_insert(ops, svc); + runq_tickle(ops, svc, now); + } + list_del_init(&svc->parked_elem); + + vcpu_schedule_unlock_irqrestore(lock, flags, svc->vcpu); + } +} + +static void repl_sdom_budget(void* data) +{ + struct csched2_dom *sdom = data; + unsigned long flags; + s_time_t now; + LIST_HEAD(parked); + + spin_lock_irqsave(&sdom->budget_lock, flags); + + /* + * It is possible that the domain overrun, and that the budget hence went + * below 0 (reasons may be system overbooking, issues in or too coarse + * runtime accounting, etc.). In particular, if we overrun by more than + * tot_budget, then budget+tot_budget would still be < 0, which in turn + * means that, despite replenishment, there's still no budget for unarking + * and running vCPUs. + * + * It is also possible that we are handling the replenishment much later + * than expected (reasons may again be overbooking, or issues with timers). + * If we are more than CSCHED2_BDGT_REPL_PERIOD late, this means we have + * basically skipped (at least) one replenishment. + * + * We deal with both the issues here, by, basically, doing more than just + * one replenishment. Note, however, that every time we add tot_budget + * to the budget, we also move next_repl away by CSCHED2_BDGT_REPL_PERIOD. + * This guarantees we always respect the cap. 
+ */ + now = NOW(); + do + { + sdom->next_repl += CSCHED2_BDGT_REPL_PERIOD; + sdom->budget += sdom->tot_budget; + } + while ( sdom->next_repl <= now || sdom->budget <= 0 ); + /* We may have done more replenishments: make sure we didn't overshot. */ + sdom->budget = min(sdom->budget, sdom->tot_budget); + + /* + * As above, let's prepare the temporary list, out of the domain's + * parked_vcpus list, now that we hold the budget_lock. Then, drop such + * lock, and pass the list to the unparking function. + */ + list_splice_init(&sdom->parked_vcpus, &parked); + + spin_unlock_irqrestore(&sdom->budget_lock, flags); + + unpark_parked_vcpus(sdom->dom->cpupool->sched, &parked); + + set_timer(&sdom->repl_timer, sdom->next_repl); +} + #ifndef NDEBUG static inline void csched2_vcpu_check(struct vcpu *vc) @@ -1497,6 +1840,9 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd) } svc->tickled_cpu = -1; + svc->budget = STIME_MAX; + INIT_LIST_HEAD(&svc->parked_elem); + SCHED_STAT_CRANK(vcpu_alloc); return svc; @@ -1593,6 +1939,7 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc) struct csched2_vcpu * const svc = csched2_vcpu(vc); spinlock_t *lock = vcpu_schedule_lock_irq(vc); s_time_t now = NOW(); + LIST_HEAD(were_parked); BUG_ON( !is_idle_vcpu(vc) && svc->rqd != c2rqd(ops, vc->processor)); ASSERT(is_idle_vcpu(vc) || svc->rqd == c2rqd(ops, vc->processor)); @@ -1600,6 +1947,9 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc) /* This vcpu is now eligible to be put on the runqueue again */ __clear_bit(__CSFLAG_scheduled, &svc->flags); + if ( unlikely(has_cap(svc) && svc->budget > 0) ) + vcpu_give_budget_back(svc, &were_parked); + /* If someone wants it on the runqueue, put it there. */ /* * NB: We can get rid of CSFLAG_scheduled by checking for @@ -1620,6 +1970,8 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc) update_load(ops, svc->rqd, svc, -1, now); vcpu_schedule_unlock_irq(lock, vc); + + unpark_parked_vcpus(ops, &were_parked); } #define MAX_LOAD (STIME_MAX); @@ -2243,12 +2595,18 @@ csched2_alloc_domdata(const struct scheduler *ops, struct domain *dom) if ( sdom == NULL ) return NULL; - /* Initialize credit and weight */ + /* Initialize credit, cap and weight */ INIT_LIST_HEAD(&sdom->sdom_elem); sdom->dom = dom; sdom->weight = CSCHED2_DEFAULT_WEIGHT; + sdom->cap = 0U; sdom->nr_vcpus = 0; + init_timer(&sdom->repl_timer, repl_sdom_budget, (void*) sdom, + cpumask_any(cpupool_domain_cpumask(dom))); + spin_lock_init(&sdom->budget_lock); + INIT_LIST_HEAD(&sdom->parked_vcpus); + write_lock_irqsave(&prv->lock, flags); list_add_tail(&sdom->sdom_elem, &csched2_priv(ops)->sdom); @@ -2284,6 +2642,7 @@ csched2_free_domdata(const struct scheduler *ops, void *data) write_lock_irqsave(&prv->lock, flags); + kill_timer(&sdom->repl_timer); list_del_init(&sdom->sdom_elem); write_unlock_irqrestore(&prv->lock, flags); @@ -2378,11 +2737,12 @@ csched2_runtime(const struct scheduler *ops, int cpu, return -1; /* General algorithm: - * 1) Run until snext's credit will be 0 + * 1) Run until snext's credit will be 0. * 2) But if someone is waiting, run until snext's credit is equal - * to his - * 3) But never run longer than MAX_TIMER or shorter than MIN_TIMER or - * the ratelimit time. + * to his. + * 3) But, if we are capped, never run more than our budget. + * 4) But never run longer than MAX_TIMER or shorter than MIN_TIMER or + * the ratelimit time. 
*/ /* Calculate mintime */ @@ -2397,11 +2757,13 @@ csched2_runtime(const struct scheduler *ops, int cpu, min_time = ratelimit_min; } - /* 1) Basic time: Run until credit is 0. */ + /* 1) Run until snext's credit will be 0. */ rt_credit = snext->credit; - /* 2) If there's someone waiting whose credit is positive, - * run until your credit ~= his */ + /* + * 2) If there's someone waiting whose credit is positive, + * run until your credit ~= his. + */ if ( ! list_empty(runq) ) { struct csched2_vcpu *swait = runq_elem(runq->next); @@ -2423,14 +2785,22 @@ csched2_runtime(const struct scheduler *ops, int cpu, * credit values of MIN,MAX per vcpu, since each vcpu burns credit * at a different rate. */ - if (rt_credit > 0) + if ( rt_credit > 0 ) time = c2t(rqd, rt_credit, snext); else time = 0; - /* 3) But never run longer than MAX_TIMER or less than MIN_TIMER or - * the rate_limit time. */ - if ( time < min_time) + /* + * 3) But, if capped, never run more than our budget. + */ + if ( unlikely(has_cap(snext)) ) + time = snext->budget < time ? snext->budget : time; + + /* + * 4) But never run longer than MAX_TIMER or less than MIN_TIMER or + * the rate_limit time. + */ + if ( time < min_time ) { time = min_time; SCHED_STAT_CRANK(runtime_min_timer); @@ -2447,13 +2817,13 @@ csched2_runtime(const struct scheduler *ops, int cpu, /* * Find a candidate. */ -static struct csched2_vcpu * +static noinline struct csched2_vcpu * runq_candidate(struct csched2_runqueue_data *rqd, struct csched2_vcpu *scurr, int cpu, s_time_t now, unsigned int *skipped) { - struct list_head *iter; + struct list_head *iter, *temp; struct csched2_vcpu *snext = NULL; struct csched2_private *prv = csched2_priv(per_cpu(scheduler, cpu)); bool yield = __test_and_clear_bit(__CSFLAG_vcpu_yield, &scurr->flags); @@ -2496,7 +2866,7 @@ runq_candidate(struct csched2_runqueue_data *rqd, else snext = csched2_vcpu(idle_vcpu[cpu]); - list_for_each( iter, &rqd->runq ) + list_for_each_safe( iter, temp, &rqd->runq ) { struct csched2_vcpu * svc = list_entry(iter, struct csched2_vcpu, runq_elem); @@ -2544,11 +2914,13 @@ runq_candidate(struct csched2_runqueue_data *rqd, } /* - * If the next one on the list has more credit than current - * (or idle, if current is not runnable), or if current is - * yielding, choose it. + * If the one in the runqueue has more credit than current (or idle, + * if current is not runnable), or if current is yielding, and also + * if the one in runqueue either is not capped, or is capped but has + * some budget, then choose it. */ - if ( yield || svc->credit > snext->credit ) + if ( (yield || svc->credit > snext->credit) && + (!has_cap(svc) || vcpu_try_to_get_budget(svc)) ) snext = svc; /* In any case, if we got this far, break. */ @@ -2575,6 +2947,13 @@ runq_candidate(struct csched2_runqueue_data *rqd, if ( unlikely(snext->tickled_cpu != -1 && snext->tickled_cpu != cpu) ) SCHED_STAT_CRANK(tickled_cpu_overridden); + /* + * If snext is from a capped domain, it must have budget (or it + * wouldn't have been in the runq). If it is not, it'd be STIME_MAX, + * which still is >= 0. + */ + ASSERT(snext->budget >= 0); + return snext; } @@ -2632,10 +3011,18 @@ csched2_schedule( (unsigned char *)&d); } - /* Update credits */ + /* Update credits (and budget, if necessary). */ burn_credits(rqd, scurr, now); /* + * Below 0, means that we are capped and we have overrun our budget. + * Let's try to get some more but, if we fail (e.g., because of the + * other running vcpus), we will be parked. 
+ */ + if ( unlikely(scurr->budget <= 0) ) + vcpu_try_to_get_budget(scurr); + + /* * Select next runnable local VCPU (ie top of local runq). * * If the current vcpu is runnable, and has higher credit than @@ -2769,6 +3156,9 @@ csched2_dump_vcpu(struct csched2_private *prv, struct csched2_vcpu *svc) printk(" credit=%" PRIi32" [w=%u]", svc->credit, svc->weight); + if ( has_cap(svc) ) + printk(" budget=%"PRI_stime, svc->budget); + printk(" load=%"PRI_stime" (~%"PRI_stime"%%)", svc->avgload, (svc->avgload * 100) >> prv->load_precision_shift); @@ -2856,9 +3246,10 @@ csched2_dump(const struct scheduler *ops) sdom = list_entry(iter_sdom, struct csched2_dom, sdom_elem); - printk("\tDomain: %d w %d v %d\n", + printk("\tDomain: %d w %d c %u v %d\n", sdom->dom->domain_id, sdom->weight, + sdom->cap, sdom->nr_vcpus); for_each_vcpu( sdom->dom, v ) @@ -3076,12 +3467,14 @@ csched2_init(struct scheduler *ops) XENLOG_INFO " load_window_shift: %d\n" XENLOG_INFO " underload_balance_tolerance: %d\n" XENLOG_INFO " overload_balance_tolerance: %d\n" - XENLOG_INFO " runqueues arrangement: %s\n", + XENLOG_INFO " runqueues arrangement: %s\n" + XENLOG_INFO " cap enforcement granularity: %dms\n", opt_load_precision_shift, opt_load_window_shift, opt_underload_balance_tolerance, opt_overload_balance_tolerance, - opt_runqueue_str[opt_runqueue]); + opt_runqueue_str[opt_runqueue], + opt_cap_period); if ( opt_load_precision_shift < LOADAVG_PRECISION_SHIFT_MIN ) { @@ -3099,6 +3492,13 @@ csched2_init(struct scheduler *ops) printk(XENLOG_INFO "load tracking window length %llu ns\n", 1ULL << opt_load_window_shift); + if ( CSCHED2_BDGT_REPL_PERIOD < CSCHED2_MIN_TIMER ) + { + printk("WARNING: %s: opt_cap_period %d too small, resetting\n", + __func__, opt_cap_period); + opt_cap_period = 10; /* ms */ + } + /* Basically no CPU information is available at this point; just * set up basic structures, and a callback when the CPU info is * available. */ diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index 1127ca9..2c7f9cc 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -787,6 +787,9 @@ static inline struct domain *next_domain_in_cpupool( /* VCPU is being reset. */ #define _VPF_in_reset 7 #define VPF_in_reset (1UL<<_VPF_in_reset) +/* VCPU is parked. */ +#define _VPF_parked 8 +#define VPF_parked (1UL<<_VPF_parked) static inline int vcpu_runnable(struct vcpu *v) {
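
As an aside, the budget-pool behaviour described in the comment block added by this patch can be summarised by the following standalone sketch. It is a deliberately simplified model for illustration (no budget_lock, no parked-vCPU list, made-up type and function names), not the hypervisor code itself:

    #include <stdbool.h>
    #include <stdint.h>

    struct dom_pool { int64_t budget; };   /* domain-wide pool (sdom->budget) */
    struct vcpu_bgt { int64_t budget; };   /* per-vCPU share   (svc->budget)  */

    /* Model of the vcpu_try_to_get_budget() logic: account for any overrun,
     * then grab whatever is left in the pool; an empty pool means parking. */
    static bool try_get_budget(struct dom_pool *p, struct vcpu_bgt *v)
    {
        if ( v->budget > 0 )
            return true;
        p->budget += v->budget;      /* v->budget <= 0: subtract the overrun */
        if ( p->budget > 0 )
        {
            v->budget = p->budget;   /* take everything that is available    */
            p->budget = 0;
        }
        else
            v->budget = 0;           /* nothing left: the vCPU gets parked   */
        return v->budget > 0;
    }

    /* Model of vcpu_give_budget_back(): return the unconsumed share. */
    static void give_budget_back(struct dom_pool *p, struct vcpu_bgt *v)
    {
        p->budget += v->budget;
        v->budget = 0;
    }

Similarly, the replenishment loop (catching up on late or skipped replenishments, and clamping to the domain's total budget so the cap is always respected) can be modelled, under the same simplifying assumptions, as:

    /* Simplified re-statement of repl_sdom_budget(): replenish as many times
     * as needed to get past 'now' with a positive budget, then clamp so the
     * stored budget never exceeds one period's worth. */
    static void replenish(int64_t *budget, int64_t tot_budget,
                          int64_t *next_repl, int64_t period, int64_t now)
    {
        do
        {
            *next_repl += period;
            *budget += tot_budget;
        } while ( *next_repl <= now || *budget <= 0 );

        if ( *budget > tot_budget )
            *budget = tot_budget;
    }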