From patchwork Fri Jul 15 14:49:26 2016
X-Patchwork-Submitter: Dario Faggioli
X-Patchwork-Id: 9232161
From: Dario Faggioli <raistlin.df@gmail.com>
To: xen-devel@lists.xenproject.org
Cc: Anshul Makkar, George Dunlap, David Vrabel
Date: Fri, 15 Jul 2016 16:49:26 +0200
Message-ID: <146859416601.10217.14469582386972310679.stgit@Solace.fritz.box>
In-Reply-To: <146859397891.10217.10155969474613302167.stgit@Solace.fritz.box>
References: <146859397891.10217.10155969474613302167.stgit@Solace.fritz.box>
User-Agent: StGit/0.17.1-dirty
Subject: [Xen-devel] [PATCH v2 03/11] xen: credit2: rework load tracking logic
List-Id: Xen developer discussion

The existing load tracking code was hard to understand and maintain, and
not entirely consistent.
This is due to a number of reasons:
 - code and comments were not in perfect sync, making it difficult to
   figure out what the intent of a particular choice was (e.g., the
   choice of 18 for load_window_shift);
 - the math, although effective, was not entirely consistent. In fact,
   we were doing (if W is the length of the window):

     avgload = (delta*load*W + (W - delta)*avgload)/W
     avgload = avgload + delta*load - delta*avgload/W

   which does not match any known variant of 'smoothed moving average'.
   In fact, it should have been:

     avgload = avgload + delta*load/W - delta*avgload/W

   (for details on why, see the doc comments inside this patch).
   Furthermore, with the old formula, as soon as delta reaches W:

     avgload ~= avgload + W*load - avgload
     avgload ~= W*load

The reason why the formula above sort of worked is that the number of
bits used for the fractional parts of the values used in fixed point
math, and the number of bits used for the length of the window, were
the same (load_window_shift was being used for both). This may look
handy, but it introduced a (not especially well documented) dependency
between the length of the window and the precision of the calculations,
which really should be two independent things. Especially since
treating them as such (like it is done in this patch) does not lead to
more complex math (same number of multiplications and shifts, and
there is still room for some optimization).

Therefore, in this patch, we:
 - split the length of the window and the precision (and, since there
   is already a command line parameter for the length of the window,
   introduce one for the precision too);
 - align the math with one proper incarnation of exponential smoothing
   (at no added cost);
 - add comments about the details of the algorithm and the math used.

While there, fix a couple of style issues as well (pointless
initialization, long lines, comments).
Signed-off-by: Dario Faggioli
Reviewed-by: George Dunlap
---
Changes from v1:
* reconciled comments and actual code about load_window_shift handling;
* added some more sanity checking for opt_load_window_shift;
* changes to trace records moved to next patch, as requested during
  review.
---
I've changed slightly more than just pushing the tracing related hunks
to another patch, so I'm not picking up George's Reviewed-by tag.
---
Cc: George Dunlap
Cc: Anshul Makkar
Cc: David Vrabel
---
 docs/misc/xen-command-line.markdown |   30 +++
 xen/common/sched_credit2.c          |  323 ++++++++++++++++++++++++++++++-----
 2 files changed, 308 insertions(+), 45 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 5500242..3a250cb 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -485,9 +485,39 @@ the address range the area should fall into.
 ### credit2\_balance\_under
 > `= <integer>`
 
+### credit2\_load\_precision\_shift
+> `= <integer>`
+
+> Default: `18`
+
+Specify the number of bits to use for the fractional part of the
+values involved in Credit2 load tracking and load balancing math.
+
 ### credit2\_load\_window\_shift
 > `= <integer>`
 
+> Default: `30`
+
+Specify the number of bits to use to represent the length of the
+window (in nanoseconds) we use for load tracking inside Credit2.
+This means that, with the default value (30), we use a
+2^30 nsec ~= 1 sec long window.
+
+Load tracking is done by means of a variation of exponentially
+weighted moving average (EWMA). The window length defined here
+is what tells for how long we give value to previous history
+of the load itself. In fact, after a full window has passed,
+what happens is that we discard all previous history entirely.
+
+A short window will make the load balancer quick at reacting
+to load changes, but also short-sighted about previous history
+(and hence, e.g., long term load trends).
+A long window will make the load balancer thoughtful of previous
+history (and hence capable of capturing, e.g., long term load
+trends), but also slow in responding to load changes.
+
+The default value of `1 sec` is rather long.
+
 ### credit2\_runqueue
 > `= core | socket | node | all`

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 6cb06e8..e695f1b 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -173,16 +173,86 @@ integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
 #define RQD(_ops, _cpu) (&CSCHED2_PRIV(_ops)->rqd[c2r(_ops, _cpu)])
 
 /*
- * Shifts for load average.
- * - granularity: Reduce granularity of time by a factor of 1000, so we
- *   can use 32-bit maths
- * - window shift: Given granularity shift, make the window about 1 second
- * - scale shift: Shift up load by this amount rather than using fractions;
- *   128 corresponds to a load of 1.
+ * Load tracking and load balancing
+ *
+ * Load history of runqueues and vcpus is accounted for by using an
+ * exponential weighted moving average algorithm. However, instead of using
+ * fractions, we shift everything to the left by the number of bits we want
+ * to use for representing the fractional part (Q-format).
+ *
+ * We may also want to reduce the precision of time accounting, to
+ * accommodate 'longer windows'. So, if that is the case, we just need to
+ * shift all time samples to the right.
+ *
+ * The details of the formulas used for load tracking are explained close to
+ * __update_runq_load(). Let's just say here that, with full nanosecond time
+ * granularity, a 30 bits wide 'decaying window' is ~1 second long.
+ *
+ * We want to consider the following equations:
+ *
+ *  avg[0] = load*P
+ *  avg[i+1] = avg[i] + delta*load*P/W - delta*avg[i]/W,  0 <= delta <= W
+ *
+ * where W is the length of the window, P the multiplier for transitioning
+ * into Q-format fixed point arithmetic and load is the instantaneous load
+ * of a runqueue, which basically is the number of runnable vcpus there are
+ * on the runqueue (for the meaning of the other terms, look at the doc
+ * comment to __update_runq_load()).
+ *
+ * So, again, with full nanosecond granularity, and 1 second window, we have:
+ *
+ *  W = 2^30
+ *  P = 2^18
+ *
+ * The maximum possible value for the average load, which we want to store in
+ * s_time_t type variables (i.e., we have 63 bits available) is load*P. This
+ * means that, with P 18 bits wide, load can occupy 45 bits. This, in turn,
+ * means we can have 2^45 vcpus in each runqueue, before overflow occurs!
+ *
+ * However, it can happen that, at step j+1, if:
+ *
+ *  avg[j] = load*P
+ *  delta = W
+ *
+ * then:
+ *
+ *  avg[j+1] = avg[j] + W*load*P/W - W*load*P/W
+ *
+ * So we must be able to deal with W*load*P. This means load can't be higher
+ * than:
+ *
+ *  2^(63 - 30 - 18) = 2^15 = 32768
+ *
+ * So 32768 is the maximum number of vcpus that we can have in a runqueue,
+ * at any given time, and still not have problems with the load tracking
+ * calculations... and this is more than fine.
+ *
+ * As a matter of fact, since we are using microseconds granularity, we have
+ * W=2^20. So, still with 18 fractional bits and a 1 second long window, there
+ * may be 2^25 = 33554432 vcpus in a runq before we have to start thinking
+ * about overflow.
+ */
-#define LOADAVG_GRANULARITY_SHIFT (10)
-static unsigned int __read_mostly opt_load_window_shift = 18;
-#define LOADAVG_WINDOW_SHIFT_MIN 4
+
+/* If >0, decreases the granularity of time samples used for load tracking. */
+#define LOADAVG_GRANULARITY_SHIFT (10)
+/* Time window during which we still give value to previous load history. */
+#define LOADAVG_WINDOW_SHIFT (30)
+/* 18 bits by default (and not less than 4) for decimals. */
+#define LOADAVG_PRECISION_SHIFT (18)
+#define LOADAVG_PRECISION_SHIFT_MIN (4)
+
+/*
+ * Both the length of the window and the number of fractional bits can be
+ * decided with boot parameters.
+ *
+ * The length of the window is always expressed in nanoseconds. The actual
+ * value used by default is LOADAVG_WINDOW_SHIFT - LOADAVG_GRANULARITY_SHIFT.
+ */
+static unsigned int __read_mostly opt_load_window_shift = LOADAVG_WINDOW_SHIFT;
 integer_param("credit2_load_window_shift", opt_load_window_shift);
+static unsigned int __read_mostly opt_load_precision_shift = LOADAVG_PRECISION_SHIFT;
+integer_param("credit2_load_precision_shift", opt_load_precision_shift);
+
 static int __read_mostly opt_underload_balance_tolerance = 0;
 integer_param("credit2_balance_under", opt_underload_balance_tolerance);
 static int __read_mostly opt_overload_balance_tolerance = -3;
@@ -279,6 +349,7 @@ struct csched2_private {
     cpumask_t active_queues; /* Queues which may have active cpus */
     struct csched2_runqueue_data rqd[NR_CPUS];
 
+    unsigned int load_precision_shift;
     unsigned int load_window_shift;
 };
@@ -387,19 +458,147 @@ __runq_elem(struct list_head *elem)
     return list_entry(elem, struct csched2_vcpu, runq_elem);
 }
 
+/*
+ * Track the runq load by gathering instantaneous load samples, and using
+ * exponentially weighted moving average (EWMA) for the 'decaying'.
+ *
+ * We consider a window of length W=2^(prv->load_window_shift) nsecs
+ * (which takes LOADAVG_GRANULARITY_SHIFT into account).
+ *
+ * If load is the instantaneous load, the formula for EWMA looks as follows,
+ * for the i-th sample:
+ *
+ *  avg[i] = a*load + (1 - a)*avg[i-1]
+ *
+ * where avg[i] is the new value of the average load, avg[i-1] is the value
+ * of the average load calculated so far, and a is a coefficient less than
+ * or equal to 1.
+ *
+ * So, for us, it becomes:
+ *
+ *  avgload = a*load + (1 - a)*avgload
+ *
+ * For determining a, we consider _when_ we are doing the load update, wrt
+ * the length of the window. We define delta as follows:
+ *
+ *  delta = t - load_last_update
+ *
+ * where t is current time (i.e., time at which we are both sampling and
+ * updating the load average) and load_last_update is the last time we did
+ * that.
+ *
+ * There are two possible situations:
+ *
+ * a) delta <= W
+ *    this means that, during the last window of length W, the runqueue load
+ *    was avgload for (W - delta) time, and load for delta time:
+ *
+ *        |----------- W ---------|
+ *        |                       |
+ *        |     load_last_update  t
+ *     -------------------------|---------|---
+ *        |          |                    |
+ *        \__W - delta__/\____delta______/
+ *        |          |                    |
+ *        |___avgload___|_____load_______|
+ *
+ *    So, what about using delta/W as our smoothing coefficient a. If we do,
+ *    here's what happens:
+ *
+ *     a = delta / W
+ *     1 - a = 1 - (delta / W) = (W - delta) / W
+ *
+ *    Which matches the above description of what happened in the last
+ *    window of length W.
+ *
+ *    Note that this also means that the weight that we assign to both the
+ *    latest load sample, and to previous history, varies at each update.
+ *    The longer the latest load sample has been in effect, within the last
+ *    window, the more it weighs (and the less the previous history weighs).
+ *
+ *    This is some sort of extension of plain EWMA to fit even better to our
+ *    use case.
+ *
+ * b) delta > W
+ *    this means more than a full window has passed since the last update:
+ *
+ *     |----------- W ---------|
+ *     |                       |
+ *            load_last_update t
+ *     ----|------------------------------|---
+ *         |                              |
+ *         \_________________delta________/
+ *
+ *    Basically, it means the last load sample has been in effect for more
+ *    than W time, and hence we should just use it, and forget everything
+ *    before that.
+ *
+ *    This can be seen as a 'reset condition', occurring when, for whatever
+ *    reason, load has not been updated for longer than we expected. (It is
+ *    also how avgload is assigned its first value.)
+ *
+ * The formula for avgload then becomes:
+ *
+ *  avgload = (delta/W)*load + (W - delta)*avgload/W
+ *  avgload = delta*load/W + W*avgload/W - delta*avgload/W
+ *  avgload = avgload + delta*load/W - delta*avgload/W
+ *
+ * So, final form is:
+ *
+ *  avgload_0 = load
+ *  avgload = avgload + delta*load/W - delta*avgload/W,  0<=delta<=W
+ *
+ * As a confirmation, let's look at the extremes, when delta is 0 (i.e.,
+ * what happens if we update the load twice, at the same time instant?):
+ *
+ *  avgload = avgload + 0*load/W - 0*avgload/W
+ *  avgload = avgload
+ *
+ * and when delta is W (i.e., what happens if we update at the last
+ * possible instant before the window 'expires'?):
+ *
+ *  avgload = avgload + W*load/W - W*avgload/W
+ *  avgload = avgload + load - avgload
+ *  avgload = load
+ *
+ * Which, in both cases, is what we expect.
+ */
 static void
 __update_runq_load(const struct scheduler *ops,
                    struct csched2_runqueue_data *rqd, int change, s_time_t now)
 {
     struct csched2_private *prv = CSCHED2_PRIV(ops);
-    s_time_t delta=-1;
+    s_time_t delta, load = rqd->load;
+    unsigned int P, W;
 
+    W = prv->load_window_shift;
+    P = prv->load_precision_shift;
     now >>= LOADAVG_GRANULARITY_SHIFT;
 
-    if ( rqd->load_last_update + (1ULL<<prv->load_window_shift) < now )
+    /*
+     * To avoid using fractions, we shift to the left by load_precision_shift,
+     * and use the least significant load_precision_shift bits as the
+     * fractional part.
+     * Looking back at the formula we want to use, we now have:
+     *
+     *  P = 2^(load_precision_shift)
+     *  P*avgload = P*(avgload + delta*load/W - delta*avgload/W)
+     *  P*avgload = P*avgload + delta*load*P/W - delta*P*avgload/W
+     *
+     * And if we are ok storing and using P*avgload, we can rewrite this as:
+     *
+     *  P*avgload = avgload'
+     *  avgload' = avgload' + delta*P*load/W - delta*avgload'/W
+     *
+     * Coupled with, of course:
+     *
+     *  avgload_0' = P*load
+     */
+    if ( rqd->load_last_update + (1ULL << W) < now )
     {
-        rqd->avgload = (unsigned long long)rqd->load << prv->load_window_shift;
-        rqd->b_avgload = (unsigned long long)rqd->load << prv->load_window_shift;
+        rqd->avgload = load << P;
+        rqd->b_avgload = load << P;
     }
     else
     {
@@ -411,17 +610,29 @@ __update_runq_load(const struct scheduler *ops,
             delta = 0;
         }
 
-        rqd->avgload =
-            ( ( delta * ( (unsigned long long)rqd->load << prv->load_window_shift ) )
-              + ( ((1ULL<<prv->load_window_shift) - delta) * rqd->avgload ) ) >> prv->load_window_shift;
-
-        rqd->b_avgload =
-            ( ( delta * ( (unsigned long long)rqd->load << prv->load_window_shift ) )
-              + ( ((1ULL<<prv->load_window_shift) - delta) * rqd->b_avgload ) ) >> prv->load_window_shift;
+        /*
+         * Note that, if we were to enforce (or check) some relationship
+         * between P and W, we may save one shift. E.g., if we are sure
+         * that P < W, we could write:
+         *
+         *  (delta * (load << P)) >> W
+         *
+         * as:
+         *
+         *  (delta * load) >> (W - P)
+         */
+        rqd->avgload = rqd->avgload +
+                       ((delta * (load << P)) >> W) -
+                       ((delta * rqd->avgload) >> W);
+        rqd->b_avgload = rqd->b_avgload +
+                         ((delta * (load << P)) >> W) -
+                         ((delta * rqd->b_avgload) >> W);
     }
     rqd->load += change;
     rqd->load_last_update = now;
 
+    ASSERT(rqd->avgload <= STIME_MAX && rqd->b_avgload <= STIME_MAX);
+
     {
         struct {
             unsigned rq_load:4, rq_avgload:28;
@@ -442,8 +653,8 @@ __update_svc_load(const struct scheduler *ops,
                   struct csched2_vcpu *svc, int change, s_time_t now)
 {
     struct csched2_private *prv = CSCHED2_PRIV(ops);
-    s_time_t delta=-1;
-    int vcpu_load;
+    s_time_t delta, vcpu_load;
+    unsigned int P, W;
 
     if ( change == -1 )
         vcpu_load = 1;
@@ -452,11 +663,13 @@ __update_svc_load(const struct scheduler *ops,
     else
         vcpu_load = vcpu_runnable(svc->vcpu);
 
+    W = prv->load_window_shift;
+    P = prv->load_precision_shift;
     now >>= LOADAVG_GRANULARITY_SHIFT;
 
-    if ( svc->load_last_update + (1ULL<<prv->load_window_shift) < now )
+    if ( svc->load_last_update + (1ULL << W) < now )
     {
-        svc->avgload = (unsigned long long)vcpu_load << prv->load_window_shift;
+        svc->avgload = vcpu_load << P;
     }
     else
     {
@@ -468,9 +681,9 @@ __update_svc_load(const struct scheduler *ops,
             delta = 0;
         }
 
-        svc->avgload =
-            ( ( delta * ( (unsigned long long)vcpu_load << prv->load_window_shift ) )
-              + ( ((1ULL<<prv->load_window_shift) - delta) * svc->avgload ) ) >> prv->load_window_shift;
+        svc->avgload = svc->avgload +
+                       ((delta * (vcpu_load << P)) >> W) -
+                       ((delta * svc->avgload) >> W);
     }
     svc->load_last_update = now;
@@ -903,7 +1116,7 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
         svc->credit = CSCHED2_CREDIT_INIT;
         svc->weight = svc->sdom->weight;
         /* Starting load of 50% */
-        svc->avgload = 1ULL << (CSCHED2_PRIV(ops)->load_window_shift - 1);
+        svc->avgload = 1ULL << (CSCHED2_PRIV(ops)->load_precision_shift - 1);
         svc->load_last_update = NOW() >> LOADAVG_GRANULARITY_SHIFT;
     }
     else
@@ -1152,7 +1365,7 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
     vcpu_schedule_unlock_irq(lock, vc);
 }
 
-#define MAX_LOAD (1ULL<<60);
+#define MAX_LOAD (STIME_MAX);
 static int
 csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 {
@@ -1446,15 +1659,19 @@ retry:
         if ( i > cpus_max )
             cpus_max = i;
 
-        /* If we're under 100% capacaty, only shift if load difference
-         * is > 1.  otherwise, shift if under 12.5% */
-        if ( load_max < (1ULL<<(prv->load_window_shift))*cpus_max )
+        /*
+         * If we're under 100% capacity, only shift if load difference
+         * is > 1.  Otherwise, shift if under 12.5%.
+         */
+        if ( load_max < (cpus_max << prv->load_precision_shift) )
         {
-            if ( st.load_delta < (1ULL<<(prv->load_window_shift+opt_underload_balance_tolerance) ) )
+            if ( st.load_delta < (1ULL << (prv->load_precision_shift +
+                                           opt_underload_balance_tolerance)) )
                 goto out;
        }
        else
-            if ( st.load_delta < (1ULL<<(prv->load_window_shift+opt_overload_balance_tolerance)) )
+            if ( st.load_delta < (1ULL << (prv->load_precision_shift +
+                                           opt_overload_balance_tolerance)) )
                 goto out;
     }
@@ -1962,7 +2179,7 @@ csched2_schedule(
 }
 
 static void
-csched2_dump_vcpu(struct csched2_vcpu *svc)
+csched2_dump_vcpu(struct csched2_private *prv, struct csched2_vcpu *svc)
 {
     printk("[%i.%i] flags=%x cpu=%i",
             svc->vcpu->domain->domain_id,
@@ -1972,6 +2189,9 @@ csched2_dump_vcpu(struct csched2_vcpu *svc)
     printk(" credit=%" PRIi32" [w=%u]",
             svc->credit,
             svc->weight);
 
+    printk(" load=%"PRI_stime" (~%"PRI_stime"%%)", svc->avgload,
+           (svc->avgload * 100) >> prv->load_precision_shift);
+
     printk("\n");
 }
@@ -2009,7 +2229,7 @@ csched2_dump_pcpu(const struct scheduler *ops, int cpu)
     if ( svc )
     {
         printk("\trun: ");
-        csched2_dump_vcpu(svc);
+        csched2_dump_vcpu(prv, svc);
     }
 
     loop = 0;
@@ -2019,7 +2239,7 @@ csched2_dump_pcpu(const struct scheduler *ops, int cpu)
         if ( svc )
         {
             printk("\t%3d: ", ++loop);
-            csched2_dump_vcpu(svc);
+            csched2_dump_vcpu(prv, svc);
         }
     }
@@ -2048,8 +2268,8 @@ csched2_dump(const struct scheduler *ops)
     for_each_cpu(i, &prv->active_queues)
     {
         s_time_t fraction;
-
-        fraction = prv->rqd[i].avgload * 100 / (1ULL<<prv->load_window_shift);
+
+        fraction = (prv->rqd[i].avgload * 100) >> prv->load_precision_shift;
 
         cpulist_scnprintf(cpustr, sizeof(cpustr), &prv->rqd[i].active);
         printk("Runqueue %d:\n"
@@ -2057,12 +2277,13 @@ csched2_dump(const struct scheduler *ops)
                "\tcpus               = %s\n"
                "\tmax_weight         = %d\n"
                "\tinstload           = %d\n"
-               "\taveload            = %3"PRI_stime"\n",
+               "\taveload            = %"PRI_stime" (~%"PRI_stime"%%)\n",
                i,
                cpumask_weight(&prv->rqd[i].active),
                cpustr,
                prv->rqd[i].max_weight,
                prv->rqd[i].load,
+               prv->rqd[i].avgload,
                fraction);
 
         cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->rqd[i].idle);
@@ -2093,7 +2314,7 @@ csched2_dump(const struct scheduler *ops)
             lock = vcpu_schedule_lock(svc->vcpu);
             printk("\t%3d: ", ++loop);
-            csched2_dump_vcpu(svc);
+            csched2_dump_vcpu(prv, svc);
             vcpu_schedule_unlock(lock, svc->vcpu);
         }
@@ -2354,17 +2575,27 @@ csched2_init(struct scheduler *ops)
            " WARNING: This is experimental software in development.\n" \
            " Use at your own risk.\n");
 
+    printk(" load_precision_shift: %d\n", opt_load_precision_shift);
     printk(" load_window_shift: %d\n", opt_load_window_shift);
     printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance);
     printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance);
     printk(" runqueues arrangement: %s\n", opt_runqueue_str[opt_runqueue]);
 
-    if ( opt_load_window_shift < LOADAVG_WINDOW_SHIFT_MIN )
+    if ( opt_load_precision_shift < LOADAVG_PRECISION_SHIFT_MIN )
+    {
+        printk("WARNING: %s: opt_load_precision_shift %d below min %d, resetting\n",
+               __func__, opt_load_precision_shift, LOADAVG_PRECISION_SHIFT_MIN);
+        opt_load_precision_shift = LOADAVG_PRECISION_SHIFT_MIN;
+    }
+
+    if ( opt_load_window_shift <= LOADAVG_GRANULARITY_SHIFT )
     {
-        printk("%s: opt_load_window_shift %d below min %d, resetting\n",
-               __func__, opt_load_window_shift, LOADAVG_WINDOW_SHIFT_MIN);
-        opt_load_window_shift = LOADAVG_WINDOW_SHIFT_MIN;
+        printk("WARNING: %s: opt_load_window_shift %d too short, resetting\n",
+               __func__, opt_load_window_shift);
+        opt_load_window_shift = LOADAVG_WINDOW_SHIFT;
     }
+    printk(XENLOG_INFO "load tracking window length %llu ns\n",
+           1ULL << opt_load_window_shift);
 
     /* Basically no CPU information is available at this point; just
      * set up basic structures, and a callback when the CPU info is
@@ -2385,7 +2616,9 @@ csched2_init(struct scheduler *ops)
         prv->rqd[i].id = -1;
     }
 
-    prv->load_window_shift = opt_load_window_shift;
+    prv->load_precision_shift = opt_load_precision_shift;
+    prv->load_window_shift = opt_load_window_shift - LOADAVG_GRANULARITY_SHIFT;
+    ASSERT(opt_load_window_shift > 0);
 
     return 0;
 }