diff mbox

[v2,1/1] ratelimit: Implement rate limit for credit2 scheduler Rate limit assures that a vcpu will execute for a minimum amount of time before being put at the back of a queue or being preempted by higher priority thread.

Message ID 1468844544-2229-1-git-send-email-anshul.makkar@citrix.com (mailing list archive)
State New, archived
Headers show

Commit Message

Anshul Makkar July 18, 2016, 12:22 p.m. UTC
It introduces a minimum amount of latency to enable a VM to batch its work and
it also ensures that system is not spending most of its time in
VMEXIT/VMENTRY because of VM that is waking/sleeping at high rate.

ratelimit can be disabled by setting it to 0.

Signed-off-by: Anshul Makkar <anshul.makkar@citrix.com>
---
Changes in v2:
 * algo for time slice calculation based on ratelimit_us has changed.
 * initial value of prv->ratelimit_us.
 * other changes based on review comments.
---
 xen/common/sched_credit2.c | 124 +++++++++++++++++++++++++++++++++------------
 1 file changed, 93 insertions(+), 31 deletions(-)

Comments

Dario Faggioli July 22, 2016, 10:36 a.m. UTC | #1
On Mon, 2016-07-18 at 13:22 +0100, Anshul Makkar wrote:
>
Hey, Anshul.

Thanks, and sorry for the delay in reviewing.

This version is an improvement, but it looks to me that you've missed a
few of the review comments to v1.

> It introduces a minimum amount of latency 
>
"introduces context-switch rate-limiting"

> to enable a VM to batch its work and
> it also ensures that system is not spending most of its time in
> VMEXIT/VMENTRY because of VM that is waking/sleeping at high rate.
> 
> ratelimit can be disabled by setting it to 0.
> 

> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 8b95a47..68bcdb8 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c

> @@ -1601,6 +1602,34 @@ csched2_dom_cntl(
>      return rc;
>  }
>  
> +static int csched2_sys_cntl(const struct scheduler *ops,
> +                            struct xen_sysctl_scheduler_op *sc)
> +{
> +    int rc = -EINVAL;
> +    xen_sysctl_credit_schedule_t *params = &sc->u.sched_credit;
> +    struct csched2_private *prv = CSCHED2_PRIV(ops);
> +    unsigned long flags;
> +
> +    switch (sc->cmd )
> +    {
> +        case XEN_SYSCTL_SCHEDOP_putinfo:
> +            if ( params->ratelimit_us &&
> +                ( params->ratelimit_us < CSCHED2_MIN_TIMER ||
> +                  params->ratelimit_us >
> MICROSECS(CSCHED2_MAX_TIMER) ))
>
CSCHED2_MIN_TIMER and CSCHED2_MAX_TIMER are defined as follows:

#define CSCHED2_MIN_TIMER            MICROSECS(500)
#define CSCHED2_MAX_TIMER            MILLISECS(2)

Which basically means they're value is 500*1000=500000 and
2*1000000=2000000.

ratelimit_us is specified assuming it will be in microseconds. So,
basically, you're forcing the value stored in ratelimit_us to be at
least 500000, which means 500000 microseconds, which means 500
milliseconds, which is not what we want!

I remember saying already (although, it may have be in pvt, not on this
list) that I think we should just use XEN_SYSCTL_SCHED_RATELIMIT_MAX
and XEN_SYSCTL_SCHED_RATELIMIT_MIN here.

CSCHED2_MIN_TIMER and CSCHED2_MAX_TIMER are internal implementation
details, and I don't like them exposed (although, indirectly) to the
user.

> +                return rc;
> +            spin_lock_irqsave(&prv->lock, flags);
> +            prv->ratelimit_us = params->ratelimit_us;
> +            spin_unlock_irqrestore(&prv->lock, flags);
> +            break;
> +
This is ok. However, the code base changed in the meanwhile (sorry! :-
P), and this spin_lock_irqsave() needs to become a
write_lock_irqsave().

> @@ -1688,9 +1719,20 @@ csched2_runtime(const struct scheduler *ops,
> int cpu, struct csched2_vcpu *snext
>       * 1) Run until snext's credit will be 0
>       * 2) But if someone is waiting, run until snext's credit is
> equal
>       * to his
> -     * 3) But never run longer than MAX_TIMER or shorter than
> MIN_TIMER.
> +     * 3) But never run longer than MAX_TIMER or shorter than
> MIN_TIMER or
> +     * run for ratelimit time.
>       */
> 
I prefer George's version of this comment change:

"3) But never run longer than MAX_TIMER or shorter than MIN_TIMER
or the ratelimit time."

>  
> +    /* Calculate mintime */
> +    min_time = CSCHED2_MIN_TIMER;
> +    if ( prv->ratelimit_us ) {
>
Coding style. (Parenthesis goes on next line.)

> +        s_time_t ratelimit_min = prv->ratelimit_us;
> +        ratelimit_min = snext->vcpu->runstate.state_entry_time +
> +            MICROSECS(prv->ratelimit_us) - now;
>
Mmm... if you wanted to implement my suggestion from
<1468400021.13039.33.camel@citrix.com>, you're definitely missing
something:

     s_time_t ratelimit_min = prv->ratelimit_us;
     if ( snext->vcpu->is_running )
         ratelimit_min = snext->vcpu->runstate.state_entry_time +
                         MICROSECS(prv->ratelimit_us) - now;

In fact, you're initializing ratelimit_min and then immediately
overriding that... I'm surprised the compiler didn't complain.

> +    if ( ratelimit_min > min_time )
> +        min_time = ratelimit_min;
> +    }
> +

> @@ -1707,32 +1749,33 @@ csched2_runtime(const struct scheduler *ops,
> int cpu, struct csched2_vcpu *snext
>          }
>      }
>  
> -    /* The next guy may actually have a higher credit, if we've
> tried to
> -     * avoid migrating him from a different cpu.  DTRT.  */
> -    if ( rt_credit <= 0 )
> +    /*
> +     * The next guy ont the runqueue may actually have a higher
> credit,
> +     * if we've tried to avoid migrating him from a different cpu.
> +     * Setting time=0 will ensure the minimum timeslice is chosen.
>
George's draft patch had an empty line within this comment in here,
separating the two paragraph. Can we keep it? (I know, this is a very
minor thing, but since we're here :-D)

> +     * FIXME: See if we can eliminate this conversion if we know
> time
> +     * will be outside (MIN,MAX).  Probably requires pre-calculating
> +     * credit values of MIN,MAX per vcpu, since each vcpu burns
> credit
> +     * at a different rate.
> +     */
> +    if (rt_credit > 0)
> +        time = c2t(rqd, rt_credit, snext);
> +    else
> +        time = 0;
> +
> +    /*
> +     * Never run longer than MAX_TIMER or less than MIN_TIMER or for
> +     * rate_limit time.
> +     */
>
    /* 
     * 3) But never run longer than MAX_TIMER or shorter than MIN_TIMER
     * or the ratelimit time.
     */

i.e., fix the style of the comment (wrt George's patch) by putting the
"wings" there, but please, keep the "3)", and the fact that it
basically mirrors the big comment at the beginning of the function.

> @@ -1746,7 +1789,7 @@ void __dump_execstate(void *unused);
>  static struct csched2_vcpu *
>  runq_candidate(struct csched2_runqueue_data *rqd,
>                 struct csched2_vcpu *scurr,
> -               int cpu, s_time_t now)
> +               int cpu, s_time_t now, struct csched2_private *prv)
>
Reviewing v1, George said this:

  Since we have the cpu, I think we can get ops this way, without
  cluttering things up with the extra argument:

      struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));

> @@ -1775,9 +1829,13 @@ runq_candidate(struct csched2_runqueue_data
> *rqd,
>          }
>  
>          /* If the next one on the list has more credit than current
> -         * (or idle, if current is not runnable), choose it. */
> +         * (or idle, if current is not runnable) and current one has
> already
> +         * executed for more than ratelimit. choose it.
> +         * Control has reached here means that current vcpu has
> executed >
> +         * ratelimit_us or ratelimit is off, so chose the next one.
> +         */
>          if ( svc->credit > snext->credit )
> -            snext = svc;
> +                snext = svc;
>  
Both me and George agreed that changing the comment like this is not
helping much and should not be done.

You also (pointed out already as well) don't need to touch the
'snext = svc' line.

Regards,
Dario
Anshul Makkar July 25, 2016, 2:36 p.m. UTC | #2
On 22/07/16 11:36, Dario Faggioli wrote:
> On Mon, 2016-07-18 at 13:22 +0100, Anshul Makkar wrote:
>>
> Hey, Anshul.
>
> Thanks, and sorry for the delay in reviewing.
>
> This version is an improvement, but it looks to me that you've missed a
> few of the review comments to v1.
Sorry about that. !!
>> It introduces a minimum amount of latency
>>
> "introduces context-switch rate-limiting"
>> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
>> index 8b95a47..68bcdb8 100644
>> --- a/xen/common/sched_credit2.c
>> +++ b/xen/common/sched_credit2.c
>>
>> @@ -1601,6 +1602,34 @@ csched2_dom_cntl(
>> +    switch (sc->cmd )
>> +    {
>> +        case XEN_SYSCTL_SCHEDOP_putinfo:
>> +            if ( params->ratelimit_us &&
>> +                ( params->ratelimit_us < CSCHED2_MIN_TIMER ||
>> +                  params->ratelimit_us >
> I remember saying already (although, it may have be in pvt, not on this
> list) that I think we should just use XEN_SYSCTL_SCHED_RATELIMIT_MAX
> and XEN_SYSCTL_SCHED_RATELIMIT_MIN here.
>
> CSCHED2_MIN_TIMER and CSCHED2_MAX_TIMER are internal implementation
> details, and I don't like them exposed (although, indirectly) to the
> user.
addressed.
>> +                return rc;
>> +            spin_lock_irqsave(&prv->lock, flags);
>> +
> This is ok. However, the code base changed in the meanwhile (sorry! :-
> P), and this spin_lock_irqsave() needs to become a
> write_lock_irqsave().
done.
>
> Mmm... if you wanted to implement my suggestion from
> <1468400021.13039.33.camel@citrix.com>, you're definitely missing
> something:
>
>       s_time_t ratelimit_min = prv->ratelimit_us;
>       if ( snext->vcpu->is_running )
>           ratelimit_min = snext->vcpu->runstate.state_entry_time +
>                           MICROSECS(prv->ratelimit_us) - now;
>
yes, missed the if part for checking if the vcpu is currently running.
> In fact, you're initializing ratelimit_min and then immediately
> overriding that... I'm surprised the compiler didn't complain.
>
>> +    if ( ratelimit_min > min_time )
>> +        min_time = ratelimit_min;
>> +    }
>> +
>
>> @@ -1707,32 +1749,33 @@ csched2_runtime(const struct scheduler *ops,
>> int cpu, struct csched2_vcpu *snext
>>           }
>>       }
>>
>
>> @@ -1746,7 +1789,7 @@ void __dump_execstate(void *unused);
>>   static struct csched2_vcpu *
>>   runq_candidate(struct csched2_runqueue_data *rqd,
>>                  struct csched2_vcpu *scurr,
>> -               int cpu, s_time_t now)
>> +               int cpu, s_time_t now, struct csched2_private *prv)
>>
> Reviewing v1, George said this:
>
>    Since we have the cpu, I think we can get ops this way, without
>    cluttering things up with the extra argument:
>
>        struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
yes, missed that change too. Addressed in v3.
>
>> @@ -1775,9 +1829,13 @@ runq_candidate(struct csched2_runqueue_data
>> *rqd,
>>           }
>>
>>           /* If the next one on the list has more credit than current
>> -         * (or idle, if current is not runnable), choose it. */
>> +         * (or idle, if current is not runnable) and current one has
>> already
>> +         * executed for more than ratelimit. choose it.
>> +         * Control has reached here means that current vcpu has
>> executed >
>> +         * ratelimit_us or ratelimit is off, so chose the next one.
>> +         */
>>           if ( svc->credit > snext->credit )
>> -            snext = svc;
>> +                snext = svc;
>>
> Both me and George agreed that changing the comment like this is not
> helping much and should not be done.
Though, I find the extended comment useful, but if you don't agree I 
will remove it v3.

>
> Regards,
> Dario
>
Anshul
Dario Faggioli July 26, 2016, 10:50 a.m. UTC | #3
On Mon, 2016-07-25 at 15:36 +0100, anshul makkar wrote:
> On 22/07/16 11:36, Dario Faggioli wrote:
> > 
> > This version is an improvement, but it looks to me that you've
> > missed a
> > few of the review comments to v1.
>
> Sorry about that. !!
>
NP. It's indeed something quite important... but it happens from time
to time. :-)

> @@ -1775,9 +1829,13 @@ runq_candidate(struct
> > > csched2_runqueue_data
> > > *rqd,
> > >           }
> > > 
> > >           /* If the next one on the list has more credit than
> > > current
> > > -         * (or idle, if current is not runnable), choose it. */
> > > +         * (or idle, if current is not runnable) and current one
> > > has
> > > already
> > > +         * executed for more than ratelimit. choose it.
> > > +         * Control has reached here means that current vcpu has
> > > executed >
> > > +         * ratelimit_us or ratelimit is off, so chose the next
> > > one.
> > > +         */
> > >           if ( svc->credit > snext->credit )
> > > -            snext = svc;
> > > +                snext = svc;
> > > 
> > Both me and George agreed that changing the comment like this is
> > not
> > helping much and should not be done.
> Though, I find the extended comment useful, but if you don't agree I 
> will remove it v3.
> 
So, I just wanted to reply to this, then I'll go looking at v3.

Things like this are, up to a rather high extent, a matter of taste.
And I see why you think you're adding useful information, and I'm not
saying that is completely untrue.

So, I'm explaining why I'm asking you to remove this, and then I'll go
looking at v3. :-)  As said, it's not that what you add is completely
useless. But that particular piece of code the comment is referring to
has nothing to do with ratelimiting, and I thus just don't think it is
useful to have its comment talking about ratelimit.

Actually, it may be confusing. In fact, if I'm reading the code trying
to focus on something that is not ratelimit, a comment like this risks
driving my focus away, and forcing me to think what ratelimiting has to
do with this.

In fact, there are two other possible reasons why we may not get to
this point of the loop, and each of them:
 - has its own comment where it is actually checked for/enforced;
 - does not have its own comment repeated here (e.g., like, for
   affinity, "Control has reached here means that svc can run on this 
   cpu.").

So, the ratelimit related aspects should be put in a comment close to
the code that checks and enforces ratelimiting, which is there already,
in your patch, where they will be:
 1. helpful to people trying to understand rateliminting;
 2. out of the way of people reasoning on anything else than 
    ratelimiting.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
diff mbox

Patch

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 8b95a47..68bcdb8 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -280,6 +280,7 @@  struct csched2_private {
     struct csched2_runqueue_data rqd[NR_CPUS];
 
     unsigned int load_window_shift;
+    unsigned ratelimit_us; /* each cpupool can have its onw ratelimit */
 };
 
 /*
@@ -1601,6 +1602,34 @@  csched2_dom_cntl(
     return rc;
 }
 
+static int csched2_sys_cntl(const struct scheduler *ops,
+                            struct xen_sysctl_scheduler_op *sc)
+{
+    int rc = -EINVAL;
+    xen_sysctl_credit_schedule_t *params = &sc->u.sched_credit;
+    struct csched2_private *prv = CSCHED2_PRIV(ops);
+    unsigned long flags;
+
+    switch (sc->cmd )
+    {
+        case XEN_SYSCTL_SCHEDOP_putinfo:
+            if ( params->ratelimit_us &&
+                ( params->ratelimit_us < CSCHED2_MIN_TIMER ||
+                  params->ratelimit_us > MICROSECS(CSCHED2_MAX_TIMER) ))
+                return rc;
+            spin_lock_irqsave(&prv->lock, flags);
+            prv->ratelimit_us = params->ratelimit_us;
+            spin_unlock_irqrestore(&prv->lock, flags);
+            break;
+
+        case XEN_SYSCTL_SCHEDOP_getinfo:
+            params->ratelimit_us = prv->ratelimit_us;
+            rc = 0;
+            break;
+    }
+    return rc;
+}
+
 static void *
 csched2_alloc_domdata(const struct scheduler *ops, struct domain *dom)
 {
@@ -1670,12 +1699,14 @@  csched2_dom_destroy(const struct scheduler *ops, struct domain *dom)
 
 /* How long should we let this vcpu run for? */
 static s_time_t
-csched2_runtime(const struct scheduler *ops, int cpu, struct csched2_vcpu *snext)
+csched2_runtime(const struct scheduler *ops, int cpu,
+                struct csched2_vcpu *snext, s_time_t now)
 {
-    s_time_t time; 
+    s_time_t time, min_time;
     int rt_credit; /* Proposed runtime measured in credits */
     struct csched2_runqueue_data *rqd = RQD(ops, cpu);
     struct list_head *runq = &rqd->runq;
+    struct csched2_private *prv = CSCHED2_PRIV(ops);
 
     /*
      * If we're idle, just stay so. Others (or external events)
@@ -1688,9 +1719,20 @@  csched2_runtime(const struct scheduler *ops, int cpu, struct csched2_vcpu *snext
      * 1) Run until snext's credit will be 0
      * 2) But if someone is waiting, run until snext's credit is equal
      * to his
-     * 3) But never run longer than MAX_TIMER or shorter than MIN_TIMER.
+     * 3) But never run longer than MAX_TIMER or shorter than MIN_TIMER or
+     * run for ratelimit time.
      */
 
+    /* Calculate mintime */
+    min_time = CSCHED2_MIN_TIMER;
+    if ( prv->ratelimit_us ) {
+        s_time_t ratelimit_min = prv->ratelimit_us;
+        ratelimit_min = snext->vcpu->runstate.state_entry_time +
+            MICROSECS(prv->ratelimit_us) - now;
+    if ( ratelimit_min > min_time )
+        min_time = ratelimit_min;
+    }
+
     /* 1) Basic time: Run until credit is 0. */
     rt_credit = snext->credit;
 
@@ -1707,32 +1749,33 @@  csched2_runtime(const struct scheduler *ops, int cpu, struct csched2_vcpu *snext
         }
     }
 
-    /* The next guy may actually have a higher credit, if we've tried to
-     * avoid migrating him from a different cpu.  DTRT.  */
-    if ( rt_credit <= 0 )
+    /*
+     * The next guy ont the runqueue may actually have a higher credit,
+     * if we've tried to avoid migrating him from a different cpu.
+     * Setting time=0 will ensure the minimum timeslice is chosen.
+     * FIXME: See if we can eliminate this conversion if we know time
+     * will be outside (MIN,MAX).  Probably requires pre-calculating
+     * credit values of MIN,MAX per vcpu, since each vcpu burns credit
+     * at a different rate.
+     */
+    if (rt_credit > 0)
+        time = c2t(rqd, rt_credit, snext);
+    else
+        time = 0;
+
+    /*
+     * Never run longer than MAX_TIMER or less than MIN_TIMER or for
+     * rate_limit time.
+     */
+    if ( time < min_time)
     {
-        time = CSCHED2_MIN_TIMER;
-        SCHED_STAT_CRANK(runtime_min_timer);
+       time = min_time;
+       SCHED_STAT_CRANK(runtime_min_timer);
     }
-    else
+    else if (time > CSCHED2_MAX_TIMER)
     {
-        /* FIXME: See if we can eliminate this conversion if we know time
-         * will be outside (MIN,MAX).  Probably requires pre-calculating
-         * credit values of MIN,MAX per vcpu, since each vcpu burns credit
-         * at a different rate. */
-        time = c2t(rqd, rt_credit, snext);
-
-        /* Check limits */
-        if ( time < CSCHED2_MIN_TIMER )
-        {
-            time = CSCHED2_MIN_TIMER;
-            SCHED_STAT_CRANK(runtime_min_timer);
-        }
-        else if ( time > CSCHED2_MAX_TIMER )
-        {
-            time = CSCHED2_MAX_TIMER;
-            SCHED_STAT_CRANK(runtime_max_timer);
-        }
+        time = CSCHED2_MAX_TIMER;
+        SCHED_STAT_CRANK(runtime_max_timer);
     }
 
     return time;
@@ -1746,7 +1789,7 @@  void __dump_execstate(void *unused);
 static struct csched2_vcpu *
 runq_candidate(struct csched2_runqueue_data *rqd,
                struct csched2_vcpu *scurr,
-               int cpu, s_time_t now)
+               int cpu, s_time_t now, struct csched2_private *prv)
 {
     struct list_head *iter;
     struct csched2_vcpu *snext = NULL;
@@ -1757,6 +1800,17 @@  runq_candidate(struct csched2_runqueue_data *rqd,
     else
         snext = CSCHED2_VCPU(idle_vcpu[cpu]);
 
+    /*
+     * Return the current vcpu if it has executed for less than ratelimit.
+     * Adjuststment for the selected vcpu's credit and decision
+     * for how long it will run will be taken in csched2_runtime.
+     */
+    if ( prv->ratelimit_us && !is_idle_vcpu(scurr->vcpu) &&
+         vcpu_runnable(scurr->vcpu) &&
+         (now - scurr->vcpu->runstate.state_entry_time) <
+          MICROSECS(prv->ratelimit_us) )
+        return scurr;
+
     list_for_each( iter, &rqd->runq )
     {
         struct csched2_vcpu * svc = list_entry(iter, struct csched2_vcpu, runq_elem);
@@ -1775,9 +1829,13 @@  runq_candidate(struct csched2_runqueue_data *rqd,
         }
 
         /* If the next one on the list has more credit than current
-         * (or idle, if current is not runnable), choose it. */
+         * (or idle, if current is not runnable) and current one has already
+         * executed for more than ratelimit. choose it.
+         * Control has reached here means that current vcpu has executed >
+         * ratelimit_us or ratelimit is off, so chose the next one.
+         */
         if ( svc->credit > snext->credit )
-            snext = svc;
+                snext = svc;
 
         /* In any case, if we got this far, break. */
         break;
@@ -1800,6 +1858,7 @@  csched2_schedule(
     struct csched2_vcpu * const scurr = CSCHED2_VCPU(current);
     struct csched2_vcpu *snext = NULL;
     struct task_slice ret;
+    struct csched2_private *prv = CSCHED2_PRIV(ops);
 
     SCHED_STAT_CRANK(schedule);
     CSCHED2_VCPU_CHECK(current);
@@ -1870,7 +1929,7 @@  csched2_schedule(
         snext = CSCHED2_VCPU(idle_vcpu[cpu]);
     }
     else
-        snext=runq_candidate(rqd, scurr, cpu, now);
+        snext=runq_candidate(rqd, scurr, cpu, now, prv);
 
     /* If switching from a non-idle runnable vcpu, put it
      * back on the runqueue. */
@@ -1934,7 +1993,7 @@  csched2_schedule(
     /*
      * Return task to run next...
      */
-    ret.time = csched2_runtime(ops, cpu, snext);
+    ret.time = csched2_runtime(ops, cpu, snext, now);
     ret.task = snext->vcpu;
 
     CSCHED2_VCPU_CHECK(ret.task);
@@ -2366,6 +2425,8 @@  csched2_init(struct scheduler *ops)
         prv->runq_map[i] = -1;
         prv->rqd[i].id = -1;
     }
+    /* initialize ratelimit */
+    prv->ratelimit_us = sched_ratelimit_us;
 
     prv->load_window_shift = opt_load_window_shift;
 
@@ -2398,6 +2459,7 @@  static const struct scheduler sched_credit2_def = {
     .wake           = csched2_vcpu_wake,
 
     .adjust         = csched2_dom_cntl,
+    .adjust_global  = csched2_sys_cntl,
 
     .pick_cpu       = csched2_cpu_pick,
     .migrate        = csched2_vcpu_migrate,