
[v2,1/2] credit: Limit load balancing to once per millisecond

Message ID 20230921122352.2307574-1-george.dunlap@cloud.com (mailing list archive)
State New, archived
Series [v2,1/2] credit: Limit load balancing to once per millisecond

Commit Message

George Dunlap Sept. 21, 2023, 12:23 p.m. UTC
The credit scheduler tries as hard as it can to ensure that it always
runs scheduling units with positive credit (PRI_TS_UNDER) before
running those with negative credit (PRI_TS_OVER).  If the next
runnable scheduling unit is of priority OVER, it will always run the
load balancer, which will scour the system looking for another
scheduling unit of UNDER priority.

Unfortunately, as the number of cores on a system has grown, the cost
of the work-stealing algorithm has dramatically increased; a recent
trace on a system with 128 cores showed this taking over 50
microseconds.

Add a parameter, load_balance_ratelimit, to limit the frequency of
load balance operations on a given pcpu.  Default this to 1
millisecond.

Invert the load balancing conditional to make it clearer, and to line
it up more closely with the comment above it.

Overall it might be cleaner to have the last_load_balance checking
happen inside csched_load_balance(), but that would require either
passing both now and spc into the function or looking them up again,
both of which seemed worse than simply checking and setting the
values before calling it.
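
For illustration only, a sketch of that rejected alternative, with the
extra parameters (hypothetical here) passed through explicitly:

    /* Hypothetical: ratelimit check moved inside csched_load_balance(). */
    static struct csched_unit *
    csched_load_balance(struct csched_private *prv, int cpu,
                        struct csched_unit *snext, bool *stolen,
                        s_time_t now, struct csched_pcpu *spc)
    {
        if ( now - spc->last_load_balance <= prv->load_balance_ratelimit )
        {
            /* Ratelimited: behave as if no better work was found. */
            __runq_remove(snext);
            return snext;
        }
        spc->last_load_balance = now;

        /* ... the existing work-stealing logic would follow here ... */
        return snext;
    }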

On a system with a vcpu:pcpu ratio of 2:1, running Windows guests
(which will end up calling YIELD during spinlock contention), this
patch increased performance significantly.

Signed-off-by: George Dunlap <george.dunlap@cloud.com>
---
Changes since v1:
- Fix editing mistake in commit message
- Improve documentation
- global var is __ro_after_init
- Remove sysctl, as it's not used.  Define max value in credit.c.
- Fix some style issues
- Move comment tweak to the right patch
- In the event that the commandline-parameter value is too high, clip
  to the maximum value rather than setting to the default.

CC: Dario Faggioli <dfaggioli@suse.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
CC: George Dunlap <george.dunlap@citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Julien Grall <julien@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Wei Liu <wl@xen.org>
---
 docs/misc/xen-command-line.pandoc |  8 ++++++
 xen/common/sched/credit.c         | 47 +++++++++++++++++++++++++------
 2 files changed, 46 insertions(+), 9 deletions(-)

Comments

Jürgen Groß Sept. 21, 2023, 1:03 p.m. UTC | #1
On 21.09.23 14:23, George Dunlap wrote:
> [...]

Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen
Jürgen Groß Sept. 21, 2023, 1:12 p.m. UTC | #2
On 21.09.23 14:23, George Dunlap wrote:
> [...]
> -        if ( snext->pri > CSCHED_PRI_TS_OVER )
> -            __runq_remove(snext);
> -        else
> +        if ( snext->pri <= CSCHED_PRI_TS_OVER
> +             && now - spc->last_load_balance > prv->load_balance_ratelimit) {

^ Just found a style issue (after Andrew pointing out the ones in patch 2).

My R-b stands with that fixed.


Juergen
Henry Wang Sept. 22, 2023, 1:30 a.m. UTC | #3
Hi George,

> On Sep 21, 2023, at 20:23, George Dunlap <george.dunlap@cloud.com> wrote:
> [...]

Release-acked-by: Henry Wang <Henry.Wang@arm.com>

Kind regards,
Henry
George Dunlap Sept. 22, 2023, 12:20 p.m. UTC | #4
On Thu, Sep 21, 2023 at 2:12 PM Juergen Gross <jgross@suse.com> wrote:
>
> On 21.09.23 14:23, George Dunlap wrote:
> > [...]
> > -        if ( snext->pri > CSCHED_PRI_TS_OVER )
> > -            __runq_remove(snext);
> > -        else
> > +        if ( snext->pri <= CSCHED_PRI_TS_OVER
> > +             && now - spc->last_load_balance > prv->load_balance_ratelimit) {
>
> ^ Just found a style issue (after Andrew pointing out the ones in patch 2).

Just checking, you mean the space before the closing `)`?

 -George
Jürgen Groß Sept. 22, 2023, 12:25 p.m. UTC | #5
On 22.09.23 14:20, George Dunlap wrote:
> On Thu, Sep 21, 2023 at 2:12 PM Juergen Gross <jgross@suse.com> wrote:
>>
>> On 21.09.23 14:23, George Dunlap wrote:
>>> [...]
>>> -        if ( snext->pri > CSCHED_PRI_TS_OVER )
>>> -            __runq_remove(snext);
>>> -        else
>>> +        if ( snext->pri <= CSCHED_PRI_TS_OVER
>>> +             && now - spc->last_load_balance > prv->load_balance_ratelimit) {
>>
>> ^ Just found a style issue (after Andrew pointing out the ones in patch 2).
> 
> Just checking, you mean the space before the closing `)`?

Yes.


Juergen
Jan Beulich Sept. 22, 2023, 1:49 p.m. UTC | #6
On 22.09.2023 14:20, George Dunlap wrote:
> On Thu, Sep 21, 2023 at 2:12 PM Juergen Gross <jgross@suse.com> wrote:
>>
>> On 21.09.23 14:23, George Dunlap wrote:
>>> [...]
>>> -        if ( snext->pri > CSCHED_PRI_TS_OVER )
>>> -            __runq_remove(snext);
>>> -        else
>>> +        if ( snext->pri <= CSCHED_PRI_TS_OVER
>>> +             && now - spc->last_load_balance > prv->load_balance_ratelimit) {
>>
>> ^ Just found a style issue (after Andrew pointing out the ones in patch 2).
> 
> Just checking, you mean the space before the closing `)`?

And the placement of the opening {.

Jan
George Dunlap Sept. 22, 2023, 2:46 p.m. UTC | #7
On Fri, Sep 22, 2023 at 2:49 PM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 22.09.2023 14:20, George Dunlap wrote:
> > On Thu, Sep 21, 2023 at 2:12 PM Juergen Gross <jgross@suse.com> wrote:
> >>
> >> On 21.09.23 14:23, George Dunlap wrote:
> >>> [...]
> >>> -        if ( snext->pri > CSCHED_PRI_TS_OVER )
> >>> -            __runq_remove(snext);
> >>> -        else
> >>> +        if ( snext->pri <= CSCHED_PRI_TS_OVER
> >>> +             && now - spc->last_load_balance > prv->load_balance_ratelimit) {
> >>
> >> ^ Just found a style issue (after Andrew pointing out the ones in patch 2).
> >
> > Just checking, you mean the space before the closing `)`?
>
> And the placement of the opening {.

Gah, sorry I missed that. :-( (Golang won't compile if the `{` isn't
on the same line.)

I guess I'll send a follow-up patch.

 -George
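
For reference, with both style nits addressed (the space before the
closing parenthesis and the opening brace on its own line), the
conditional would presumably end up as:

        if ( snext->pri <= CSCHED_PRI_TS_OVER
             && now - spc->last_load_balance > prv->load_balance_ratelimit )
        {
            spc->last_load_balance = now;
            snext = csched_load_balance(prv, sched_cpu, snext, &migrated);
        }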

Patch

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index f88e6a70ae..9c3c72a7f9 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1884,6 +1884,14 @@  By default, Xen will use the INVPCID instruction for TLB management if
 it is available.  This option can be used to cause Xen to fall back to
 older mechanisms, which are generally slower.
 
+### load-balance-ratelimit
+> `= <integer>`
+
+The minimum interval between load balancing events on a given pcpu, in
+microseconds.  A value of '0' will disable rate limiting.  Maximum
+value 1 second. At the moment only credit honors this parameter.
+Default 1ms.
+
 ### noirqbalance (x86)
 > `= <boolean>`
 
diff --git a/xen/common/sched/credit.c b/xen/common/sched/credit.c
index f2cd3d9da3..5c06f596d2 100644
--- a/xen/common/sched/credit.c
+++ b/xen/common/sched/credit.c
@@ -50,6 +50,10 @@ 
 #define CSCHED_TICKS_PER_TSLICE     3
 /* Default timeslice: 30ms */
 #define CSCHED_DEFAULT_TSLICE_MS    30
+/* Default load balancing ratelimit: 1ms */
+#define CSCHED_DEFAULT_LOAD_BALANCE_RATELIMIT_US 1000
+/* Max load balancing ratelimit: 1s */
+#define CSCHED_MAX_LOAD_BALANCE_RATELIMIT_US     1000000
 #define CSCHED_CREDITS_PER_MSEC     10
 /* Never set a timer shorter than this value. */
 #define CSCHED_MIN_TIMER            XEN_SYSCTL_SCHED_RATELIMIT_MIN
@@ -153,6 +157,7 @@  struct csched_pcpu {
 
     unsigned int idle_bias;
     unsigned int nr_runnable;
+    s_time_t last_load_balance;
 
     unsigned int tick;
     struct timer ticker;
@@ -218,7 +223,7 @@  struct csched_private {
 
     /* Period of master and tick in milliseconds */
     unsigned int tick_period_us, ticks_per_tslice;
-    s_time_t ratelimit, tslice, unit_migr_delay;
+    s_time_t ratelimit, tslice, unit_migr_delay, load_balance_ratelimit;
 
     struct list_head active_sdom;
     uint32_t weight;
@@ -612,6 +617,8 @@  init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
     BUG_ON(!is_idle_unit(curr_on_cpu(cpu)));
     cpumask_set_cpu(cpu, prv->idlers);
     spc->nr_runnable = 0;
+
+    spc->last_load_balance = NOW();
 }
 
 static void cf_check
@@ -1676,9 +1683,17 @@  csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
     return NULL;
 }
 
+/*
+ * Minimum delay, in microseconds, between load balance operations.
+ * This prevents spending too much time doing load balancing, particularly
+ * when the system has a high number of YIELDs due to spinlock priority inversion.
+ */
+static unsigned int __ro_after_init load_balance_ratelimit_us = CSCHED_DEFAULT_LOAD_BALANCE_RATELIMIT_US;
+integer_param("load-balance-ratelimit", load_balance_ratelimit_us);
+
 static struct csched_unit *
 csched_load_balance(struct csched_private *prv, int cpu,
-    struct csched_unit *snext, bool *stolen)
+                    struct csched_unit *snext, bool *stolen)
 {
     const struct cpupool *c = get_sched_res(cpu)->cpupool;
     struct csched_unit *speer;
@@ -1958,15 +1973,19 @@  static void cf_check csched_schedule(
         /*
          * SMP Load balance:
          *
-         * If the next highest priority local runnable UNIT has already eaten
-         * through its credits, look on other PCPUs to see if we have more
-         * urgent work... If not, csched_load_balance() will return snext, but
-         * already removed from the runq.
+         * If the next highest priority local runnable UNIT has
+         * already eaten through its credits (and we're below the
+         * balancing ratelimit), look on other PCPUs to see if we have
+         * more urgent work... If we don't, csched_load_balance() will
+         * return snext, but already removed from the runq.
          */
-        if ( snext->pri > CSCHED_PRI_TS_OVER )
-            __runq_remove(snext);
-        else
+        if ( snext->pri <= CSCHED_PRI_TS_OVER
+             && now - spc->last_load_balance > prv->load_balance_ratelimit) {
+            spc->last_load_balance = now;
             snext = csched_load_balance(prv, sched_cpu, snext, &migrated);
+        }
+        else
+            __runq_remove(snext);
 
     } while ( !unit_runnable_state(snext->unit) );
 
@@ -2181,6 +2200,14 @@  csched_global_init(void)
                XEN_SYSCTL_CSCHED_MGR_DLY_MAX_US, vcpu_migration_delay_us);
     }
 
+    if ( load_balance_ratelimit_us > CSCHED_MAX_LOAD_BALANCE_RATELIMIT_US )
+    {
+        load_balance_ratelimit_us = CSCHED_MAX_LOAD_BALANCE_RATELIMIT_US;
+        printk("WARNING: load-balance-ratelimit outside of valid range [0,%d]us.\n"
+               "Setting to max.\n",
+               CSCHED_MAX_LOAD_BALANCE_RATELIMIT_US);
+    }
+
     return 0;
 }
 
@@ -2223,6 +2250,8 @@  csched_init(struct scheduler *ops)
 
     prv->unit_migr_delay = MICROSECS(vcpu_migration_delay_us);
 
+    prv->load_balance_ratelimit = MICROSECS(load_balance_ratelimit_us);
+
     return 0;
 }
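
As a usage sketch: on a Debian-style setup the parameter would be
appended to the hypervisor (not dom0 kernel) command line, e.g. to
allow load balancing at most once every 500us (the value here is
illustrative):

    # /etc/default/grub
    GRUB_CMDLINE_XEN_DEFAULT="load-balance-ratelimit=500"

followed by regenerating the grub config and rebooting; per the
documentation hunk above, load-balance-ratelimit=0 disables the rate
limiting entirely.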