diff mbox series

[1/2] credit: Limit load balancing to once per millisecond

Message ID 20230630113756.672607-1-george.dunlap@cloud.com (mailing list archive)
State New, archived
Headers show
Series [1/2] credit: Limit load balancing to once per millisecond | expand

Commit Message

George Dunlap June 30, 2023, 11:37 a.m. UTC
The credit scheduler tries as hard as it can to ensure that it always
runs scheduling units with positive credit (PRI_TS_UNDER) before
running those with negative credit (PRI_TS_OVER).  If the next
runnable scheduling unit is of priority OVER, it will always run the
load balancer, which will scour the system looking for another
scheduling unit of the UNDER priority.

Unfortunately, as the number of cores on a system has grown, the cost
of the work-stealing algorithm has dramatically increased; a recent
trace on a system with 128 cores showed this taking over 50
microseconds.

Add a parameter, load_balance_ratelimit, to limit the frequency of
load balance operations on a given pcpu.  Default this to 1
millisecond.

Invert the load balancing conditional to make it more clear, and line
up more closely with the comment above it.

Overall it might be cleaner to have the last_load_balance checking
happen inside csched_load_balance(), but that would require either
passing both now and spc into the function, or looking them up again;
both of which seemed to be worse than simply checking and setting the
values before calling it.

Without this patch, on a system with a vcpu:pcpu ratio of 2:1, running
Windows guests (which will end up calling YIELD during spinlock
contention), this patch increased performance significantly.

Signed-off-by: George Dunlap <george.dunlap@cloud.com>
---
CC: Dario Faggioli <dfaggioli@suse.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
CC: George Dunlap <george.dunlap@citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Julien Grall <julien@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Wei Liu <wl@xen.org>
---
 docs/misc/xen-command-line.pandoc |  6 +++++
 xen/common/sched/credit.c         | 40 ++++++++++++++++++++++++++-----
 xen/include/public/sysctl.h       |  6 +++++
 3 files changed, 46 insertions(+), 6 deletions(-)

Comments

Marcus Granado June 30, 2023, 4:33 p.m. UTC | #1
On Fri, 30 Jun 2023 at 12:38, George Dunlap <george.dunlap@cloud.com> wrote:
> ...
> Add a parameter, load_balance_ratelimit, to limit the frequency of
> load balance operations on a given pcpu.  Default this to 1
> millisecond.
> ...
> Without this patch, on a system with a vcpu:pcpu ratio of 2:1, running
> Windows guests (which will end up calling YIELD during spinlock
> contention), this patch increased performance significantly.
> ---

I tested this patch with the following:
- the vcpu:pcpu ratio of circa 2:1 in the host,
- host with 72 pcpus
- and 9 Windows guests with 16 vcpus each,
- with background load inside the guests, where a few Windows apps
were starting and stopping in parallel in a loop in order to hit 100%
vcpu usage in each guest, so that the host pcpu usage was constantly
at 100%.
and observed a significant improvement in the responsiveness
experienced by a user moving the mouse and using the keyboard inside
the Windows guests, compared to when this patch is not present.

In addition, when launching a trivial executable like 'ver' remotely
in the Windows guests during the background load above, I noticed
that:
- when vcpu:pcpu is 1:1 and host has 100% pcpu usage -> with and
without this patch, launching 'ver' remotely took 1 second or so
- when vcpu:pcpu is 2:1 and host has 100% pcpu usage -> without this
patch, launching 'ver' remotely took several seconds
- when vcpu:pcpu is 2:1 and host has 100% pcpu usage -> with this
patch, launching 'ver' remotely took a couple of second only -- so
with this patch the host seems to be handling the extreme load at 100%
pcpu usage more gracefully and the latency inside the guest is more
linearly correlated to vcpu:pcpu..
Julien Grall June 30, 2023, 5:48 p.m. UTC | #2
Hi George,

On 30/06/2023 12:37, George Dunlap wrote:
> The credit scheduler tries as hard as it can to ensure that it always
> runs scheduling units with positive credit (PRI_TS_UNDER) before
> running those with negative credit (PRI_TS_OVER).  If the next
> runnable scheduling unit is of priority OVER, it will always run the
> load balancer, which will scour the system looking for another
> scheduling unit of the UNDER priority.
> 
> Unfortunately, as the number of cores on a system has grown, the cost
> of the work-stealing algorithm has dramatically increased; a recent
> trace on a system with 128 cores showed this taking over 50
> microseconds.
> 
> Add a parameter, load_balance_ratelimit, to limit the frequency of
> load balance operations on a given pcpu.  Default this to 1
> millisecond.
> 
> Invert the load balancing conditional to make it more clear, and line
> up more closely with the comment above it.
> 
> Overall it might be cleaner to have the last_load_balance checking
> happen inside csched_load_balance(), but that would require either
> passing both now and spc into the function, or looking them up again;
> both of which seemed to be worse than simply checking and setting the
> values before calling it.
> 
> Without this patch, on a system with a vcpu:pcpu ratio of 2:1, running
> Windows guests (which will end up calling YIELD during spinlock
> contention), this patch increased performance significantly.

I don't understand this sentence. Did you intende to write

"Without this patch, ..., the performance are significantly worse"?

> 
> Signed-off-by: George Dunlap <george.dunlap@cloud.com>
> ---
> CC: Dario Faggioli <dfaggioli@suse.com>
> CC: Andrew Cooper <andrew.cooper3@citrix.com>
> CC: George Dunlap <george.dunlap@citrix.com>
> CC: Jan Beulich <jbeulich@suse.com>
> CC: Julien Grall <julien@xen.org>
> CC: Stefano Stabellini <sstabellini@kernel.org>
> CC: Wei Liu <wl@xen.org>
> ---
>   docs/misc/xen-command-line.pandoc |  6 +++++
>   xen/common/sched/credit.c         | 40 ++++++++++++++++++++++++++-----
>   xen/include/public/sysctl.h       |  6 +++++
>   3 files changed, 46 insertions(+), 6 deletions(-)
> 
> diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
> index 4060ebdc5d..369557020f 100644
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -1856,6 +1856,12 @@ By default, Xen will use the INVPCID instruction for TLB management if
>   it is available.  This option can be used to cause Xen to fall back to
>   older mechanisms, which are generally slower.
>   
> +### load-balance-ratelimit
> +> `= <integer>`
> +
> +The minimum interval between load balancing events on a given pcpu.
> +At the moment only credit honors this parameter.

I would suggest to mention the default value. So a reader don't have to 
look up in the code to find out.

Also, AFAICT, there is max value. I would mention it here too.

> +
>   ### noirqbalance (x86)
>   > `= <boolean>`
>   
> diff --git a/xen/common/sched/credit.c b/xen/common/sched/credit.c
> index f2cd3d9da3..b8bdfd5f6a 100644
> --- a/xen/common/sched/credit.c
> +++ b/xen/common/sched/credit.c
> @@ -50,6 +50,8 @@
>   #define CSCHED_TICKS_PER_TSLICE     3
>   /* Default timeslice: 30ms */
>   #define CSCHED_DEFAULT_TSLICE_MS    30
> +/* Default load balancing ratelimit: 1ms */
> +#define CSCHED_DEFAULT_LOAD_BALANCE_RATELIMIT_US 1000
>   #define CSCHED_CREDITS_PER_MSEC     10
>   /* Never set a timer shorter than this value. */
>   #define CSCHED_MIN_TIMER            XEN_SYSCTL_SCHED_RATELIMIT_MIN
> @@ -153,6 +155,7 @@ struct csched_pcpu {
>   
>       unsigned int idle_bias;
>       unsigned int nr_runnable;
> +    s_time_t last_load_balance;
>   
>       unsigned int tick;
>       struct timer ticker;
> @@ -218,7 +221,7 @@ struct csched_private {
>   
>       /* Period of master and tick in milliseconds */
>       unsigned int tick_period_us, ticks_per_tslice;
> -    s_time_t ratelimit, tslice, unit_migr_delay;
> +    s_time_t ratelimit, tslice, unit_migr_delay, load_balance_ratelimit;
>   
>       struct list_head active_sdom;
>       uint32_t weight;
> @@ -612,6 +615,8 @@ init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
>       BUG_ON(!is_idle_unit(curr_on_cpu(cpu)));
>       cpumask_set_cpu(cpu, prv->idlers);
>       spc->nr_runnable = 0;
> +
> +    spc->last_load_balance = NOW();
>   }
>   
>   static void cf_check
> @@ -1267,7 +1272,8 @@ csched_sys_cntl(const struct scheduler *ops,
>                    && (params->ratelimit_us > XEN_SYSCTL_SCHED_RATELIMIT_MAX
>                        || params->ratelimit_us < XEN_SYSCTL_SCHED_RATELIMIT_MIN))
>                || MICROSECS(params->ratelimit_us) > MILLISECS(params->tslice_ms)
> -             || params->vcpu_migr_delay_us > XEN_SYSCTL_CSCHED_MGR_DLY_MAX_US )
> +             || params->vcpu_migr_delay_us > XEN_SYSCTL_CSCHED_MGR_DLY_MAX_US
> +             || params->load_balance_ratelimit_us > XEN_SYSCTL_CSCHED_LB_RATE_MAX_US)
>                   goto out;
>   
>           spin_lock_irqsave(&prv->lock, flags);
> @@ -1278,6 +1284,7 @@ csched_sys_cntl(const struct scheduler *ops,
>               printk(XENLOG_INFO "Disabling context switch rate limiting\n");
>           prv->ratelimit = MICROSECS(params->ratelimit_us);
>           prv->unit_migr_delay = MICROSECS(params->vcpu_migr_delay_us);
> +        prv->load_balance_ratelimit = MICROSECS(params->load_balance_ratelimit_us);
>           spin_unlock_irqrestore(&prv->lock, flags);
>   
>           /* FALLTHRU */
> @@ -1285,6 +1292,7 @@ csched_sys_cntl(const struct scheduler *ops,
>           params->tslice_ms = prv->tslice / MILLISECS(1);
>           params->ratelimit_us = prv->ratelimit / MICROSECS(1);
>           params->vcpu_migr_delay_us = prv->unit_migr_delay / MICROSECS(1);
> +        params->load_balance_ratelimit_us = prv->load_balance_ratelimit / MICROSECS(1);
>           rc = 0;
>           break;
>       }
> @@ -1676,9 +1684,17 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>       return NULL;
>   }
>   
> +/*
> + * Minimum delay, in microseconds, between load balance operations.
> + * This prevents spending too much time doing load balancing, particularly
> + * when the system has a high number of YIELDs due to spinlock priority inversion.
> + */
> +static unsigned int __read_mostly load_balance_ratelimit_us = CSCHED_DEFAULT_LOAD_BALANCE_RATELIMIT_US;

AFAICT, load_balance_ratelimit_us is not updated after boot. So 
shouldn't the attribute be __ro_after_init?

> +integer_param("load-balance-ratelimit", load_balance_ratelimit_us);
> +
>   static struct csched_unit *
>   csched_load_balance(struct csched_private *prv, int cpu,
> -    struct csched_unit *snext, bool *stolen)
> +                    struct csched_unit *snext, bool *stolen)
>   {
>       const struct cpupool *c = get_sched_res(cpu)->cpupool;
>       struct csched_unit *speer;
> @@ -1963,10 +1979,12 @@ static void cf_check csched_schedule(
>            * urgent work... If not, csched_load_balance() will return snext, but
>            * already removed from the runq.
>            */
> -        if ( snext->pri > CSCHED_PRI_TS_OVER )
> -            __runq_remove(snext);
> -        else
> +        if ( snext->pri <= CSCHED_PRI_TS_OVER
> +             && now - spc->last_load_balance > prv->load_balance_ratelimit) {
> +            spc->last_load_balance = now;
>               snext = csched_load_balance(prv, sched_cpu, snext, &migrated);
> +        } else
> +            __runq_remove(snext);
>   
>       } while ( !unit_runnable_state(snext->unit) );
>   
> @@ -2181,6 +2199,14 @@ csched_global_init(void)
>                  XEN_SYSCTL_CSCHED_MGR_DLY_MAX_US, vcpu_migration_delay_us);
>       }
>   
> +    if ( load_balance_ratelimit_us > XEN_SYSCTL_CSCHED_LB_RATE_MAX_US )
> +    {
> +        load_balance_ratelimit_us = CSCHED_DEFAULT_LOAD_BALANCE_RATELIMIT_US;
> +        printk("WARNING: load-balance-ratelimit outside of valid range [0,%d]us.\n"
> +               "Resetting to default: %u\n",
> +               XEN_SYSCTL_CSCHED_LB_RATE_MAX_US, load_balance_ratelimit_us);
> +    }
> +
>       return 0;
>   }
>   
> @@ -2223,6 +2249,8 @@ csched_init(struct scheduler *ops)
>   
>       prv->unit_migr_delay = MICROSECS(vcpu_migration_delay_us);
>   
> +    prv->load_balance_ratelimit = MICROSECS(load_balance_ratelimit_us);
> +
>       return 0;
>   }
>   
> diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
> index 9d06e92d0f..48f7f57037 100644
> --- a/xen/include/public/sysctl.h
> +++ b/xen/include/public/sysctl.h
> @@ -641,6 +641,12 @@ struct xen_sysctl_credit_schedule {
>       */
>   #define XEN_SYSCTL_CSCHED_MGR_DLY_MAX_US (100 * 1000)
>       uint32_t vcpu_migr_delay_us;
> +    /*
> +     * Minimum delay, in microseconds, between load balance
> +     * operations; max 1 second.
> +     */
> +#define XEN_SYSCTL_CSCHED_LB_RATE_MAX_US (1000000)
> +    uint32_t load_balance_ratelimit_us;

Shouldn't this change bump the sysctl interface version?

>   };
>   
>   struct xen_sysctl_credit2_schedule {
Julien Grall June 30, 2023, 5:49 p.m. UTC | #3
Hi,

The subject stay "1/2" but I don't see any "2/2" nor cover letter. Where 
can I find the rest of the series?

Cheers,
George Dunlap July 3, 2023, 11:11 a.m. UTC | #4
On Fri, Jun 30, 2023 at 6:49 PM Julien Grall <julien@xen.org> wrote:

> Hi,
>
> The subject stay "1/2" but I don't see any "2/2" nor cover letter. Where
> can I find the rest of the series?
>

I didn't make a cover letter, but the mail archive seems to have both
messages:

https://lists.xenproject.org/archives/html/xen-devel/2023-06/msg02298.html

 -George
George Dunlap July 3, 2023, 11:39 a.m. UTC | #5
On Fri, Jun 30, 2023 at 6:48 PM Julien Grall <julien@xen.org> wrote:

> Hi George,
>
> On 30/06/2023 12:37, George Dunlap wrote:
> > The credit scheduler tries as hard as it can to ensure that it always
> > runs scheduling units with positive credit (PRI_TS_UNDER) before
> > running those with negative credit (PRI_TS_OVER).  If the next
> > runnable scheduling unit is of priority OVER, it will always run the
> > load balancer, which will scour the system looking for another
> > scheduling unit of the UNDER priority.
> >
> > Unfortunately, as the number of cores on a system has grown, the cost
> > of the work-stealing algorithm has dramatically increased; a recent
> > trace on a system with 128 cores showed this taking over 50
> > microseconds.
> >
> > Add a parameter, load_balance_ratelimit, to limit the frequency of
> > load balance operations on a given pcpu.  Default this to 1
> > millisecond.
> >
> > Invert the load balancing conditional to make it more clear, and line
> > up more closely with the comment above it.
> >
> > Overall it might be cleaner to have the last_load_balance checking
> > happen inside csched_load_balance(), but that would require either
> > passing both now and spc into the function, or looking them up again;
> > both of which seemed to be worse than simply checking and setting the
> > values before calling it.
> >
> > Without this patch, on a system with a vcpu:pcpu ratio of 2:1, running
> > Windows guests (which will end up calling YIELD during spinlock
> > contention), this patch increased performance significantly.
>
> I don't understand this sentence. Did you intende to write
>
> "Without this patch, ..., the performance are significantly worse"?
>

Hmm, yes this was bad editing.  The first clause was written when I was
expecting to include actual numbers; but when I looked at the internal
numbers I had available, they weren't easy to summarize.  The revised
sentence should have simply been:

"On a system with a vcpu:pcpu ratio of 2:1, running Windows guests (which
will end up calling YIELD during spinlock contention), this patch increased
performance significantly."



> > --- a/docs/misc/xen-command-line.pandoc
> > +++ b/docs/misc/xen-command-line.pandoc
> > @@ -1856,6 +1856,12 @@ By default, Xen will use the INVPCID instruction
> for TLB management if
> >   it is available.  This option can be used to cause Xen to fall back to
> >   older mechanisms, which are generally slower.
> >
> > +### load-balance-ratelimit
> > +> `= <integer>`
> > +
> > +The minimum interval between load balancing events on a given pcpu.
> > +At the moment only credit honors this parameter.
>
> I would suggest to mention the default value. So a reader don't have to
> look up in the code to find out.
>
> Also, AFAICT, there is max value. I would mention it here too.
>

Ack


> > +/*
> > + * Minimum delay, in microseconds, between load balance operations.
> > + * This prevents spending too much time doing load balancing,
> particularly
> > + * when the system has a high number of YIELDs due to spinlock priority
> inversion.
> > + */
> > +static unsigned int __read_mostly load_balance_ratelimit_us =
> CSCHED_DEFAULT_LOAD_BALANCE_RATELIMIT_US;
>
> AFAICT, load_balance_ratelimit_us is not updated after boot. So
> shouldn't the attribute be __ro_after_init?
>

Ack


> +#define XEN_SYSCTL_CSCHED_LB_RATE_MAX_US (1000000)
> > +    uint32_t load_balance_ratelimit_us;
>
> Shouldn't this change bump the sysctl interface version?
>

Er, yes.

v2 on its way.

 -George
Andrew Cooper July 4, 2023, 5:34 p.m. UTC | #6
On 30/06/2023 12:37 pm, George Dunlap wrote:
> The credit scheduler tries as hard as it can to ensure that it always
> runs scheduling units with positive credit (PRI_TS_UNDER) before
> running those with negative credit (PRI_TS_OVER).  If the next
> runnable scheduling unit is of priority OVER, it will always run the
> load balancer, which will scour the system looking for another
> scheduling unit of the UNDER priority.
>
> Unfortunately, as the number of cores on a system has grown, the cost
> of the work-stealing algorithm has dramatically increased; a recent
> trace on a system with 128 cores showed this taking over 50
> microseconds.
>
> Add a parameter, load_balance_ratelimit, to limit the frequency of
> load balance operations on a given pcpu.  Default this to 1
> millisecond.
>
> Invert the load balancing conditional to make it more clear, and line
> up more closely with the comment above it.
>
> Overall it might be cleaner to have the last_load_balance checking
> happen inside csched_load_balance(), but that would require either
> passing both now and spc into the function, or looking them up again;
> both of which seemed to be worse than simply checking and setting the
> values before calling it.
>
> Without this patch, on a system with a vcpu:pcpu ratio of 2:1, running
> Windows guests (which will end up calling YIELD during spinlock
> contention), this patch increased performance significantly.
>
> Signed-off-by: George Dunlap <george.dunlap@cloud.com>
> ---
> CC: Dario Faggioli <dfaggioli@suse.com>
> CC: Andrew Cooper <andrew.cooper3@citrix.com>
> CC: George Dunlap <george.dunlap@citrix.com>
> CC: Jan Beulich <jbeulich@suse.com>
> CC: Julien Grall <julien@xen.org>
> CC: Stefano Stabellini <sstabellini@kernel.org>
> CC: Wei Liu <wl@xen.org>
> ---
>  docs/misc/xen-command-line.pandoc |  6 +++++
>  xen/common/sched/credit.c         | 40 ++++++++++++++++++++++++++-----
>  xen/include/public/sysctl.h       |  6 +++++

Given this filelist, why the sysctl change?

There's no logic to drive this parameter in the xc/libxl param get/set.

The only two in-tree users I can see are xenpm, along with an
unconditional print to stderr saying it's deprecated and to use xl, and xl.

> diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
> index 4060ebdc5d..369557020f 100644
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -1856,6 +1856,12 @@ By default, Xen will use the INVPCID instruction for TLB management if
>  it is available.  This option can be used to cause Xen to fall back to
>  older mechanisms, which are generally slower.
>  
> +### load-balance-ratelimit
> +> `= <integer>`
> +
> +The minimum interval between load balancing events on a given pcpu.
> +At the moment only credit honors this parameter.

So this is intended to be a global scheduler parameter?

> diff --git a/xen/common/sched/credit.c b/xen/common/sched/credit.c
> index f2cd3d9da3..b8bdfd5f6a 100644
> --- a/xen/common/sched/credit.c
> +++ b/xen/common/sched/credit.c
> @@ -1267,7 +1272,8 @@ csched_sys_cntl(const struct scheduler *ops,
>                   && (params->ratelimit_us > XEN_SYSCTL_SCHED_RATELIMIT_MAX
>                       || params->ratelimit_us < XEN_SYSCTL_SCHED_RATELIMIT_MIN))
>               || MICROSECS(params->ratelimit_us) > MILLISECS(params->tslice_ms)
> -             || params->vcpu_migr_delay_us > XEN_SYSCTL_CSCHED_MGR_DLY_MAX_US )
> +             || params->vcpu_migr_delay_us > XEN_SYSCTL_CSCHED_MGR_DLY_MAX_US
> +             || params->load_balance_ratelimit_us > XEN_SYSCTL_CSCHED_LB_RATE_MAX_US)

Style (give or take this hunk being with some logic to drive the new
sysctl).

> @@ -1963,10 +1979,12 @@ static void cf_check csched_schedule(
>           * urgent work... If not, csched_load_balance() will return snext, but
>           * already removed from the runq.
>           */
> -        if ( snext->pri > CSCHED_PRI_TS_OVER )
> -            __runq_remove(snext);
> -        else
> +        if ( snext->pri <= CSCHED_PRI_TS_OVER
> +             && now - spc->last_load_balance > prv->load_balance_ratelimit) {
> +            spc->last_load_balance = now;
>              snext = csched_load_balance(prv, sched_cpu, snext, &migrated);
> +        } else
> +            __runq_remove(snext);

Style.

~Andrew
George Dunlap July 5, 2023, 9:25 a.m. UTC | #7
On Tue, Jul 4, 2023 at 6:34 PM Andrew Cooper <andrew.cooper3@citrix.com>
wrote:

> On 30/06/2023 12:37 pm, George Dunlap wrote:
> >
> >  docs/misc/xen-command-line.pandoc |  6 +++++
> >  xen/common/sched/credit.c         | 40 ++++++++++++++++++++++++++-----
> >  xen/include/public/sysctl.h       |  6 +++++
>
> Given this filelist, why the sysctl change?
>
> There's no logic to drive this parameter in the xc/libxl param get/set.
>
> The only two in-tree users I can see are xenpm, along with an
> unconditional print to stderr saying it's deprecated and to use xl, and xl.
>

Yes.  I think unless someone objects then, I'll drop the sysctl interface
from this patch, and add it in a follow-up patch (perhaps at a later date,
depending on how much time I have this week).


> > diff --git a/docs/misc/xen-command-line.pandoc
> b/docs/misc/xen-command-line.pandoc
> > index 4060ebdc5d..369557020f 100644
> > --- a/docs/misc/xen-command-line.pandoc
> > +++ b/docs/misc/xen-command-line.pandoc
> > @@ -1856,6 +1856,12 @@ By default, Xen will use the INVPCID instruction
> for TLB management if
> >  it is available.  This option can be used to cause Xen to fall back to
> >  older mechanisms, which are generally slower.
> >
> > +### load-balance-ratelimit
> > +> `= <integer>`
> > +
> > +The minimum interval between load balancing events on a given pcpu.
> > +At the moment only credit honors this parameter.
>
> So this is intended to be a global scheduler parameter?
>

Yes; nearly every scheduler does load balancing, and so any scheduler may
need this sort of limitation at some point.  It doesn't make sense to have
separate, nearly identical parameters for each scheduler.  At some point
this should probably be implemented for credit2, although perhaps with a
different default.

Ack on the two style comments.

 -George
Andrew Cooper July 5, 2023, 11:33 a.m. UTC | #8
On 05/07/2023 10:25 am, George Dunlap wrote:
> On Tue, Jul 4, 2023 at 6:34 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> On 30/06/2023 12:37 pm, George Dunlap wrote:
>>>
>>>  docs/misc/xen-command-line.pandoc |  6 +++++
>>>  xen/common/sched/credit.c         | 40 ++++++++++++++++++++++++++-----
>>>  xen/include/public/sysctl.h       |  6 +++++
> 
>> Given this filelist, why the sysctl change?
> 
>> There's no logic to drive this parameter in the xc/libxl param get/set.
>> 
>> The only two in-tree users I can see are xenpm, along with an
>> unconditional print to stderr saying it's deprecated and to use xl,
>> and xl.
> 
> 
> Yes.  I think unless someone objects then, I'll drop the sysctl
> interface from this patch, and add it in a follow-up patch (perhaps at a
> later date, depending on how much time I have this week).

Yeah, that sounds fine.

~Andrew
diff mbox series

Patch

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 4060ebdc5d..369557020f 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1856,6 +1856,12 @@  By default, Xen will use the INVPCID instruction for TLB management if
 it is available.  This option can be used to cause Xen to fall back to
 older mechanisms, which are generally slower.
 
+### load-balance-ratelimit
+> `= <integer>`
+
+The minimum interval between load balancing events on a given pcpu.
+At the moment only credit honors this parameter.
+
 ### noirqbalance (x86)
 > `= <boolean>`
 
diff --git a/xen/common/sched/credit.c b/xen/common/sched/credit.c
index f2cd3d9da3..b8bdfd5f6a 100644
--- a/xen/common/sched/credit.c
+++ b/xen/common/sched/credit.c
@@ -50,6 +50,8 @@ 
 #define CSCHED_TICKS_PER_TSLICE     3
 /* Default timeslice: 30ms */
 #define CSCHED_DEFAULT_TSLICE_MS    30
+/* Default load balancing ratelimit: 1ms */
+#define CSCHED_DEFAULT_LOAD_BALANCE_RATELIMIT_US 1000
 #define CSCHED_CREDITS_PER_MSEC     10
 /* Never set a timer shorter than this value. */
 #define CSCHED_MIN_TIMER            XEN_SYSCTL_SCHED_RATELIMIT_MIN
@@ -153,6 +155,7 @@  struct csched_pcpu {
 
     unsigned int idle_bias;
     unsigned int nr_runnable;
+    s_time_t last_load_balance;
 
     unsigned int tick;
     struct timer ticker;
@@ -218,7 +221,7 @@  struct csched_private {
 
     /* Period of master and tick in milliseconds */
     unsigned int tick_period_us, ticks_per_tslice;
-    s_time_t ratelimit, tslice, unit_migr_delay;
+    s_time_t ratelimit, tslice, unit_migr_delay, load_balance_ratelimit;
 
     struct list_head active_sdom;
     uint32_t weight;
@@ -612,6 +615,8 @@  init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
     BUG_ON(!is_idle_unit(curr_on_cpu(cpu)));
     cpumask_set_cpu(cpu, prv->idlers);
     spc->nr_runnable = 0;
+
+    spc->last_load_balance = NOW();
 }
 
 static void cf_check
@@ -1267,7 +1272,8 @@  csched_sys_cntl(const struct scheduler *ops,
                  && (params->ratelimit_us > XEN_SYSCTL_SCHED_RATELIMIT_MAX
                      || params->ratelimit_us < XEN_SYSCTL_SCHED_RATELIMIT_MIN))
              || MICROSECS(params->ratelimit_us) > MILLISECS(params->tslice_ms)
-             || params->vcpu_migr_delay_us > XEN_SYSCTL_CSCHED_MGR_DLY_MAX_US )
+             || params->vcpu_migr_delay_us > XEN_SYSCTL_CSCHED_MGR_DLY_MAX_US
+             || params->load_balance_ratelimit_us > XEN_SYSCTL_CSCHED_LB_RATE_MAX_US)
                 goto out;
 
         spin_lock_irqsave(&prv->lock, flags);
@@ -1278,6 +1284,7 @@  csched_sys_cntl(const struct scheduler *ops,
             printk(XENLOG_INFO "Disabling context switch rate limiting\n");
         prv->ratelimit = MICROSECS(params->ratelimit_us);
         prv->unit_migr_delay = MICROSECS(params->vcpu_migr_delay_us);
+        prv->load_balance_ratelimit = MICROSECS(params->load_balance_ratelimit_us);
         spin_unlock_irqrestore(&prv->lock, flags);
 
         /* FALLTHRU */
@@ -1285,6 +1292,7 @@  csched_sys_cntl(const struct scheduler *ops,
         params->tslice_ms = prv->tslice / MILLISECS(1);
         params->ratelimit_us = prv->ratelimit / MICROSECS(1);
         params->vcpu_migr_delay_us = prv->unit_migr_delay / MICROSECS(1);
+        params->load_balance_ratelimit_us = prv->load_balance_ratelimit / MICROSECS(1);
         rc = 0;
         break;
     }
@@ -1676,9 +1684,17 @@  csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
     return NULL;
 }
 
+/*
+ * Minimum delay, in microseconds, between load balance operations.
+ * This prevents spending too much time doing load balancing, particularly
+ * when the system has a high number of YIELDs due to spinlock priority inversion.
+ */
+static unsigned int __read_mostly load_balance_ratelimit_us = CSCHED_DEFAULT_LOAD_BALANCE_RATELIMIT_US;
+integer_param("load-balance-ratelimit", load_balance_ratelimit_us);
+
 static struct csched_unit *
 csched_load_balance(struct csched_private *prv, int cpu,
-    struct csched_unit *snext, bool *stolen)
+                    struct csched_unit *snext, bool *stolen)
 {
     const struct cpupool *c = get_sched_res(cpu)->cpupool;
     struct csched_unit *speer;
@@ -1963,10 +1979,12 @@  static void cf_check csched_schedule(
          * urgent work... If not, csched_load_balance() will return snext, but
          * already removed from the runq.
          */
-        if ( snext->pri > CSCHED_PRI_TS_OVER )
-            __runq_remove(snext);
-        else
+        if ( snext->pri <= CSCHED_PRI_TS_OVER
+             && now - spc->last_load_balance > prv->load_balance_ratelimit) {
+            spc->last_load_balance = now;
             snext = csched_load_balance(prv, sched_cpu, snext, &migrated);
+        } else
+            __runq_remove(snext);
 
     } while ( !unit_runnable_state(snext->unit) );
 
@@ -2181,6 +2199,14 @@  csched_global_init(void)
                XEN_SYSCTL_CSCHED_MGR_DLY_MAX_US, vcpu_migration_delay_us);
     }
 
+    if ( load_balance_ratelimit_us > XEN_SYSCTL_CSCHED_LB_RATE_MAX_US )
+    {
+        load_balance_ratelimit_us = CSCHED_DEFAULT_LOAD_BALANCE_RATELIMIT_US;
+        printk("WARNING: load-balance-ratelimit outside of valid range [0,%d]us.\n"
+               "Resetting to default: %u\n",
+               XEN_SYSCTL_CSCHED_LB_RATE_MAX_US, load_balance_ratelimit_us);
+    }
+
     return 0;
 }
 
@@ -2223,6 +2249,8 @@  csched_init(struct scheduler *ops)
 
     prv->unit_migr_delay = MICROSECS(vcpu_migration_delay_us);
 
+    prv->load_balance_ratelimit = MICROSECS(load_balance_ratelimit_us);
+
     return 0;
 }
 
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 9d06e92d0f..48f7f57037 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -641,6 +641,12 @@  struct xen_sysctl_credit_schedule {
     */
 #define XEN_SYSCTL_CSCHED_MGR_DLY_MAX_US (100 * 1000)
     uint32_t vcpu_migr_delay_us;
+    /*
+     * Minimum delay, in microseconds, between load balance
+     * operations; max 1 second.
+     */
+#define XEN_SYSCTL_CSCHED_LB_RATE_MAX_US (1000000)
+    uint32_t load_balance_ratelimit_us;
 };
 
 struct xen_sysctl_credit2_schedule {