| Message ID | 1460100925.13871.6.camel@citrix.com (mailing list archive) |
|---|---|
| State | New, archived |
On 08/04/16 09:35, Dario Faggioli wrote:
> On Fri, 2016-04-08 at 06:18 +0200, Juergen Gross wrote:
>> On 08/04/16 03:24, Dario Faggioli wrote:
>>>
>>> In fact, credit2 uses CPU topology to decide how to arrange
>>> its internal runqueues. Before this change, only 'one runqueue
>>> per socket' was allowed. However, experiments have shown that,
>>> for instance, having one runqueue per physical core improves
>>> performance, especially in case hyperthreading is available.
>>>
>>> In general, it makes sense to allow users to pick one runqueue
>>> arrangement at boot time, so that:
>>>  - more experiments can be easily performed to even better
>>>    assess and improve performance;
>>>  - one can select the best configuration for his specific
>>>    use case and/or hardware.
>>>
>>> This patch enables the above.
>>>
>>> Note that, for correctly arranging runqueues to be per-core,
>>> just checking cpu_to_core() on the host CPUs is not enough.
>>> In fact, cores (and hyperthreads) on different sockets, can
>>> have the same core (and thread) IDs! We, therefore, need to
>>> check whether the full topology of two CPUs matches, for
>>> them to be put in the same runqueue.
>>>
>>> Note also that the default (although not functional) for
>>> credit2, up to now, has been per-socket runqueue. This patch
>>> leaves things that way, to avoid mixing policy and technical
>>> changes.
>>>
>>> Finally, it would be a nice feature to be able to select
>>> a particular runqueue arrangement, even when creating a
>>> Credit2 cpupool. This is left as future work.
>>>
>>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>>> Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
>>
>> Some nits below.
>>
> Thanks for the quick review!
>
> A revised version of this patch is provided here (both inlined and
> attached), and a branch with the remaining to be committed patches of
> this series, and with this patch changed as you suggest, is available
> at:
>
>   git://xenbits.xen.org/people/dariof/xen.git rel/sched/credit2/fix-runq-and-haff-v4
>   http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/rel/sched/credit2/fix-runq-and-haff-v4

Thanks.

Reviewed-by: Juergen Gross <jgross@suse.com>

>
> Regards,
> Dario
> ---
> commit 7f491488bbff1cc3af021cd29fca7e0fba321e02
> Author: Dario Faggioli <dario.faggioli@citrix.com>
> Date:   Tue Sep 29 14:05:09 2015 +0200
>
>     xen: sched: allow for choosing credit2 runqueues configuration at boot
>
>     In fact, credit2 uses CPU topology to decide how to arrange
>     its internal runqueues. Before this change, only 'one runqueue
>     per socket' was allowed. However, experiments have shown that,
>     for instance, having one runqueue per physical core improves
>     performance, especially in case hyperthreading is available.
>
>     In general, it makes sense to allow users to pick one runqueue
>     arrangement at boot time, so that:
>      - more experiments can be easily performed to even better
>        assess and improve performance;
>      - one can select the best configuration for his specific
>        use case and/or hardware.
>
>     This patch enables the above.
>
>     Note that, for correctly arranging runqueues to be per-core,
>     just checking cpu_to_core() on the host CPUs is not enough.
>     In fact, cores (and hyperthreads) on different sockets, can
>     have the same core (and thread) IDs! We, therefore, need to
>     check whether the full topology of two CPUs matches, for
>     them to be put in the same runqueue.
>
>     Note also that the default (although not functional) for
>     credit2, up to now, has been per-socket runqueue. This patch
>     leaves things that way, to avoid mixing policy and technical
>     changes.
>
>     Finally, it would be a nice feature to be able to select
>     a particular runqueue arrangement, even when creating a
>     Credit2 cpupool. This is left as future work.
>
>     Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>     Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Uma Sharma <uma.sharma523@gmail.com>
> Cc: Juergen Gross <jgross@suse.com>
> ---
> Changes from v3:
>  * fix typos and other issues in comments;
>    use ARRAY_SIZE when iterating the parameter string array.
>
> Changes from v2:
>  * valid strings are now in an array, that we scan during
>    parameter parsing, as suggested during review.
>
> Changes from v1:
>  * fix bug in parameter parsing, and start using strcmp()
>    for that, as requested during review.
>
> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> index ca77e3b..0047f94 100644
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -469,6 +469,25 @@ combination with the `low_crashinfo` command line option.
>  ### credit2\_load\_window\_shift
>  > `= <integer>`
>  
> +### credit2\_runqueue
> +> `= core | socket | node | all`
> +
> +> Default: `socket`
> +
> +Specify how host CPUs are arranged in runqueues. Runqueues are kept
> +balanced with respect to the load generated by the vCPUs running on
> +them. Smaller runqueues (as in with `core`) means more accurate load
> +balancing (for instance, it will deal better with hyperthreading),
> +but also more overhead.
> +
> +Available alternatives, with their meaning, are:
> +* `core`: one runqueue per each physical core of the host;
> +* `socket`: one runqueue per each physical socket (which often,
> +  but not always, matches a NUMA node) of the host;
> +* `node`: one runqueue per each NUMA node of the host;
> +* `all`: just one runqueue shared by all the logical pCPUs of
> +  the host
> +
>  ### dbgp
>  > `= ehci[ <integer> | @pci<bus>:<slot>.<func> ]`
>  
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index a61a45a..d43f67a 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -81,10 +81,6 @@
>   * Credits are "reset" when the next vcpu in the runqueue is less than
>   * or equal to zero. At that point, everyone's credits are "clipped"
>   * to a small value, and a fixed credit is added to everyone.
> - *
> - * The plan is for all cores that share an L2 will share the same
> - * runqueue. At the moment, there is one global runqueue for all
> - * cores.
>   */
>  
>  /*
> @@ -193,6 +189,63 @@ static int __read_mostly opt_overload_balance_tolerance = -3;
>  integer_param("credit2_balance_over", opt_overload_balance_tolerance);
>  
>  /*
> + * Runqueue organization.
> + *
> + * The various cpus are to be assigned each one to a runqueue, and we
> + * want that to happen basing on topology. At the moment, it is possible
> + * to choose to arrange runqueues to be:
> + *
> + * - per-core: meaning that there will be one runqueue per each physical
> + *             core of the host. This will happen if the opt_runqueue
> + *             parameter is set to 'core';
> + *
> + * - per-socket: meaning that there will be one runqueue per each physical
> + *               socket (AKA package, which often, but not always, also
> + *               matches a NUMA node) of the host; This will happen if
> + *               the opt_runqueue parameter is set to 'socket';
> + *
> + * - per-node: meaning that there will be one runqueue per each physical
> + *             NUMA node of the host. This will happen if the opt_runqueue
> + *             parameter is set to 'node';
> + *
> + * - global: meaning that there will be only one runqueue to which all the
> + *           (logical) processors of the host belong. This will happen if
> + *           the opt_runqueue parameter is set to 'all'.
> + *
> + * Depending on the value of opt_runqueue, therefore, cpus that are part of
> + * either the same physical core, the same physical socket, the same NUMA
> + * node, or just all of them, will be put together to form runqueues.
> + */
> +#define OPT_RUNQUEUE_CORE   0
> +#define OPT_RUNQUEUE_SOCKET 1
> +#define OPT_RUNQUEUE_NODE   2
> +#define OPT_RUNQUEUE_ALL    3
> +static const char *const opt_runqueue_str[] = {
> +    [OPT_RUNQUEUE_CORE] = "core",
> +    [OPT_RUNQUEUE_SOCKET] = "socket",
> +    [OPT_RUNQUEUE_NODE] = "node",
> +    [OPT_RUNQUEUE_ALL] = "all"
> +};
> +static int __read_mostly opt_runqueue = OPT_RUNQUEUE_SOCKET;
> +
> +static void parse_credit2_runqueue(const char *s)
> +{
> +    unsigned int i;
> +
> +    for ( i = 0; i < ARRAY_SIZE(opt_runqueue_str); i++ )
> +    {
> +        if ( !strcmp(s, opt_runqueue_str[i]) )
> +        {
> +            opt_runqueue = i;
> +            return;
> +        }
> +    }
> +
> +    printk("WARNING, unrecognized value of credit2_runqueue option!\n");
> +}
> +custom_param("credit2_runqueue", parse_credit2_runqueue);
> +
> +/*
>   * Per-runqueue data
>   */
>  struct csched2_runqueue_data {
> @@ -1974,6 +2027,22 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
>      cpumask_clear_cpu(rqi, &prv->active_queues);
>  }
>  
> +static inline bool_t same_node(unsigned int cpua, unsigned int cpub)
> +{
> +    return cpu_to_node(cpua) == cpu_to_node(cpub);
> +}
> +
> +static inline bool_t same_socket(unsigned int cpua, unsigned int cpub)
> +{
> +    return cpu_to_socket(cpua) == cpu_to_socket(cpub);
> +}
> +
> +static inline bool_t same_core(unsigned int cpua, unsigned int cpub)
> +{
> +    return same_socket(cpua, cpub) &&
> +           cpu_to_core(cpua) == cpu_to_core(cpub);
> +}
> +
>  static unsigned int
>  cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
>  {
> @@ -2006,7 +2075,10 @@ cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
>          BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID ||
>                 cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID);
>  
> -        if ( cpu_to_socket(cpumask_first(&rqd->active)) == cpu_to_socket(cpu) )
> +        if ( opt_runqueue == OPT_RUNQUEUE_ALL ||
> +             (opt_runqueue == OPT_RUNQUEUE_CORE && same_core(peer_cpu, cpu)) ||
> +             (opt_runqueue == OPT_RUNQUEUE_SOCKET && same_socket(peer_cpu, cpu)) ||
> +             (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu)) )
>              break;
>      }
>  
> @@ -2170,6 +2242,7 @@ csched2_init(struct scheduler *ops)
>      printk(" load_window_shift: %d\n", opt_load_window_shift);
>      printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance);
>      printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance);
> +    printk(" runqueues arrangement: %s\n", opt_runqueue_str[opt_runqueue]);
>  
>      if ( opt_load_window_shift < LOADAVG_WINDOW_SHIFT_MIN )
>      {
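For reference, here is how the option introduced by this patch would be selected at boot. This is only a sketch of a Debian-style GRUB setup: the file location and the GRUB_CMDLINE_XEN variable name vary by distribution and are assumptions here, and `sched=credit2` is included because credit2 was not the default scheduler at the time:

    # /etc/default/grub -- options passed on the Xen (hypervisor) command line.
    # credit2_runqueue accepts: core | socket | node | all   (default: socket)
    GRUB_CMDLINE_XEN="sched=credit2 credit2_runqueue=core"

After regenerating the GRUB configuration and rebooting, the "runqueues arrangement:" line printed by csched2_init() confirms which arrangement is actually in effect.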
On Fri, 2016-04-08 at 09:39 +0200, Juergen Gross wrote:
> On 08/04/16 09:35, Dario Faggioli wrote:
> >
> > A revised version of this patch is provided here (both inlined and
> > attached), and a branch with the remaining to be committed patches of
> > this series, and with this patch changed as you suggest, is available
> > at:
> >
> >   git://xenbits.xen.org/people/dariof/xen.git rel/sched/credit2/fix-runq-and-haff-v4
> >   http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/rel/sched/credit2/fix-runq-and-haff-v4
>
> Thanks.
>
Well, thanks to you. :-)

> Reviewed-by: Juergen Gross <jgross@suse.com>
>
I've updated the branch at:

  git://xenbits.xen.org/people/dariof/xen.git rel/sched/credit2/fix-runq-and-haff-v4
  http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/rel/sched/credit2/fix-runq-and-haff-v4

so that this patch, in there, now has this tag. I should say that I
force-pushed to it, but I don't really expect anyone to be tracking
it. :-P

Regards,
Dario
On 08/04/16 08:35, Dario Faggioli wrote:
> On Fri, 2016-04-08 at 06:18 +0200, Juergen Gross wrote:
>> On 08/04/16 03:24, Dario Faggioli wrote:
>>>
>>> In fact, credit2 uses CPU topology to decide how to arrange
>>> its internal runqueues. Before this change, only 'one runqueue
>>> per socket' was allowed. However, experiments have shown that,
>>> for instance, having one runqueue per physical core improves
>>> performance, especially in case hyperthreading is available.
>>>
>>> In general, it makes sense to allow users to pick one runqueue
>>> arrangement at boot time, so that:
>>>  - more experiments can be easily performed to even better
>>>    assess and improve performance;
>>>  - one can select the best configuration for his specific
>>>    use case and/or hardware.
>>>
>>> This patch enables the above.
>>>
>>> Note that, for correctly arranging runqueues to be per-core,
>>> just checking cpu_to_core() on the host CPUs is not enough.
>>> In fact, cores (and hyperthreads) on different sockets, can
>>> have the same core (and thread) IDs! We, therefore, need to
>>> check whether the full topology of two CPUs matches, for
>>> them to be put in the same runqueue.
>>>
>>> Note also that the default (although not functional) for
>>> credit2, up to now, has been per-socket runqueue. This patch
>>> leaves things that way, to avoid mixing policy and technical
>>> changes.
>>>
>>> Finally, it would be a nice feature to be able to select
>>> a particular runqueue arrangement, even when creating a
>>> Credit2 cpupool. This is left as future work.
>>>
>>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>>> Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
>>
>> Some nits below.
>>
> Thanks for the quick review!
>
> A revised version of this patch is provided here (both inlined and
> attached), and a branch with the remaining to be committed patches of
> this series, and with this patch changed as you suggest, is available
> at:
>
>   git://xenbits.xen.org/people/dariof/xen.git rel/sched/credit2/fix-runq-and-haff-v4
>   http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/rel/sched/credit2/fix-runq-and-haff-v4
>
> Regards,
> Dario
> ---
> commit 7f491488bbff1cc3af021cd29fca7e0fba321e02
> Author: Dario Faggioli <dario.faggioli@citrix.com>
> Date:   Tue Sep 29 14:05:09 2015 +0200
>
>     xen: sched: allow for choosing credit2 runqueues configuration at boot
>
>     In fact, credit2 uses CPU topology to decide how to arrange
>     its internal runqueues. Before this change, only 'one runqueue
>     per socket' was allowed. However, experiments have shown that,
>     for instance, having one runqueue per physical core improves
>     performance, especially in case hyperthreading is available.
>
>     In general, it makes sense to allow users to pick one runqueue
>     arrangement at boot time, so that:
>      - more experiments can be easily performed to even better
>        assess and improve performance;
>      - one can select the best configuration for his specific
>        use case and/or hardware.
>
>     This patch enables the above.
>
>     Note that, for correctly arranging runqueues to be per-core,
>     just checking cpu_to_core() on the host CPUs is not enough.
>     In fact, cores (and hyperthreads) on different sockets, can
>     have the same core (and thread) IDs! We, therefore, need to
>     check whether the full topology of two CPUs matches, for
>     them to be put in the same runqueue.
>
>     Note also that the default (although not functional) for
>     credit2, up to now, has been per-socket runqueue. This patch
>     leaves things that way, to avoid mixing policy and technical
>     changes.
>
>     Finally, it would be a nice feature to be able to select
>     a particular runqueue arrangement, even when creating a
>     Credit2 cpupool. This is left as future work.
>
>     Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>     Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>

Reviewed-by: George Dunlap <george.dunlap@citrix.com>

> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Uma Sharma <uma.sharma523@gmail.com>
> Cc: Juergen Gross <jgross@suse.com>
> ---
> Changes from v3:
>  * fix typos and other issues in comments;
>    use ARRAY_SIZE when iterating the parameter string array.
>
> Changes from v2:
>  * valid strings are now in an array, that we scan during
>    parameter parsing, as suggested during review.
>
> Changes from v1:
>  * fix bug in parameter parsing, and start using strcmp()
>    for that, as requested during review.
>
> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> index ca77e3b..0047f94 100644
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -469,6 +469,25 @@ combination with the `low_crashinfo` command line option.
>  ### credit2\_load\_window\_shift
>  > `= <integer>`
>  
> +### credit2\_runqueue
> +> `= core | socket | node | all`
> +
> +> Default: `socket`
> +
> +Specify how host CPUs are arranged in runqueues. Runqueues are kept
> +balanced with respect to the load generated by the vCPUs running on
> +them. Smaller runqueues (as in with `core`) means more accurate load
> +balancing (for instance, it will deal better with hyperthreading),
> +but also more overhead.
> +
> +Available alternatives, with their meaning, are:
> +* `core`: one runqueue per each physical core of the host;
> +* `socket`: one runqueue per each physical socket (which often,
> +  but not always, matches a NUMA node) of the host;
> +* `node`: one runqueue per each NUMA node of the host;
> +* `all`: just one runqueue shared by all the logical pCPUs of
> +  the host
> +
>  ### dbgp
>  > `= ehci[ <integer> | @pci<bus>:<slot>.<func> ]`
>  
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index a61a45a..d43f67a 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -81,10 +81,6 @@
>   * Credits are "reset" when the next vcpu in the runqueue is less than
>   * or equal to zero. At that point, everyone's credits are "clipped"
>   * to a small value, and a fixed credit is added to everyone.
> - *
> - * The plan is for all cores that share an L2 will share the same
> - * runqueue. At the moment, there is one global runqueue for all
> - * cores.
>   */
>  
>  /*
> @@ -193,6 +189,63 @@ static int __read_mostly opt_overload_balance_tolerance = -3;
>  integer_param("credit2_balance_over", opt_overload_balance_tolerance);
>  
>  /*
> + * Runqueue organization.
> + *
> + * The various cpus are to be assigned each one to a runqueue, and we
> + * want that to happen basing on topology. At the moment, it is possible
> + * to choose to arrange runqueues to be:
> + *
> + * - per-core: meaning that there will be one runqueue per each physical
> + *             core of the host. This will happen if the opt_runqueue
> + *             parameter is set to 'core';
> + *
> + * - per-socket: meaning that there will be one runqueue per each physical
> + *               socket (AKA package, which often, but not always, also
> + *               matches a NUMA node) of the host; This will happen if
> + *               the opt_runqueue parameter is set to 'socket';
> + *
> + * - per-node: meaning that there will be one runqueue per each physical
> + *             NUMA node of the host. This will happen if the opt_runqueue
> + *             parameter is set to 'node';
> + *
> + * - global: meaning that there will be only one runqueue to which all the
> + *           (logical) processors of the host belong. This will happen if
> + *           the opt_runqueue parameter is set to 'all'.
> + *
> + * Depending on the value of opt_runqueue, therefore, cpus that are part of
> + * either the same physical core, the same physical socket, the same NUMA
> + * node, or just all of them, will be put together to form runqueues.
> + */
> +#define OPT_RUNQUEUE_CORE   0
> +#define OPT_RUNQUEUE_SOCKET 1
> +#define OPT_RUNQUEUE_NODE   2
> +#define OPT_RUNQUEUE_ALL    3
> +static const char *const opt_runqueue_str[] = {
> +    [OPT_RUNQUEUE_CORE] = "core",
> +    [OPT_RUNQUEUE_SOCKET] = "socket",
> +    [OPT_RUNQUEUE_NODE] = "node",
> +    [OPT_RUNQUEUE_ALL] = "all"
> +};
> +static int __read_mostly opt_runqueue = OPT_RUNQUEUE_SOCKET;
> +
> +static void parse_credit2_runqueue(const char *s)
> +{
> +    unsigned int i;
> +
> +    for ( i = 0; i < ARRAY_SIZE(opt_runqueue_str); i++ )
> +    {
> +        if ( !strcmp(s, opt_runqueue_str[i]) )
> +        {
> +            opt_runqueue = i;
> +            return;
> +        }
> +    }
> +
> +    printk("WARNING, unrecognized value of credit2_runqueue option!\n");
> +}
> +custom_param("credit2_runqueue", parse_credit2_runqueue);
> +
> +/*
>   * Per-runqueue data
>   */
>  struct csched2_runqueue_data {
> @@ -1974,6 +2027,22 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
>      cpumask_clear_cpu(rqi, &prv->active_queues);
>  }
>  
> +static inline bool_t same_node(unsigned int cpua, unsigned int cpub)
> +{
> +    return cpu_to_node(cpua) == cpu_to_node(cpub);
> +}
> +
> +static inline bool_t same_socket(unsigned int cpua, unsigned int cpub)
> +{
> +    return cpu_to_socket(cpua) == cpu_to_socket(cpub);
> +}
> +
> +static inline bool_t same_core(unsigned int cpua, unsigned int cpub)
> +{
> +    return same_socket(cpua, cpub) &&
> +           cpu_to_core(cpua) == cpu_to_core(cpub);
> +}
> +
>  static unsigned int
>  cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
>  {
> @@ -2006,7 +2075,10 @@ cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
>          BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID ||
>                 cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID);
>  
> -        if ( cpu_to_socket(cpumask_first(&rqd->active)) == cpu_to_socket(cpu) )
> +        if ( opt_runqueue == OPT_RUNQUEUE_ALL ||
> +             (opt_runqueue == OPT_RUNQUEUE_CORE && same_core(peer_cpu, cpu)) ||
> +             (opt_runqueue == OPT_RUNQUEUE_SOCKET && same_socket(peer_cpu, cpu)) ||
> +             (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu)) )
>              break;
>      }
>  
> @@ -2170,6 +2242,7 @@ csched2_init(struct scheduler *ops)
>      printk(" load_window_shift: %d\n", opt_load_window_shift);
>      printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance);
>      printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance);
> +    printk(" runqueues arrangement: %s\n", opt_runqueue_str[opt_runqueue]);
>  
>      if ( opt_load_window_shift < LOADAVG_WINDOW_SHIFT_MIN )
>      {
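The commit message's warning that core (and thread) IDs repeat across sockets, which is why the patch's same_core() also compares sockets, can be illustrated with a small userspace mock. This is not Xen code; the topology tables below are invented purely for the example:

    /* Mock of the full-topology check: core IDs restart from 0 on each
     * socket, so comparing core IDs alone would wrongly group CPUs that
     * live on different sockets. */
    #include <stdio.h>

    /* Hypothetical 2-socket box, 2 cores per socket, 2 threads per core. */
    static const unsigned int socket_of[8] = { 0, 0, 0, 0, 1, 1, 1, 1 };
    static const unsigned int core_of[8]   = { 0, 0, 1, 1, 0, 0, 1, 1 };

    static int same_socket(unsigned int a, unsigned int b)
    {
        return socket_of[a] == socket_of[b];
    }

    /* Mirrors the patch's same_core(): the socket must match as well. */
    static int same_core(unsigned int a, unsigned int b)
    {
        return same_socket(a, b) && core_of[a] == core_of[b];
    }

    int main(void)
    {
        /* CPUs 0 and 4 share core ID 0 but sit on different sockets. */
        printf("naive core-ID match (0,4): %d\n", core_of[0] == core_of[4]); /* 1: wrong */
        printf("same_core(0,4): %d\n", same_core(0, 4));                     /* 0: right */
        printf("same_core(0,1): %d\n", same_core(0, 1));                     /* 1 */
        return 0;
    }

The same reasoning explains why same_core() in the patch is built on top of same_socket() rather than on a cpu_to_core() comparison alone.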
Dario Faggioli writes ("[Xen-devel] [PATCH v3 00/11] Fixes and improvement (including hard affinity!) for Credit2"): > Now it's only these two patches that need being Acked: > > 04/11 xen: sched: close potential races when switching scheduler to CPUs > 08/11 xen: sched: allow for choosing credit2 runqueues configuration at boo... Dario Faggioli writes ("Re: [Xen-devel] [PATCH v3 08/11] xen: sched: allow for choosing credit2 runqueues configuration at boot"): > A revised version of this patch is provided here (both inlined and > attached), and a branch with the remaining to be committed patches of > this series, and with this patch changed as you suggest, is available > at: > > git://xenbits.xen.org/people/dariof/xen.git rel/sched/credit2/fix-runq-and-haff-v4 > http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/rel/sched/credit2/fix-runq-and-haff-v4 Thanks for the convenient git branch. I double checked the acks (and the threads on-list), tidied up a few typos in commit messages, folded in the acks not yet included, and have pushed it to staging. Ian.
On Fri, Apr 08, 2016 at 04:13:56PM +0100, Ian Jackson wrote:
> Dario Faggioli writes ("[Xen-devel] [PATCH v3 00/11] Fixes and improvement (including hard affinity!) for Credit2"):
> > Now it's only these two patches that need being Acked:
> >
> >  04/11 xen: sched: close potential races when switching scheduler to CPUs
> >  08/11 xen: sched: allow for choosing credit2 runqueues configuration at boo...
>
> Dario Faggioli writes ("Re: [Xen-devel] [PATCH v3 08/11] xen: sched: allow for choosing credit2 runqueues configuration at boot"):
> > A revised version of this patch is provided here (both inlined and
> > attached), and a branch with the remaining to be committed patches of
> > this series, and with this patch changed as you suggest, is available
> > at:
> >
> >   git://xenbits.xen.org/people/dariof/xen.git rel/sched/credit2/fix-runq-and-haff-v4
> >   http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/rel/sched/credit2/fix-runq-and-haff-v4
>
> Thanks for the convenient git branch. I double checked the acks (and
> the threads on-list), tidied up a few typos in commit messages, folded
> in the acks not yet included, and have pushed it to staging.

Oh. Helps if you actually read the whole thread :-)

Thanks Ian!
commit 7f491488bbff1cc3af021cd29fca7e0fba321e02
Author: Dario Faggioli <dario.faggioli@citrix.com>
Date:   Tue Sep 29 14:05:09 2015 +0200

    xen: sched: allow for choosing credit2 runqueues configuration at boot

    In fact, credit2 uses CPU topology to decide how to arrange
    its internal runqueues. Before this change, only 'one runqueue
    per socket' was allowed. However, experiments have shown that,
    for instance, having one runqueue per physical core improves
    performance, especially in case hyperthreading is available.

    In general, it makes sense to allow users to pick one runqueue
    arrangement at boot time, so that:
     - more experiments can be easily performed to even better
       assess and improve performance;
     - one can select the best configuration for his specific
       use case and/or hardware.

    This patch enables the above.

    Note that, for correctly arranging runqueues to be per-core,
    just checking cpu_to_core() on the host CPUs is not enough.
    In fact, cores (and hyperthreads) on different sockets, can
    have the same core (and thread) IDs! We, therefore, need to
    check whether the full topology of two CPUs matches, for
    them to be put in the same runqueue.

    Note also that the default (although not functional) for
    credit2, up to now, has been per-socket runqueue. This patch
    leaves things that way, to avoid mixing policy and technical
    changes.

    Finally, it would be a nice feature to be able to select
    a particular runqueue arrangement, even when creating a
    Credit2 cpupool. This is left as future work.

    Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
    Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Uma Sharma <uma.sharma523@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
---
Changes from v3:
 * fix typos and other issues in comments;
   use ARRAY_SIZE when iterating the parameter string array.

Changes from v2:
 * valid strings are now in an array, that we scan during
   parameter parsing, as suggested during review.

Changes from v1:
 * fix bug in parameter parsing, and start using strcmp()
   for that, as requested during review.

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index ca77e3b..0047f94 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -469,6 +469,25 @@ combination with the `low_crashinfo` command line option.
 ### credit2\_load\_window\_shift
 > `= <integer>`
 
+### credit2\_runqueue
+> `= core | socket | node | all`
+
+> Default: `socket`
+
+Specify how host CPUs are arranged in runqueues. Runqueues are kept
+balanced with respect to the load generated by the vCPUs running on
+them. Smaller runqueues (as in with `core`) means more accurate load
+balancing (for instance, it will deal better with hyperthreading),
+but also more overhead.
+
+Available alternatives, with their meaning, are:
+* `core`: one runqueue per each physical core of the host;
+* `socket`: one runqueue per each physical socket (which often,
+  but not always, matches a NUMA node) of the host;
+* `node`: one runqueue per each NUMA node of the host;
+* `all`: just one runqueue shared by all the logical pCPUs of
+  the host
+
 ### dbgp
 > `= ehci[ <integer> | @pci<bus>:<slot>.<func> ]`
 
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index a61a45a..d43f67a 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -81,10 +81,6 @@
  * Credits are "reset" when the next vcpu in the runqueue is less than
  * or equal to zero. At that point, everyone's credits are "clipped"
  * to a small value, and a fixed credit is added to everyone.
- *
- * The plan is for all cores that share an L2 will share the same
- * runqueue. At the moment, there is one global runqueue for all
- * cores.
  */
 
 /*
@@ -193,6 +189,63 @@ static int __read_mostly opt_overload_balance_tolerance = -3;
 integer_param("credit2_balance_over", opt_overload_balance_tolerance);
 
 /*
+ * Runqueue organization.
+ *
+ * The various cpus are to be assigned each one to a runqueue, and we
+ * want that to happen basing on topology. At the moment, it is possible
+ * to choose to arrange runqueues to be:
+ *
+ * - per-core: meaning that there will be one runqueue per each physical
+ *             core of the host. This will happen if the opt_runqueue
+ *             parameter is set to 'core';
+ *
+ * - per-socket: meaning that there will be one runqueue per each physical
+ *               socket (AKA package, which often, but not always, also
+ *               matches a NUMA node) of the host; This will happen if
+ *               the opt_runqueue parameter is set to 'socket';
+ *
+ * - per-node: meaning that there will be one runqueue per each physical
+ *             NUMA node of the host. This will happen if the opt_runqueue
+ *             parameter is set to 'node';
+ *
+ * - global: meaning that there will be only one runqueue to which all the
+ *           (logical) processors of the host belong. This will happen if
+ *           the opt_runqueue parameter is set to 'all'.
+ *
+ * Depending on the value of opt_runqueue, therefore, cpus that are part of
+ * either the same physical core, the same physical socket, the same NUMA
+ * node, or just all of them, will be put together to form runqueues.
+ */
+#define OPT_RUNQUEUE_CORE   0
+#define OPT_RUNQUEUE_SOCKET 1
+#define OPT_RUNQUEUE_NODE   2
+#define OPT_RUNQUEUE_ALL    3
+static const char *const opt_runqueue_str[] = {
+    [OPT_RUNQUEUE_CORE] = "core",
+    [OPT_RUNQUEUE_SOCKET] = "socket",
+    [OPT_RUNQUEUE_NODE] = "node",
+    [OPT_RUNQUEUE_ALL] = "all"
+};
+static int __read_mostly opt_runqueue = OPT_RUNQUEUE_SOCKET;
+
+static void parse_credit2_runqueue(const char *s)
+{
+    unsigned int i;
+
+    for ( i = 0; i < ARRAY_SIZE(opt_runqueue_str); i++ )
+    {
+        if ( !strcmp(s, opt_runqueue_str[i]) )
+        {
+            opt_runqueue = i;
+            return;
+        }
+    }
+
+    printk("WARNING, unrecognized value of credit2_runqueue option!\n");
+}
+custom_param("credit2_runqueue", parse_credit2_runqueue);
+
+/*
  * Per-runqueue data
  */
 struct csched2_runqueue_data {
@@ -1974,6 +2027,22 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
     cpumask_clear_cpu(rqi, &prv->active_queues);
 }
 
+static inline bool_t same_node(unsigned int cpua, unsigned int cpub)
+{
+    return cpu_to_node(cpua) == cpu_to_node(cpub);
+}
+
+static inline bool_t same_socket(unsigned int cpua, unsigned int cpub)
+{
+    return cpu_to_socket(cpua) == cpu_to_socket(cpub);
+}
+
+static inline bool_t same_core(unsigned int cpua, unsigned int cpub)
+{
+    return same_socket(cpua, cpub) &&
+           cpu_to_core(cpua) == cpu_to_core(cpub);
+}
+
 static unsigned int
 cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
 {
@@ -2006,7 +2075,10 @@ cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
         BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID ||
                cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID);
 
-        if ( cpu_to_socket(cpumask_first(&rqd->active)) == cpu_to_socket(cpu) )
+        if ( opt_runqueue == OPT_RUNQUEUE_ALL ||
+             (opt_runqueue == OPT_RUNQUEUE_CORE && same_core(peer_cpu, cpu)) ||
+             (opt_runqueue == OPT_RUNQUEUE_SOCKET && same_socket(peer_cpu, cpu)) ||
+             (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu)) )
             break;
     }
 
@@ -2170,6 +2242,7 @@ csched2_init(struct scheduler *ops)
     printk(" load_window_shift: %d\n", opt_load_window_shift);
     printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance);
     printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance);
+    printk(" runqueues arrangement: %s\n", opt_runqueue_str[opt_runqueue]);
 
     if ( opt_load_window_shift < LOADAVG_WINDOW_SHIFT_MIN )
     {