Message ID | 20160408012420.10762.61178.stgit@Solace.fritz.box (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 08/04/16 03:24, Dario Faggioli wrote: > In fact, credit2 uses CPU topology to decide how to arrange > its internal runqueues. Before this change, only 'one runqueue > per socket' was allowed. However, experiments have shown that, > for instance, having one runqueue per physical core improves > performance, especially in case hyperthreading is available. > > In general, it makes sense to allow users to pick one runqueue > arrangement at boot time, so that: > - more experiments can be easily performed to even better > assess and improve performance; > - one can select the best configuration for his specific > use case and/or hardware. > > This patch enables the above. > > Note that, for correctly arranging runqueues to be per-core, > just checking cpu_to_core() on the host CPUs is not enough. > In fact, cores (and hyperthreads) on different sockets, can > have the same core (and thread) IDs! We, therefore, need to > check whether the full topology of two CPUs matches, for > them to be put in the same runqueue. > > Note also that the default (although not functional) for > credit2, since now, has been per-socket runqueue. This patch > leaves things that way, to avoid mixing policy and technical > changes. > > Finally, it would be a nice feature to be able to select > a particular runqueue arrangement, even when creating a > Credit2 cpupool. This is left as future work. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > Signed-off-by: Uma Sharma <uma.sharma523@gmail.com> Some nits below. > --- > Cc: George Dunlap <george.dunlap@eu.citrix.com> > Cc: Uma Sharma <uma.sharma523@gmail.com> > Cc: Juergen Gross <jgross@suse.com> > --- > Changes from v2: > * valid strings are now in an array, that we scan during > parameter parsing, as suggested during review. > > Cahnges from v1: > * fix bug in parameter parsing, and start using strcmp() > for that, as requested during review. > --- > docs/misc/xen-command-line.markdown | 19 ++++++++ > xen/common/sched_credit2.c | 83 +++++++++++++++++++++++++++++++++-- > 2 files changed, 97 insertions(+), 5 deletions(-) > > diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown > index ca77e3b..0047f94 100644 > --- a/docs/misc/xen-command-line.markdown > +++ b/docs/misc/xen-command-line.markdown > @@ -469,6 +469,25 @@ combination with the `low_crashinfo` command line option. > ### credit2\_load\_window\_shift > > `= <integer>` > > +### credit2\_runqueue > +> `= core | socket | node | all` > + > +> Default: `socket` > + > +Specify how host CPUs are arranged in runqueues. Runqueues are kept > +balanced with respect to the load generated by the vCPUs running on > +them. Smaller runqueues (as in with `core`) means more accurate load > +balancing (for instance, it will deal better with hyperthreading), > +but also more overhead. > + > +Available alternatives, with their meaning, are: > +* `core`: one runqueue per each physical core of the host; > +* `socket`: one runqueue per each physical socket (which often, > + but not always, matches a NUMA node) of the host; > +* `node`: one runqueue per each NUMA node of the host; > +* `all`: just one runqueue shared by all the logical pCPUs of > + the host > + > ### dbgp > > `= ehci[ <integer> | @pci<bus>:<slot>.<func> ]` > > diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c > index a61a45a..eeb3f54 100644 > --- a/xen/common/sched_credit2.c > +++ b/xen/common/sched_credit2.c > @@ -81,10 +81,6 @@ > * Credits are "reset" when the next vcpu in the runqueue is less than > * or equal to zero. At that point, everyone's credits are "clipped" > * to a small value, and a fixed credit is added to everyone. > - * > - * The plan is for all cores that share an L2 will share the same > - * runqueue. At the moment, there is one global runqueue for all > - * cores. > */ > > /* > @@ -193,6 +189,63 @@ static int __read_mostly opt_overload_balance_tolerance = -3; > integer_param("credit2_balance_over", opt_overload_balance_tolerance); > > /* > + * Runqueue organization. > + * > + * The various cpus are to be assigned each one to a runqueue, and we > + * want that to happen basing on topology. At the moment, it is possible > + * to choose to arrange runqueues to be: > + * > + * - per-core: meaning that there will be one runqueue per each physical > + * core of the host. This will happen if the opt_runqueue > + * parameter is set to 'core'; > + * > + * - per-node: meaning that there will be one runqueue per each physical > + * NUMA node of the host. This will happen if the opt_runqueue > + * parameter is set to 'node'; > + * > + * - per-socket: meaning that there will be one runqueue per each physical > + * socket (AKA package, which often, but not always, also > + * matches a NUMA node) of the host; This will happen if > + * the opt_runqueue parameter is set to 'socket'; > + * > + * - global: meaning that there will be only one runqueue to which all the > + * (logical) processors of the host belongs. This will happen if s/belongs/belong/ > + * the opt_runqueue parameter is set to 'all'. > + * > + * Depending on the value of opt_runqueue, therefore, cpus that are part of > + * either the same physical core, or of the same physical socket, will be > + * put together to form runqueues. numa? all? > + */ > +#define OPT_RUNQUEUE_CORE 0 > +#define OPT_RUNQUEUE_SOCKET 1 > +#define OPT_RUNQUEUE_NODE 2 > +#define OPT_RUNQUEUE_ALL 3 > +static const char *const opt_runqueue_str[] = { > + [OPT_RUNQUEUE_CORE] = "core", > + [OPT_RUNQUEUE_SOCKET] = "socket", > + [OPT_RUNQUEUE_NODE] = "node", > + [OPT_RUNQUEUE_ALL] = "all" > +}; > +static int __read_mostly opt_runqueue = OPT_RUNQUEUE_SOCKET; > + > +static void parse_credit2_runqueue(const char *s) > +{ > + unsigned int i; > + > + for ( i = 0; i <= OPT_RUNQUEUE_ALL; i++ ) I'd prefer: for ( i = 0; i < ARRAY_SIZE(opt_runqueue_str); i++ ) Juergen > + { > + if ( !strcmp(s, opt_runqueue_str[i]) ) > + { > + opt_runqueue = i; > + return; > + } > + } > + > + printk("WARNING, unrecognized value of credit2_runqueue option!\n"); > +} > +custom_param("credit2_runqueue", parse_credit2_runqueue); > + > +/* > * Per-runqueue data > */ > struct csched2_runqueue_data { > @@ -1974,6 +2027,22 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi) > cpumask_clear_cpu(rqi, &prv->active_queues); > } > > +static inline bool_t same_node(unsigned int cpua, unsigned int cpub) > +{ > + return cpu_to_node(cpua) == cpu_to_node(cpub); > +} > + > +static inline bool_t same_socket(unsigned int cpua, unsigned int cpub) > +{ > + return cpu_to_socket(cpua) == cpu_to_socket(cpub); > +} > + > +static inline bool_t same_core(unsigned int cpua, unsigned int cpub) > +{ > + return same_socket(cpua, cpub) && > + cpu_to_core(cpua) == cpu_to_core(cpub); > +} > + > static unsigned int > cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu) > { > @@ -2006,7 +2075,10 @@ cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu) > BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID || > cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID); > > - if ( cpu_to_socket(cpumask_first(&rqd->active)) == cpu_to_socket(cpu) ) > + if ( opt_runqueue == OPT_RUNQUEUE_ALL || > + (opt_runqueue == OPT_RUNQUEUE_CORE && same_core(peer_cpu, cpu)) || > + (opt_runqueue == OPT_RUNQUEUE_SOCKET && same_socket(peer_cpu, cpu)) || > + (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu)) ) > break; > } > > @@ -2170,6 +2242,7 @@ csched2_init(struct scheduler *ops) > printk(" load_window_shift: %d\n", opt_load_window_shift); > printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance); > printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance); > + printk(" runqueues arrangement: %s\n", opt_runqueue_str[opt_runqueue]); > > if ( opt_load_window_shift < LOADAVG_WINDOW_SHIFT_MIN ) > { > >
diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown index ca77e3b..0047f94 100644 --- a/docs/misc/xen-command-line.markdown +++ b/docs/misc/xen-command-line.markdown @@ -469,6 +469,25 @@ combination with the `low_crashinfo` command line option. ### credit2\_load\_window\_shift > `= <integer>` +### credit2\_runqueue +> `= core | socket | node | all` + +> Default: `socket` + +Specify how host CPUs are arranged in runqueues. Runqueues are kept +balanced with respect to the load generated by the vCPUs running on +them. Smaller runqueues (as in with `core`) means more accurate load +balancing (for instance, it will deal better with hyperthreading), +but also more overhead. + +Available alternatives, with their meaning, are: +* `core`: one runqueue per each physical core of the host; +* `socket`: one runqueue per each physical socket (which often, + but not always, matches a NUMA node) of the host; +* `node`: one runqueue per each NUMA node of the host; +* `all`: just one runqueue shared by all the logical pCPUs of + the host + ### dbgp > `= ehci[ <integer> | @pci<bus>:<slot>.<func> ]` diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c index a61a45a..eeb3f54 100644 --- a/xen/common/sched_credit2.c +++ b/xen/common/sched_credit2.c @@ -81,10 +81,6 @@ * Credits are "reset" when the next vcpu in the runqueue is less than * or equal to zero. At that point, everyone's credits are "clipped" * to a small value, and a fixed credit is added to everyone. - * - * The plan is for all cores that share an L2 will share the same - * runqueue. At the moment, there is one global runqueue for all - * cores. */ /* @@ -193,6 +189,63 @@ static int __read_mostly opt_overload_balance_tolerance = -3; integer_param("credit2_balance_over", opt_overload_balance_tolerance); /* + * Runqueue organization. + * + * The various cpus are to be assigned each one to a runqueue, and we + * want that to happen basing on topology. At the moment, it is possible + * to choose to arrange runqueues to be: + * + * - per-core: meaning that there will be one runqueue per each physical + * core of the host. This will happen if the opt_runqueue + * parameter is set to 'core'; + * + * - per-node: meaning that there will be one runqueue per each physical + * NUMA node of the host. This will happen if the opt_runqueue + * parameter is set to 'node'; + * + * - per-socket: meaning that there will be one runqueue per each physical + * socket (AKA package, which often, but not always, also + * matches a NUMA node) of the host; This will happen if + * the opt_runqueue parameter is set to 'socket'; + * + * - global: meaning that there will be only one runqueue to which all the + * (logical) processors of the host belongs. This will happen if + * the opt_runqueue parameter is set to 'all'. + * + * Depending on the value of opt_runqueue, therefore, cpus that are part of + * either the same physical core, or of the same physical socket, will be + * put together to form runqueues. + */ +#define OPT_RUNQUEUE_CORE 0 +#define OPT_RUNQUEUE_SOCKET 1 +#define OPT_RUNQUEUE_NODE 2 +#define OPT_RUNQUEUE_ALL 3 +static const char *const opt_runqueue_str[] = { + [OPT_RUNQUEUE_CORE] = "core", + [OPT_RUNQUEUE_SOCKET] = "socket", + [OPT_RUNQUEUE_NODE] = "node", + [OPT_RUNQUEUE_ALL] = "all" +}; +static int __read_mostly opt_runqueue = OPT_RUNQUEUE_SOCKET; + +static void parse_credit2_runqueue(const char *s) +{ + unsigned int i; + + for ( i = 0; i <= OPT_RUNQUEUE_ALL; i++ ) + { + if ( !strcmp(s, opt_runqueue_str[i]) ) + { + opt_runqueue = i; + return; + } + } + + printk("WARNING, unrecognized value of credit2_runqueue option!\n"); +} +custom_param("credit2_runqueue", parse_credit2_runqueue); + +/* * Per-runqueue data */ struct csched2_runqueue_data { @@ -1974,6 +2027,22 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi) cpumask_clear_cpu(rqi, &prv->active_queues); } +static inline bool_t same_node(unsigned int cpua, unsigned int cpub) +{ + return cpu_to_node(cpua) == cpu_to_node(cpub); +} + +static inline bool_t same_socket(unsigned int cpua, unsigned int cpub) +{ + return cpu_to_socket(cpua) == cpu_to_socket(cpub); +} + +static inline bool_t same_core(unsigned int cpua, unsigned int cpub) +{ + return same_socket(cpua, cpub) && + cpu_to_core(cpua) == cpu_to_core(cpub); +} + static unsigned int cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu) { @@ -2006,7 +2075,10 @@ cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu) BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID || cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID); - if ( cpu_to_socket(cpumask_first(&rqd->active)) == cpu_to_socket(cpu) ) + if ( opt_runqueue == OPT_RUNQUEUE_ALL || + (opt_runqueue == OPT_RUNQUEUE_CORE && same_core(peer_cpu, cpu)) || + (opt_runqueue == OPT_RUNQUEUE_SOCKET && same_socket(peer_cpu, cpu)) || + (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu)) ) break; } @@ -2170,6 +2242,7 @@ csched2_init(struct scheduler *ops) printk(" load_window_shift: %d\n", opt_load_window_shift); printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance); printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance); + printk(" runqueues arrangement: %s\n", opt_runqueue_str[opt_runqueue]); if ( opt_load_window_shift < LOADAVG_WINDOW_SHIFT_MIN ) {