[v3,40/47] xen/sched: prepare per-cpupool scheduling granularity

Message ID 20190914085251.18816-41-jgross@suse.com (mailing list archive)
State Superseded
Series xen: add core scheduling support

Commit Message

Jürgen Groß Sept. 14, 2019, 8:52 a.m. UTC
On- and offlining cpus with core scheduling is rather complicated, as
the cpus are taken on- or offline one by one, while the scheduler would
rather handle them per core.

As the future plan is to be able to select the scheduling granularity
per cpupool, prepare for that by storing the granularity in struct
cpupool and struct sched_resource (the latter is needed for free cpus,
which are not associated with any cpupool). Free cpus will always use
granularity 1.

Store the selected granularity option (cpu, core or socket) in the
cpupool as well, as we will need it to select the appropriate cpu mask
when populating the cpupool with cpus.

This will make on- and offlining of cpus much easier and avoids
writing code which would need to be thrown away later.
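
(Editorial illustration, not part of the patch: with the option stored
in the cpupool, populating it with cpus can pick the matching topology
mask. The helper name below is made up, and the mapping of levels to
Xen's topology masks is only a sketch of the idea described above.)

    /* Sketch: which cpus would have to enter/leave a pool together. */
    static const cpumask_t *pool_gran_cpumask(const struct cpupool *c,
                                              unsigned int cpu)
    {
        switch ( c->opt_granularity )
        {
        case SCHED_GRAN_core:
            return per_cpu(cpu_sibling_mask, cpu); /* threads of one core */
        case SCHED_GRAN_socket:
            return per_cpu(cpu_core_mask, cpu);    /* cpus of one socket */
        case SCHED_GRAN_cpu:
        default:
            return cpumask_of(cpu);
        }
    }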

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V1: new patch
---
 xen/common/cpupool.c       |  2 ++
 xen/common/schedule.c      | 27 +++++++++++++++++----------
 xen/include/xen/sched-if.h | 12 ++++++++++++
 3 files changed, 31 insertions(+), 10 deletions(-)

Comments

Jan Beulich Sept. 24, 2019, 1:34 p.m. UTC | #1
On 14.09.2019 10:52, Juergen Gross wrote:
> --- a/xen/common/cpupool.c
> +++ b/xen/common/cpupool.c
> @@ -175,6 +175,8 @@ static struct cpupool *cpupool_create(
>              return NULL;
>          }
>      }
> +    c->granularity = sched_granularity;
> +    c->opt_granularity = opt_sched_granularity;
>  
>      *q = c;
>  
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index e5b7678dc0..b3c1aa0821 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -56,7 +56,8 @@ int sched_ratelimit_us = SCHED_DEFAULT_RATELIMIT_US;
>  integer_param("sched_ratelimit_us", sched_ratelimit_us);
>  
>  /* Number of vcpus per struct sched_unit. */
> -static unsigned int __read_mostly sched_granularity = 1;
> +enum sched_gran __read_mostly opt_sched_granularity = SCHED_GRAN_cpu;
> +unsigned int __read_mostly sched_granularity = 1;

Seeing the replacements you do further down, are these variables
needed at all anymore outside of cpupool.c? If not, they should
probably move there, and remain / become static?
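
(For illustration, the suggestion amounts to something like the
following in cpupool.c, with schedule.c then reading the values via the
cpupool / sched_resource fields instead:)

    /* Sketch: cpupool.c, if the variables are only set and used there. */
    static enum sched_gran __read_mostly opt_sched_granularity = SCHED_GRAN_cpu;
    static unsigned int __read_mostly sched_granularity = 1;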

> --- a/xen/include/xen/sched-if.h
> +++ b/xen/include/xen/sched-if.h
> @@ -25,6 +25,15 @@ extern int sched_ratelimit_us;
>  /* Scheduling resource mask. */
>  extern const cpumask_t *sched_res_mask;
>  
> +/* Number of vcpus per struct sched_unit. */
> +enum sched_gran {
> +    SCHED_GRAN_cpu,
> +    SCHED_GRAN_core,
> +    SCHED_GRAN_socket
> +};

Seeing the almost absurd arrangement on my AMD Fam17 system (128 CPUs
per Credit2 runqueue, for a total of two runqueues) I really wonder
whether there shouldn't be a plan for a further intermediate
granularity between "core" and "socket". The other day I did notice
Linux has gained the concept of "die", which would bring the
arrangement to a more reasonable 8 runqueues of 32 CPUs each on this
system. (I'm taking Credit2 as a reference here only.)
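
(A possible extension along those lines would simply add another
enumerator, e.g. the sketch below; a "die" level is not part of this
series.)

    /* Sketch only: hypothetical intermediate granularity level. */
    enum sched_gran {
        SCHED_GRAN_cpu,
        SCHED_GRAN_core,
        SCHED_GRAN_die,     /* mirroring Linux's "die" topology level */
        SCHED_GRAN_socket
    };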

> @@ -532,6 +542,8 @@ struct cpupool
>      struct cpupool   *next;
>      struct scheduler *sched;
>      atomic_t         refcnt;
> +    unsigned int     granularity;
> +    enum sched_gran  opt_granularity;

I'd like to suggest to avoid introducing opt_* identifiers not
directly driven by command line options. That'll just end up
confusing people. I have to admit I'm having trouble coming up with
good names for both fields, so I'd like to ask whether you really
need both - isn't what's called "granularity" above a function of
"opt_granularity"? Or alternatively couldn't what's named
"granularity" now be always taken from struct sched_resource? I
take it that within a pool all struct sched_resource instances
would have the same numeric granularity value. And I further take
it that struct sched_resource instances won't freely move between
cpupools (and hence could e.g. be associated with their pool on
linked list, the head of which lives in struct cpupool).

Jan
Jürgen Groß Sept. 25, 2019, 1:31 p.m. UTC | #2
On 24.09.19 15:34, Jan Beulich wrote:
> On 14.09.2019 10:52, Juergen Gross wrote:
>> --- a/xen/common/cpupool.c
>> +++ b/xen/common/cpupool.c
>> @@ -175,6 +175,8 @@ static struct cpupool *cpupool_create(
>>               return NULL;
>>           }
>>       }
>> +    c->granularity = sched_granularity;
>> +    c->opt_granularity = opt_sched_granularity;
>>   
>>       *q = c;
>>   
>> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
>> index e5b7678dc0..b3c1aa0821 100644
>> --- a/xen/common/schedule.c
>> +++ b/xen/common/schedule.c
>> @@ -56,7 +56,8 @@ int sched_ratelimit_us = SCHED_DEFAULT_RATELIMIT_US;
>>   integer_param("sched_ratelimit_us", sched_ratelimit_us);
>>   
>>   /* Number of vcpus per struct sched_unit. */
>> -static unsigned int __read_mostly sched_granularity = 1;
>> +enum sched_gran __read_mostly opt_sched_granularity = SCHED_GRAN_cpu;
>> +unsigned int __read_mostly sched_granularity = 1;
> 
> Seeing the replacements you do further down, are these variables
> needed at all anymore outside of cpupool.c? If not, they should
> probably move there, and remain / become static?

Hmm, good idea.

> 
>> --- a/xen/include/xen/sched-if.h
>> +++ b/xen/include/xen/sched-if.h
>> @@ -25,6 +25,15 @@ extern int sched_ratelimit_us;
>>   /* Scheduling resource mask. */
>>   extern const cpumask_t *sched_res_mask;
>>   
>> +/* Number of vcpus per struct sched_unit. */
>> +enum sched_gran {
>> +    SCHED_GRAN_cpu,
>> +    SCHED_GRAN_core,
>> +    SCHED_GRAN_socket
>> +};
> 
> Seeing the almost absurd arrangement on my AMD Fam17 system (128 CPUs
> per Credit2 runqueue, for a total of two runqueues) I really wonder
> whether there shouldn't be a plan for a further intermediate
> granularity between "core" and "socket". The other day I did notice
> Linux has gained the concept of "die", which would bring the
> arrangement to a more reasonable 8 runqueues of 32 CPUs each on this
> system. (I'm taking Credit2 as a reference here only.)

Okay, another item for "scheduler cleanup" I guess.

> 
>> @@ -532,6 +542,8 @@ struct cpupool
>>       struct cpupool   *next;
>>       struct scheduler *sched;
>>       atomic_t         refcnt;
>> +    unsigned int     granularity;
>> +    enum sched_gran  opt_granularity;
> 
> I'd like to suggest to avoid introducing opt_* identifiers not
> directly driven by command line options. That'll just end up
> confusing people. I have to admit I'm having trouble coming up with
> good names for both fields, so I'd like to ask whether you really
> need both - isn't what's called "granularity" above a function of
> "opt_granularity"?

Only indirectly. I need opt_granularity for selecting the correct
cpumask (cpumask_of(), cpu_sibling_mask(), ...). granularity is the
numerical value which I don't want to calculate each time I need it.

> Or alternatively couldn't what's named
> "granularity" now be always taken from struct sched_resource? I
> take it that within a pool all struct sched_resource instances
> would have the same numeric granularity value. And I further take
> it that struct sched_resource instances won't freely move between
> cpupools (and hence could e.g. be associated with their pool on
> linked list, the head of which lives in struct cpupool).

I think I wouldn't need such a linked list. All cases where I need
cpupool->granularity are not performance critical, so I could easily
calculate it from cpupool->opt_granularity or by using
cpupool->res_valid for finding a sched_resource of that cpupool.

I'll rename cpupool->opt_granularity to gran and drop
cpupool->granularity.
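
(A sketch of what such a calculation could look like; the helper name
is illustrative, and it relies on cpupool->res_valid as mentioned
above:)

    /* Sketch: numeric granularity derived from one resource of the pool. */
    static unsigned int cpupool_get_granularity(const struct cpupool *c)
    {
        unsigned int cpu = cpumask_first(c->res_valid);

        return cpumask_weight(get_sched_res(cpu)->cpus);
    }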


Juergen
Dario Faggioli Sept. 26, 2019, 9:41 a.m. UTC | #3
On Tue, 2019-09-24 at 15:34 +0200, Jan Beulich wrote:
> On 14.09.2019 10:52, Juergen Gross wrote:
> > --- a/xen/include/xen/sched-if.h
> > +++ b/xen/include/xen/sched-if.h
> > @@ -25,6 +25,15 @@ extern int sched_ratelimit_us;
> >  /* Scheduling resource mask. */
> >  extern const cpumask_t *sched_res_mask;
> >  
> > +/* Number of vcpus per struct sched_unit. */
> > +enum sched_gran {
> > +    SCHED_GRAN_cpu,
> > +    SCHED_GRAN_core,
> > +    SCHED_GRAN_socket
> > +};
> 
> Seeing the almost absurd arrangement on my AMD Fam17 system (128 CPUs
> per Credit2 runqueue, for a total of two runqueues) I really wonder
> whether there shouldn't be a plan for a further intermediate
> granularity between "core" and "socket". The other day I did notice
> Linux has gained the concept of "die", which would bring the
> arrangement to a more reasonable 8 runqueues of 32 CPUs each on this
> system. (I'm taking Credit2 as a reference here only.)
> 
The default Credit2 setup on such a system does indeed make no sense.

Introducing DIE (or whatever we want to call it) granularity for
Credit2 runqueues, and, in general, doing something more clever when
deciding how many CPUs should end up in the same runqueue is definitely
something we want.

Actually, there are patches already for specifying, at boot time, a
totally arbitrary runqueue arrangement.... I just need to fish them
from the list, rebase and resubmit. This does not cover the case above,
as we will still need a more sensible default, but it goes in the right
direction, I think.

That's, however, something which, although quite important for Credit2,
is less of a concern for core-scheduling. In fact, as said elsewhere
already, I really don't expect anyone to want to use either die-
scheduling, socket-scheduling or anything different than core-
scheduling anytime soon.

I'll look into the Credit2 runqueue issue (for 4.14).

Regards

Patch

diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index e0333a8417..c7d8a748d4 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -175,6 +175,8 @@  static struct cpupool *cpupool_create(
             return NULL;
         }
     }
+    c->granularity = sched_granularity;
+    c->opt_granularity = opt_sched_granularity;
 
     *q = c;
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index e5b7678dc0..b3c1aa0821 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -56,7 +56,8 @@  int sched_ratelimit_us = SCHED_DEFAULT_RATELIMIT_US;
 integer_param("sched_ratelimit_us", sched_ratelimit_us);
 
 /* Number of vcpus per struct sched_unit. */
-static unsigned int __read_mostly sched_granularity = 1;
+enum sched_gran __read_mostly opt_sched_granularity = SCHED_GRAN_cpu;
+unsigned int __read_mostly sched_granularity = 1;
 bool __read_mostly sched_disable_smt_switching;
 const cpumask_t *sched_res_mask = &cpumask_all;
 
@@ -412,10 +413,10 @@  static struct sched_unit *sched_alloc_unit(struct vcpu *v)
 {
     struct sched_unit *unit, **prev_unit;
     struct domain *d = v->domain;
+    unsigned int gran = d->cpupool ? d->cpupool->granularity : 1;
 
     for_each_sched_unit ( d, unit )
-        if ( unit->vcpu_list->vcpu_id / sched_granularity ==
-             v->vcpu_id / sched_granularity )
+        if ( unit->vcpu_list->vcpu_id / gran == v->vcpu_id / gran )
             break;
 
     if ( unit )
@@ -582,7 +583,7 @@  int sched_move_domain(struct domain *d, struct cpupool *c)
         return PTR_ERR(domdata);
 
     unit_priv = xzalloc_array(void *,
-                              DIV_ROUND_UP(d->max_vcpus, sched_granularity));
+                              DIV_ROUND_UP(d->max_vcpus, c->granularity));
     if ( unit_priv == NULL )
     {
         sched_free_domdata(c->sched, domdata);
@@ -1825,11 +1826,11 @@  static void sched_switch_units(struct sched_resource *sd,
         if ( is_idle_unit(prev) )
         {
             prev->runstate_cnt[RUNSTATE_running] = 0;
-            prev->runstate_cnt[RUNSTATE_runnable] = sched_granularity;
+            prev->runstate_cnt[RUNSTATE_runnable] = sd->granularity;
         }
         if ( is_idle_unit(next) )
         {
-            next->runstate_cnt[RUNSTATE_running] = sched_granularity;
+            next->runstate_cnt[RUNSTATE_running] = sd->granularity;
             next->runstate_cnt[RUNSTATE_runnable] = 0;
         }
     }
@@ -1978,7 +1979,7 @@  void sched_context_switched(struct vcpu *vprev, struct vcpu *vnext)
     else
     {
         vcpu_context_saved(vprev, vnext);
-        if ( sched_granularity == 1 )
+        if ( sd->granularity == 1 )
             unit_context_saved(sd);
     }
 
@@ -2089,11 +2090,12 @@  static struct sched_unit *sched_wait_rendezvous_in(struct sched_unit *prev,
 {
     struct sched_unit *next;
     struct vcpu *v;
+    unsigned int gran = get_sched_res(cpu)->granularity;
 
     if ( !--prev->rendezvous_in_cnt )
     {
         next = do_schedule(prev, now, cpu);
-        atomic_set(&next->rendezvous_out_cnt, sched_granularity + 1);
+        atomic_set(&next->rendezvous_out_cnt, gran + 1);
         return next;
     }
 
@@ -2213,6 +2215,7 @@  static void schedule(void)
     struct sched_resource *sd;
     spinlock_t           *lock;
     int cpu = smp_processor_id();
+    unsigned int          gran = get_sched_res(cpu)->granularity;
 
     ASSERT_NOT_IN_ATOMIC();
 
@@ -2238,11 +2241,11 @@  static void schedule(void)
 
     stop_timer(&sd->s_timer);
 
-    if ( sched_granularity > 1 )
+    if ( gran > 1 )
     {
         cpumask_t mask;
 
-        prev->rendezvous_in_cnt = sched_granularity;
+        prev->rendezvous_in_cnt = gran;
         cpumask_andnot(&mask, sd->cpus, cpumask_of(cpu));
         cpumask_raise_softirq(&mask, SCHED_SLAVE_SOFTIRQ);
         next = sched_wait_rendezvous_in(prev, &lock, cpu, now);
@@ -2308,6 +2311,9 @@  static int cpu_schedule_up(unsigned int cpu)
     init_timer(&sd->s_timer, s_timer_fn, NULL, cpu);
     atomic_set(&per_cpu(sched_urgent_count, cpu), 0);
 
+    /* We start with cpu granularity. */
+    sd->granularity = 1;
+
     /* Boot CPU is dealt with later in scheduler_init(). */
     if ( cpu == 0 )
         return 0;
@@ -2598,6 +2604,7 @@  int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
     sched_free_vdata(old_ops, vpriv_old);
     sched_free_pdata(old_ops, ppriv_old, cpu);
 
+    get_sched_res(cpu)->granularity = c ? c->granularity : 1;
     get_sched_res(cpu)->cpupool = c;
     /* When a cpu is added to a pool, trigger it to go pick up some work */
     if ( c != NULL )
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 5625cafb6e..cb58bad0ff 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -25,6 +25,15 @@  extern int sched_ratelimit_us;
 /* Scheduling resource mask. */
 extern const cpumask_t *sched_res_mask;
 
+/* Number of vcpus per struct sched_unit. */
+enum sched_gran {
+    SCHED_GRAN_cpu,
+    SCHED_GRAN_core,
+    SCHED_GRAN_socket
+};
+extern enum sched_gran opt_sched_granularity;
+extern unsigned int sched_granularity;
+
 /*
  * In order to allow a scheduler to remap the lock->cpu mapping,
  * we have a per-cpu pointer, along with a pre-allocated set of
@@ -48,6 +57,7 @@  struct sched_resource {
 
     /* Cpu with lowest id in scheduling resource. */
     unsigned int        master_cpu;
+    unsigned int        granularity;
     const cpumask_t    *cpus;           /* cpus covered by this struct     */
 };
 
@@ -532,6 +542,8 @@  struct cpupool
     struct cpupool   *next;
     struct scheduler *sched;
     atomic_t         refcnt;
+    unsigned int     granularity;
+    enum sched_gran  opt_granularity;
 };
 
 #define cpupool_online_cpumask(_pool) \