[54/60] xen/sched: add minimalistic idle scheduler for free cpus
diff mbox series

Message ID 20190528103313.1343-55-jgross@suse.com
State New, archived
Headers show
Series
  • xen: add core scheduling support
Related show

Commit Message

Jürgen Groß May 28, 2019, 10:33 a.m. UTC
Instead of having a full blown scheduler running for the free cpus
add a very minimalistic scheduler for that purpose only ever scheduling
the related idle vcpu. This has the big advantage of not needing any
per-cpu, per-domain or per-scheduling unit data for free cpus and in
turn simplifying moving cpus to and from cpupools a lot.

As this new scheduler is not user selectable don't register it as an
official scheduler, but just include it in schedule.c.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V1: new patch
---
 xen/arch/arm/smpboot.c  |   2 -
 xen/arch/x86/smpboot.c  |   2 -
 xen/common/schedule.c   | 143 +++++++++++++++++++++++-------------------------
 xen/include/xen/sched.h |   1 -
 4 files changed, 67 insertions(+), 81 deletions(-)

Comments

Jan Beulich May 28, 2019, 11:47 a.m. UTC | #1
>>> On 28.05.19 at 12:33, <jgross@suse.com> wrote:
> Instead of having a full blown scheduler running for the free cpus
> add a very minimalistic scheduler for that purpose only ever scheduling
> the related idle vcpu. This has the big advantage of not needing any
> per-cpu, per-domain or per-scheduling unit data for free cpus and in
> turn simplifying moving cpus to and from cpupools a lot.

And the null scheduler is not minimalistic enough?

> --- a/xen/arch/arm/smpboot.c
> +++ b/xen/arch/arm/smpboot.c
> @@ -350,8 +350,6 @@ void start_secondary(unsigned long boot_phys_offset,
>  
>      setup_cpu_sibling_map(cpuid);
>  
> -    scheduler_percpu_init(cpuid);
> -
>      /* Run local notifiers */
>      notify_cpu_starting(cpuid);
>      /*
> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -382,8 +382,6 @@ void start_secondary(void *unused)
>  
>      set_cpu_sibling_map(cpu);
>  
> -    scheduler_percpu_init(cpu);
> -
>      init_percpu_time();
>  
>      setup_secondary_APIC_clock();

Seeing you revert here what an earlier patch in the series has added,
I don't suppose re-ordering could avoid this code churn?

Jan
Jürgen Groß May 28, 2019, 11:58 a.m. UTC | #2
On 28/05/2019 13:47, Jan Beulich wrote:
>>>> On 28.05.19 at 12:33, <jgross@suse.com> wrote:
>> Instead of having a full blown scheduler running for the free cpus
>> add a very minimalistic scheduler for that purpose only ever scheduling
>> the related idle vcpu. This has the big advantage of not needing any
>> per-cpu, per-domain or per-scheduling unit data for free cpus and in
>> turn simplifying moving cpus to and from cpupools a lot.
> 
> And the null scheduler is not minimalistic enough?

The main disadvantage of the null scheduler are the need for private
data (which has to be allocated/freed on scheduler switch), its not
yet perfect stability, and its "complexity" (e.g. 900 lines vs. 50).

> 
>> --- a/xen/arch/arm/smpboot.c
>> +++ b/xen/arch/arm/smpboot.c
>> @@ -350,8 +350,6 @@ void start_secondary(unsigned long boot_phys_offset,
>>  
>>      setup_cpu_sibling_map(cpuid);
>>  
>> -    scheduler_percpu_init(cpuid);
>> -
>>      /* Run local notifiers */
>>      notify_cpu_starting(cpuid);
>>      /*
>> --- a/xen/arch/x86/smpboot.c
>> +++ b/xen/arch/x86/smpboot.c
>> @@ -382,8 +382,6 @@ void start_secondary(void *unused)
>>  
>>      set_cpu_sibling_map(cpu);
>>  
>> -    scheduler_percpu_init(cpu);
>> -
>>      init_percpu_time();
>>  
>>      setup_secondary_APIC_clock();
> 
> Seeing you revert here what an earlier patch in the series has added,
> I don't suppose re-ordering could avoid this code churn?

This would require another major shuffling of patches. I'd like to avoid
that.


Juergen
Dario Faggioli May 31, 2019, 2:15 p.m. UTC | #3
On Tue, 2019-05-28 at 13:58 +0200, Juergen Gross wrote:
> On 28/05/2019 13:47, Jan Beulich wrote:
> > > > > On 28.05.19 at 12:33, <jgross@suse.com> wrote:
> > > Instead of having a full blown scheduler running for the free
> > > cpus
> > > add a very minimalistic scheduler for that purpose only ever
> > > scheduling
> > > the related idle vcpu. This has the big advantage of not needing
> > > any
> > > per-cpu, per-domain or per-scheduling unit data for free cpus and
> > > in
> > > turn simplifying moving cpus to and from cpupools a lot.
> > 
> > And the null scheduler is not minimalistic enough?
> 
> The main disadvantage of the null scheduler are the need for private
> data (which has to be allocated/freed on scheduler switch), its not
> yet perfect stability, and its "complexity" (e.g. 900 lines vs. 50).
> 
Yes, I absolutely agree with this approach of having an even simpler
idle scheduler, for the idle vcpus, and not selectable and usable by
the user for other things.

It would, actually, be rather useful in other ways and places, so I'm
actually rather happy to see it appearing (I started multiple times
doing something like this myself, but never finished :-/)

For instance, the fact that we put cpus that are not in any pool in the
default scheduler, was weird (to say the least) from a conceptual point
of view, forced us to do some extra checks (in the form of, e.g.,
cpumask() operations) to avoid actually scheduling vcpus on them and
was causing accounting issues in Credit1.

Now that the free cpus can go and stay in their own "idle scheduler",
those things can all be solved, in the (IMO) best and most clean way.

And, yes, it has to be as simple as possible, even simpler than null...
And I think we can easily see why that's the case, just by looking at
what code this patch let us remove (e.g., the need for some of the
checking of `system_state` in cpupool or scheduler code, just to
mention one).

Thanks and Regards
Dario Faggioli May 31, 2019, 3:52 p.m. UTC | #4
On Tue, 2019-05-28 at 12:33 +0200, Juergen Gross wrote:
> Instead of having a full blown scheduler running for the free cpus
> add a very minimalistic scheduler for that purpose only ever
> scheduling
> the related idle vcpu. This has the big advantage of not needing any
> per-cpu, per-domain or per-scheduling unit data for free cpus and in
> turn simplifying moving cpus to and from cpupools a lot.
> 
> As this new scheduler is not user selectable don't register it as an
> official scheduler, but just include it in schedule.c.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
> V1: new patch
> ---

> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -83,6 +83,57 @@ extern const struct scheduler
> *__start_schedulers_array[], *__end_schedulers_arr
>  
>  static struct scheduler __read_mostly ops;
>  
> [...]
>
> +static void *
> +sched_idle_alloc_vdata(const struct scheduler *ops, struct
> sched_unit *unit,
> +                       void *dd)
> +{
> +    /* Any non-NULL pointer is fine here. */
> +    return (void *)1UL;
> +}
> +
I think the proper thing to do, here, would be to convert alloc_vdata()
to PTR_ERR() & IS_ERR().

That's another patch, of course, and given that this series is rather
big already, I'm not sure about asking for it to be done in the context
of this work.

I can do it myself, either right now or after this series is merged
(for the sake of not making rebasing 60 patches more complicated than
it must be already, for you :-D).

Let me know what you think.

Regards
Jürgen Groß May 31, 2019, 4:44 p.m. UTC | #5
On 31/05/2019 17:52, Dario Faggioli wrote:
> On Tue, 2019-05-28 at 12:33 +0200, Juergen Gross wrote:
>> Instead of having a full blown scheduler running for the free cpus
>> add a very minimalistic scheduler for that purpose only ever
>> scheduling
>> the related idle vcpu. This has the big advantage of not needing any
>> per-cpu, per-domain or per-scheduling unit data for free cpus and in
>> turn simplifying moving cpus to and from cpupools a lot.
>>
>> As this new scheduler is not user selectable don't register it as an
>> official scheduler, but just include it in schedule.c.
>>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> ---
>> V1: new patch
>> ---
> 
>> --- a/xen/common/schedule.c
>> +++ b/xen/common/schedule.c
>> @@ -83,6 +83,57 @@ extern const struct scheduler
>> *__start_schedulers_array[], *__end_schedulers_arr
>>  
>>  static struct scheduler __read_mostly ops;
>>  
>> [...]
>>
>> +static void *
>> +sched_idle_alloc_vdata(const struct scheduler *ops, struct
>> sched_unit *unit,
>> +                       void *dd)
>> +{
>> +    /* Any non-NULL pointer is fine here. */
>> +    return (void *)1UL;
>> +}
>> +
> I think the proper thing to do, here, would be to convert alloc_vdata()
> to PTR_ERR() & IS_ERR().
> 
> That's another patch, of course, and given that this series is rather
> big already, I'm not sure about asking for it to be done in the context
> of this work.
> 
> I can do it myself, either right now or after this series is merged
> (for the sake of not making rebasing 60 patches more complicated than
> it must be already, for you :-D).
> 
> Let me know what you think.

There is only one place left where alloc_vdata is being called for the
idle scheduler, and that is sched_init_vcpu() of an idle vcpu.

I have planned to do a followup series cleaning up the scheduling stuff
where I can add a patch special casing that and removing
sched_idle_alloc_vdata().


Juergen

Patch
diff mbox series

diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c
index 9a6582f2a6..f756444362 100644
--- a/xen/arch/arm/smpboot.c
+++ b/xen/arch/arm/smpboot.c
@@ -350,8 +350,6 @@  void start_secondary(unsigned long boot_phys_offset,
 
     setup_cpu_sibling_map(cpuid);
 
-    scheduler_percpu_init(cpuid);
-
     /* Run local notifiers */
     notify_cpu_starting(cpuid);
     /*
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 7e95b2cdac..153bfbb4b7 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -382,8 +382,6 @@  void start_secondary(void *unused)
 
     set_cpu_sibling_map(cpu);
 
-    scheduler_percpu_init(cpu);
-
     init_percpu_time();
 
     setup_secondary_APIC_clock();
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 7a5ab4b1b6..d3e4ae226c 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -83,6 +83,57 @@  extern const struct scheduler *__start_schedulers_array[], *__end_schedulers_arr
 
 static struct scheduler __read_mostly ops;
 
+static spinlock_t *
+sched_idle_switch_sched(struct scheduler *new_ops, unsigned int cpu,
+                        void *pdata, void *vdata)
+{
+    sched_idle_unit(cpu)->priv = NULL;
+
+    return &sched_free_cpu_lock;
+}
+
+static struct sched_resource *
+sched_idle_res_pick(const struct scheduler *ops, struct sched_unit *unit)
+{
+    return unit->res;
+}
+
+static void *
+sched_idle_alloc_vdata(const struct scheduler *ops, struct sched_unit *unit,
+                       void *dd)
+{
+    /* Any non-NULL pointer is fine here. */
+    return (void *)1UL;
+}
+
+static void
+sched_idle_free_vdata(const struct scheduler *ops, void *priv)
+{
+}
+
+static void sched_idle_schedule(
+    const struct scheduler *ops, struct sched_unit *unit, s_time_t now,
+    bool tasklet_work_scheduled)
+{
+    const unsigned int cpu = smp_processor_id();
+
+    unit->next_time = -1;
+    unit->next_task = sched_idle_unit(sched_get_resource_cpu(cpu));
+}
+
+static struct scheduler sched_idle_ops = {
+    .name           = "Idle Scheduler",
+    .opt_name       = "idle",
+    .sched_data     = NULL,
+
+    .pick_resource  = sched_idle_res_pick,
+    .do_schedule    = sched_idle_schedule,
+
+    .alloc_vdata    = sched_idle_alloc_vdata,
+    .free_vdata     = sched_idle_free_vdata,
+    .switch_sched   = sched_idle_switch_sched,
+};
+
 static inline struct vcpu *unit2vcpu_cpu(struct sched_unit *unit,
                                          unsigned int cpu)
 {
@@ -2141,7 +2192,6 @@  static void poll_timer_fn(void *data)
 static int cpu_schedule_up(unsigned int cpu)
 {
     struct sched_resource *sd;
-    void *sched_priv;
 
     sd = xzalloc(struct sched_resource);
     if ( sd == NULL )
@@ -2150,7 +2200,7 @@  static int cpu_schedule_up(unsigned int cpu)
     sd->cpus = cpumask_of(cpu);
     set_sched_res(cpu, sd);
 
-    sd->scheduler = &ops;
+    sd->scheduler = &sched_idle_ops;
     spin_lock_init(&sd->_lock);
     sd->schedule_lock = &sched_free_cpu_lock;
     init_timer(&sd->s_timer, s_timer_fn, NULL, cpu);
@@ -2171,20 +2221,10 @@  static int cpu_schedule_up(unsigned int cpu)
         struct sched_unit *unit = idle->sched_unit;
 
         /*
-         * During (ACPI?) suspend the idle vCPU for this pCPU is not freed,
-         * while its scheduler specific data (what is pointed by sched_priv)
-         * is. Also, at this stage of the resume path, we attach the pCPU
-         * to the default scheduler, no matter in what cpupool it was before
-         * suspend. To avoid inconsistency, let's allocate default scheduler
-         * data for the idle vCPU here. If the pCPU was in a different pool
-         * with a different scheduler, it is schedule_cpu_switch(), invoked
-         * later, that will set things up as appropriate.
+         * No need to allocate any scheduler data, as cpus coming online are
+         * free initially and the idle scheduler doesn't need any data areas
+         * allocated.
          */
-        ASSERT(unit->priv == NULL);
-
-        unit->priv = sched_alloc_vdata(&ops, unit, idle->domain->sched_priv);
-        if ( unit->priv == NULL )
-            return -ENOMEM;
 
         /* Update the resource pointer in the idle unit. */
         unit->res = sd;
@@ -2195,16 +2235,7 @@  static int cpu_schedule_up(unsigned int cpu)
     sd->curr = idle_vcpu[cpu]->sched_unit;
     sd->sched_unit_idle = idle_vcpu[cpu]->sched_unit;
 
-    /*
-     * We don't want to risk calling xfree() on an sd->sched_priv
-     * (e.g., inside free_pdata, from cpu_schedule_down() called
-     * during CPU_UP_CANCELLED) that contains an IS_ERR value.
-     */
-    sched_priv = sched_alloc_pdata(&ops, cpu);
-    if ( IS_ERR(sched_priv) )
-        return PTR_ERR(sched_priv);
-
-    sd->sched_priv = sched_priv;
+    sd->sched_priv = NULL;
 
     return 0;
 }
@@ -2212,13 +2243,6 @@  static int cpu_schedule_up(unsigned int cpu)
 static void cpu_schedule_down(unsigned int cpu)
 {
     struct sched_resource *sd = get_sched_res(cpu);
-    struct scheduler *sched = sd->scheduler;
-
-    sched_free_pdata(sched, sd->sched_priv, cpu);
-    sched_free_vdata(sched, idle_vcpu[cpu]->sched_unit->priv);
-
-    idle_vcpu[cpu]->sched_unit->priv = NULL;
-    sd->sched_priv = NULL;
 
     kill_timer(&sd->s_timer);
 
@@ -2226,26 +2250,14 @@  static void cpu_schedule_down(unsigned int cpu)
     xfree(sd);
 }
 
-void scheduler_percpu_init(unsigned int cpu)
-{
-    struct sched_resource *sd = get_sched_res(cpu);
-    struct scheduler *sched = sd->scheduler;
-
-    if ( system_state != SYS_STATE_resume )
-        sched_init_pdata(sched, sd->sched_priv, cpu);
-}
-
 void sched_rm_cpu(unsigned int cpu)
 {
     int rc;
-    struct sched_resource *sd = get_sched_res(cpu);
-    struct scheduler *sched = sd->scheduler;
 
     rcu_read_lock(&domlist_read_lock);
     rc = cpu_disable_scheduler(cpu);
     BUG_ON(rc);
     rcu_read_unlock(&domlist_read_lock);
-    sched_deinit_pdata(sched, sd->sched_priv, cpu);
     cpu_schedule_down(cpu);
 }
 
@@ -2260,32 +2272,22 @@  static int cpu_schedule_callback(
      * allocating and initializing the per-pCPU scheduler specific data,
      * as well as "registering" this pCPU to the scheduler (which may
      * involve modifying some scheduler wide data structures).
-     * This happens by calling the alloc_pdata and init_pdata hooks, in
-     * this order. A scheduler that does not need to allocate any per-pCPU
-     * data can avoid implementing alloc_pdata. init_pdata may, however, be
-     * necessary/useful in this case too (e.g., it can contain the "register
-     * the pCPU to the scheduler" part). alloc_pdata (if present) is called
-     * during CPU_UP_PREPARE. init_pdata (if present) is called before
-     * CPU_STARTING in scheduler_percpu_init().
+     * As new pCPUs always start as "free" cpus with the minimal idle
+     * scheduler being in charge, we don't need any of that.
      *
      * On the other hand, at teardown, we need to reverse what has been done
-     * during initialization, and then free the per-pCPU specific data. This
-     * happens by calling the deinit_pdata and free_pdata hooks, in this
+     * during initialization, and then free the per-pCPU specific data. A
+     * pCPU brought down is not forced through "free" cpus, so here we need to
+     * use the appropriate hooks.
+     *
+     * This happens by calling the deinit_pdata and free_pdata hooks, in this
      * order. If no per-pCPU memory was allocated, there is no need to
      * provide an implementation of free_pdata. deinit_pdata may, however,
      * be necessary/useful in this case too (e.g., it can undo something done
      * on scheduler wide data structure during init_pdata). Both deinit_pdata
      * and free_pdata are called during CPU_DEAD.
      *
-     * If someting goes wrong during bringup, we go to CPU_UP_CANCELLED
-     * *before* having called init_pdata. In this case, as there is no
-     * initialization needing undoing, only free_pdata should be called.
-     * This means it is possible to call free_pdata just after alloc_pdata,
-     * without a init_pdata/deinit_pdata "cycle" in between the two.
-     *
-     * So, in summary, the usage pattern should look either
-     *  - alloc_pdata-->init_pdata-->deinit_pdata-->free_pdata, or
-     *  - alloc_pdata-->free_pdata.
+     * If someting goes wrong during bringup, we go to CPU_UP_CANCELLED.
      */
     switch ( action )
     {
@@ -2402,9 +2404,6 @@  void __init scheduler_init(void)
         BUG();
     get_sched_res(0)->curr = idle_vcpu[0]->sched_unit;
     get_sched_res(0)->sched_unit_idle = idle_vcpu[0]->sched_unit;
-    get_sched_res(0)->sched_priv = sched_alloc_pdata(&ops, 0);
-    BUG_ON(IS_ERR(get_sched_res(0)->sched_priv));
-    scheduler_percpu_init(0);
 }
 
 /*
@@ -2412,18 +2411,14 @@  void __init scheduler_init(void)
  * cpupool, or subject it to the scheduler of a new cpupool.
  *
  * For the pCPUs that are removed from their cpupool, their scheduler becomes
- * &ops (the default scheduler, selected at boot, which also services the
- * default cpupool). However, as these pCPUs are not really part of any pool,
- * there won't be any scheduling event on them, not even from the default
- * scheduler. Basically, they will just sit idle until they are explicitly
- * added back to a cpupool.
+ * &sched_idle_ops (the idle scheduler).
  */
 int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
 {
     struct vcpu *idle;
     void *ppriv, *ppriv_old, *vpriv, *vpriv_old;
     struct scheduler *old_ops = get_sched_res(cpu)->scheduler;
-    struct scheduler *new_ops = (c == NULL) ? &ops : c->sched;
+    struct scheduler *new_ops = (c == NULL) ? &sched_idle_ops : c->sched;
     struct sched_resource *sd = get_sched_res(cpu);
     struct cpupool *old_pool = sd->cpupool;
     spinlock_t *old_lock, *new_lock;
@@ -2443,9 +2438,6 @@  int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
     ASSERT((c == NULL && !cpumask_test_cpu(cpu, old_pool->cpu_valid)) ||
            (c != NULL && !cpumask_test_cpu(cpu, c->cpu_valid)));
 
-    if ( old_ops == new_ops )
-        goto out;
-
     /*
      * To setup the cpu for the new scheduler we need:
      *  - a valid instance of per-CPU scheduler specific data, as it is
@@ -2498,7 +2490,7 @@  int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
      * taking it, finds all the initializations we've done above in place.
      */
     smp_mb();
-    sd->schedule_lock = c ? new_lock : &sched_free_cpu_lock;
+    sd->schedule_lock = new_lock;
 
     /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
     spin_unlock_irqrestore(old_lock, flags);
@@ -2510,7 +2502,6 @@  int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
     sched_free_vdata(old_ops, vpriv_old);
     sched_free_pdata(old_ops, ppriv_old, cpu);
 
- out:
     get_sched_res(cpu)->granularity = c ? c->granularity : 1;
     get_sched_res(cpu)->cpupool = c;
     /* When a cpu is added to a pool, trigger it to go pick up some work */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 7dc63c449b..e689bba361 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -677,7 +677,6 @@  void __domain_crash(struct domain *d);
 void noreturn asm_domain_crash_synchronous(unsigned long addr);
 
 void scheduler_init(void);
-void scheduler_percpu_init(unsigned int cpu);
 int  sched_init_vcpu(struct vcpu *v);
 void sched_destroy_vcpu(struct vcpu *v);
 int  sched_init_domain(struct domain *d, int poolid);