[03/60] xen/sched: let sched_switch_sched() return new lock address

Message ID: 20190528103313.1343-4-jgross@suse.com
State: New, archived
Series: xen: add core scheduling support

Commit Message

Jürgen Groß May 28, 2019, 10:32 a.m. UTC
Instead of setting the scheduler percpu lock address in each of the
switch_sched instances of the different schedulers, do that in
schedule_cpu_switch(), which is the single caller of that function.
For that purpose, let sched_switch_sched() just return the new lock
address.

This allows setting the new struct scheduler and struct schedule_data
values in the percpu area in schedule_cpu_switch() instead of in the
schedulers, too.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/common/sched_arinc653.c | 14 ++------------
 xen/common/sched_credit.c   | 13 ++-----------
 xen/common/sched_credit2.c  | 15 +++------------
 xen/common/sched_null.c     | 16 ++++------------
 xen/common/sched_rt.c       | 12 ++----------
 xen/common/schedule.c       | 18 +++++++++++++++---
 xen/include/xen/sched-if.h  |  9 +++++----
 7 files changed, 33 insertions(+), 64 deletions(-)
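
In short, the new split of responsibilities looks like this (a condensed
sketch pieced together from the hunks below, not a literal quote of the
code):

    /* Each scheduler's switch_sched hook now only reports which lock to use. */
    static spinlock_t *
    csched_switch_sched(struct scheduler *new_ops, unsigned int cpu,
                        void *pdata, void *vdata)
    {
        struct schedule_data *sd = &per_cpu(schedule_data, cpu);

        /* ... per-scheduler setup, unchanged ... */
        return &sd->_lock;       /* credit2/RTDS return their shared lock instead */
    }

    /* schedule_cpu_switch(), its single caller, now does the common part itself: */
    new_lock = sched_switch_sched(new_ops, cpu, ppriv, vpriv);
    per_cpu(scheduler, cpu) = new_ops;
    sd->sched_priv = ppriv;
    smp_mb();                        /* order the writes above before rerouting the lock */
    sd->schedule_lock = new_lock;    /* reroute the per-CPU lock pointer as the last step */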

Comments

Dario Faggioli June 11, 2019, 4:55 p.m. UTC | #1
On Tue, 2019-05-28 at 12:32 +0200, Juergen Gross wrote:
> Instead of setting the scheduler percpu lock address in each of the
> switch_sched instances of the different schedulers, do that in
> schedule_cpu_switch(), which is the single caller of that function.
> For that purpose, let sched_switch_sched() just return the new lock
> address.
> 
This looks good to me. The only actual functional/behavioral difference
I've been able to spot is the fact that, in Credit2, we currently
switch the lock pointer while still holding the write lock on the
global scheduler. After this change, we don't any longer.

That being said, I've tried to think about how this could be a problem,
but failed at imagining such a scenario, so:

> Signed-off-by: Juergen Gross <jgross@suse.com>
>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

I'm wondering whether it makes sense for the above to be quickly
mentioned in the changelog, but am leaning toward "not really
necessary". In particular, I don't think it's worth respinning the patch
just for that... So, either just something that can be added while
committing, or forget it.

Regards
Jan Beulich June 12, 2019, 7:40 a.m. UTC | #2
>>> On 11.06.19 at 18:55, <dfaggioli@suse.com> wrote:
> On Tue, 2019-05-28 at 12:32 +0200, Juergen Gross wrote:
>> Instead of setting the scheduler percpu lock address in each of the
>> switch_sched instances of the different schedulers, do that in
>> schedule_cpu_switch(), which is the single caller of that function.
>> For that purpose, let sched_switch_sched() just return the new lock
>> address.
>> 
> This looks good to me. The only actual functional/behavioral difference
> I've been able to spot is the fact that, in Credit2, we currently
> switch the lock pointer while still holding the write lock on the
> global scheduler. After this change, we don't any longer.
> 
> That being said, I've tried to think about how this could be a problem,
> but failed at imagining such a scenario, so:
> 
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>>
> Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
> 
> I'm wondering whether it makes sense for the above to be quickly
> mentioned in the changelog, but am leaning toward "not really
> necessary". In particular, I don't think it's worth respinning the patch
> just for that... So, either just something that can be added while
> committing, or forget it.

I'd be happy to add something while committing, but one of you
would need to propose the wording of this "something".

Jan
Andrew Cooper June 12, 2019, 8:05 a.m. UTC | #3
On 28/05/2019 11:32, Juergen Gross wrote:
> @@ -1870,8 +1871,19 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>      old_lock = pcpu_schedule_lock_irq(cpu);
>  
>      vpriv_old = idle->sched_priv;
> -    ppriv_old = per_cpu(schedule_data, cpu).sched_priv;
> -    sched_switch_sched(new_ops, cpu, ppriv, vpriv);
> +    ppriv_old = sd->sched_priv;
> +    new_lock = sched_switch_sched(new_ops, cpu, ppriv, vpriv);
> +
> +    per_cpu(scheduler, cpu) = new_ops;
> +    sd->sched_priv = ppriv;
> +
> +    /*
> +     * (Re?)route the lock to the per pCPU lock as /last/ thing. In fact,
> +     * if it is free (and it can be) we want that anyone that manages
> +     * taking it, finds all the initializations we've done above in place.
> +     */
> +    smp_mb();

I realise you're just moving existing code, but this barrier sticks out
like a sore thumb.

A full memory barrier is a massive overhead for what should be
smp_wmb().  The matching barrier is actually hidden in the implicit
semantics of managing to lock sd->schedule_lock (which is trial and error
anyway), but the only thing that matters here is that all other written
data is in place first.

Beyond that, local causality will cause all reads to be in order (not
that they are important) due to logic dependencies.  Any that miss out on
this are an optimisation-waiting-to-happen as the compiler could elide
them fully.

~Andrew

> +    sd->schedule_lock = new_lock;
>  
>      /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
>      spin_unlock_irq(old_lock);
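
To spell out the pairing Andrew is describing (a sketch, not the committed
code): the writer only needs its stores ordered before the store that
publishes the lock pointer, because a reader can only act on the new
pointer by taking the lock, and the sched_lock helpers re-check the
pointer after acquiring it.

    /* Writer: schedule_cpu_switch() */
    per_cpu(scheduler, cpu) = new_ops;
    sd->sched_priv = ppriv;
    smp_wmb();                       /* order the writes above before the publish */
    sd->schedule_lock = new_lock;    /* publish: readers may now pick up new_lock */

A reader that loads the old pointer still takes the old (still valid) lock
and then notices the mismatch; a reader that loads new_lock acquires it,
and the barrier implicit in that acquisition is the counterpart of the
smp_wmb() above.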
Jürgen Groß June 12, 2019, 8:06 a.m. UTC | #4
On 12.06.19 09:40, Jan Beulich wrote:
>>>> On 11.06.19 at 18:55, <dfaggioli@suse.com> wrote:
>> On Tue, 2019-05-28 at 12:32 +0200, Juergen Gross wrote:
>>> Instead of setting the scheduler percpu lock address in each of the
>>> switch_sched instances of the different schedulers, do that in
>>> schedule_cpu_switch(), which is the single caller of that function.
>>> For that purpose, let sched_switch_sched() just return the new lock
>>> address.
>>>
>> This looks good to me. The only actual functional/behavioral difference
>> I've been able to spot is the fact that, in Credit2, we currently
>> switch the lock pointer while still holding the write lock on the
>> global scheduler. After this change, we don't any longer.
>>
>> That being said, I've tried to think about how this could be a problem,
>> but failed at imagining such a scenario, so:
>>
>>> Signed-off-by: Juergen Gross <jgross@suse.com>
>>>
>> Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
>>
>> I'm wondering whether it makes sense for the above to be quickly
>> mentioned in the changelog, but am leaning toward "not really
>> necessary". In particular, I don't think it's worth respinning the patch
>> just for that... So, either just something that can be added while
>> committing, or forget it.
> 
> I'd be happy to add something while committing, but one of you
> would need to propose the wording of this "something".

What about:

It should be noted that in credit2 the lock used to be set while still
holding the global scheduler write lock, which will no longer be true
with the new scheme applied. This is actually no problem as the write
lock is meant to guard the call of init_pdata(), which is still the case.


Juergen
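
Concretely, the ordering change being described is (condensed from the
sched_credit2.c hunk below; init_pdata() shown only as a comment):

    /* Before: csched2_switch_sched() rerouted the lock while still under prv->lock. */
    write_lock(&prv->lock);
    /* ... init_pdata() and the per-CPU assignments; the init_pdata() call is
     * what prv->lock is really guarding ... */
    smp_mb();
    per_cpu(schedule_data, cpu).schedule_lock = &prv->rqd[rqi].lock;
    write_unlock(&prv->lock);

    /* After: the hook only returns the lock; schedule_cpu_switch() reroutes it
     * later, outside prv->lock but still under the old per-CPU scheduler lock. */
    write_lock(&prv->lock);
    /* ... init_pdata() as before ... */
    write_unlock(&prv->lock);
    return &prv->rqd[rqi].lock;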
Jürgen Groß June 12, 2019, 8:19 a.m. UTC | #5
On 12.06.19 10:05, Andrew Cooper wrote:
> On 28/05/2019 11:32, Juergen Gross wrote:
>> @@ -1870,8 +1871,19 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>>       old_lock = pcpu_schedule_lock_irq(cpu);
>>   
>>       vpriv_old = idle->sched_priv;
>> -    ppriv_old = per_cpu(schedule_data, cpu).sched_priv;
>> -    sched_switch_sched(new_ops, cpu, ppriv, vpriv);
>> +    ppriv_old = sd->sched_priv;
>> +    new_lock = sched_switch_sched(new_ops, cpu, ppriv, vpriv);
>> +
>> +    per_cpu(scheduler, cpu) = new_ops;
>> +    sd->sched_priv = ppriv;
>> +
>> +    /*
>> +     * (Re?)route the lock to the per pCPU lock as /last/ thing. In fact,
>> +     * if it is free (and it can be) we want that anyone that manages
>> +     * taking it, finds all the initializations we've done above in place.
>> +     */
>> +    smp_mb();
> 
> I realise you're just moving existing code, but this barrier sticks out
> like a sore thumb.
> 
> A full memory barrier is a massive overhead for what should be
> smp_wmb().  The matching barrier is actually hidden in the implicit
> semantics of managing to lock sd->schedule_lock (which is trial and error
> anyway), but the only thing that matters here is that all other written
> data is in place first.
> 
> Beyond that, local causality will cause all reads to be in order (not
> that they are important) due to logic dependencies.  Any that miss out on
> this are an optimisation-waiting-to-happen as the compiler could elide
> them fully.

Not that it would really matter for performance (switching cpus between
cpupools is a _very_ rare operation), I'm fine transforming the barrier
into smp_wmb().


Juergen
Jan Beulich June 12, 2019, 9:32 a.m. UTC | #6
>>> On 12.06.19 at 10:19, <jgross@suse.com> wrote:
> On 12.06.19 10:05, Andrew Cooper wrote:
>> On 28/05/2019 11:32, Juergen Gross wrote:
>>> @@ -1870,8 +1871,19 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>>>       old_lock = pcpu_schedule_lock_irq(cpu);
>>>   
>>>       vpriv_old = idle->sched_priv;
>>> -    ppriv_old = per_cpu(schedule_data, cpu).sched_priv;
>>> -    sched_switch_sched(new_ops, cpu, ppriv, vpriv);
>>> +    ppriv_old = sd->sched_priv;
>>> +    new_lock = sched_switch_sched(new_ops, cpu, ppriv, vpriv);
>>> +
>>> +    per_cpu(scheduler, cpu) = new_ops;
>>> +    sd->sched_priv = ppriv;
>>> +
>>> +    /*
>>> +     * (Re?)route the lock to the per pCPU lock as /last/ thing. In fact,
>>> +     * if it is free (and it can be) we want that anyone that manages
>>> +     * taking it, finds all the initializations we've done above in place.
>>> +     */
>>> +    smp_mb();
>> 
>> I realise you're just moving existing code, but this barrier sticks out
>> like a sore thumb.
>> 
>> A full memory barrier is a massive overhead for what should be
>> smp_wmb().  The matching barrier is actually hidden in the implicit
>> semantics of managing to lock sd->schedule_lock (which is trial and error
>> anyway), but the only thing that matters here is that all other written
>> data is in place first.
>> 
>> Beyond that, local causality will cause all reads to be in order (not
>> that they are important) due to logic dependencies.  Any that miss out on
>> this are an optimisation-waiting-to-happen as the compiler could elide
>> them fully.
> 
> Not that it would really matter for performance (switching cpus between
> cpupools is a _very_ rare operation), I'm fine transforming the barrier
> into smp_wmb().

This again is a change easy enough to make while committing. I'm
recording the above in case I end up being the committer.

Jan
Dario Faggioli June 12, 2019, 9:44 a.m. UTC | #7
On Wed, 2019-06-12 at 10:06 +0200, Juergen Gross wrote:
> On 12.06.19 09:40, Jan Beulich wrote:
> > > > > On 11.06.19 at 18:55, <dfaggioli@suse.com> wrote:
> > > On Tue, 2019-05-28 at 12:32 +0200, Juergen Gross wrote:
> > > > Signed-off-by: Juergen Gross <jgross@suse.com>
> > > > 
> > > Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
> > > 
> > > I'm wondering whether it makes sense for the above to be quickly
> > > mentioned in the changelog, but am leaning toward "not really
> > > necessary". In particular, I don't think it's worth respinning the
> > > patch just for that... So, either just something that can be added
> > > while committing, or forget it.
> > 
> > I'd be happy to add something while committing, but one of you
> > would need to propose the wording of this "something".
> 
> What about:
> 
> It should be noted that in credit2 the lock used to be set while still
> holding the global scheduler write lock, which will no longer be true
> with the new scheme applied. This is actually no problem as the write
> lock is meant to guard the call of init_pdata(), which is still the case.
> 
I'm ok with this.

Thanks (to both) and Regards
Dario Faggioli June 12, 2019, 9:56 a.m. UTC | #8
On Wed, 2019-06-12 at 10:32 +0100, Jan Beulich wrote:
> > > > On 12.06.19 at 10:19, <jgross@suse.com> wrote:
> > On 12.06.19 10:05, Andrew Cooper wrote:
> > > On 28/05/2019 11:32, Juergen Gross wrote:
> > > > 
> > > > +    per_cpu(scheduler, cpu) = new_ops;
> > > > +    sd->sched_priv = ppriv;
> > > > +
> > > > +    /*
> > > > +     * (Re?)route the lock to the per pCPU lock as /last/ thing. In fact,
> > > > +     * if it is free (and it can be) we want that anyone that manages
> > > > +     * taking it, finds all the initializations we've done above in place.
> > > > +     */
> > > > +    smp_mb();
> > > 
> > > A full memory barrier is a massive overhead for what should be
> > > smp_wmb().  The matching barrier is actually hidden in the implicit
> > > semantics of managing to lock sd->schedule_lock (which is trial and
> > > error anyway), but the only thing that matters here is that all other
> > > written data is in place first.
> > > 
> > Not that it would really matter for performance (switching cpus between
> > cpupools is a _very_ rare operation), I'm fine transforming the barrier
> > into smp_wmb().
> 
> This again is a change easy enough to make while committing. I'm
> recording the above in case I end up being the committer.
> 
I'm fine (i.e., my Rev-by: remains valid) with this being turned into a
wmb(), and it being done upon commit (thanks Jan for the offer to do
that!).

If we do it, however, I think the comment needs to be adjusted too, and
the commit message needs to briefly mention this new change we're
doing.

Maybe, for the comment, add something like:

"The smp_wmb() corresponds to the barrier implicit in successfully
taking the lock."

And, for the changelog, add:

"While there, turn the full barrier, which was overkill, into an
smp_wmb(), matching with the one implicit in managing to take the
lock."

Or something similar (and again, R-b: still stands with such changes).

Regards
Jan Beulich June 12, 2019, 10:14 a.m. UTC | #9
>>> On 12.06.19 at 11:56, <dfaggioli@suse.com> wrote:
> On Wed, 2019-06-12 at 10:32 +0100, Jan Beulich wrote:
>> > > > On 12.06.19 at 10:19, <jgross@suse.com> wrote:
>> > On 12.06.19 10:05, Andrew Cooper wrote:
>> > > On 28/05/2019 11:32, Juergen Gross wrote:
>> > > > 
>> > > > +    per_cpu(scheduler, cpu) = new_ops;
>> > > > +    sd->sched_priv = ppriv;
>> > > > +
>> > > > +    /*
>> > > > +     * (Re?)route the lock to the per pCPU lock as /last/ thing. In fact,
>> > > > +     * if it is free (and it can be) we want that anyone that manages
>> > > > +     * taking it, finds all the initializations we've done above in place.
>> > > > +     */
>> > > > +    smp_mb();
>> > > 
>> > > A full memory barrier is a massive overhead for what should be
>> > > smp_wmb().  The matching barrier is actually hidden in the implicit
>> > > semantics of managing to lock sd->schedule_lock (which is trial and
>> > > error anyway), but the only thing that matters here is that all other
>> > > written data is in place first.
>> > > 
>> > Not that it would really matter for performance (switching cpus between
>> > cpupools is a _very_ rare operation), I'm fine transforming the barrier
>> > into smp_wmb().
>> 
>> This again is a change easy enough to make while committing. I'm
>> recording the above in case I end up being the committer.
>> 
> I'm fine (i.e., my Rev-by: remains valid) with this being turned into a
> wmb(), and it being done upon commit (thanks Jan for the offer to do
> that!).
> 
> If we do it, however, I think the comment needs to be adjusted too, and
> the commit message needs to briefly mention this new change we're
> doing.
> 
> Maybe, for the comment, add something like:
> 
> "The smp_wmb() corresponds to the barrier implicit in successfully
> taking the lock."

I'm not entirely happy with this one: Taking a lock is an implicit full
barrier, so cannot alone be used to reason why a write barrier is
sufficient here.

> And, for the changelog, add:
> 
> "While there, turn the full barrier, which was overkill, into an
> smp_wmb(), matching with the one implicit in managing to take the
> lock."

This one reads fine to me.

Jan
Andrew Cooper June 12, 2019, 11:27 a.m. UTC | #10
On 12/06/2019 11:14, Jan Beulich wrote:
>>>> On 12.06.19 at 11:56, <dfaggioli@suse.com> wrote:
>> On Wed, 2019-06-12 at 10:32 +0100, Jan Beulich wrote:
>>>>>> On 12.06.19 at 10:19, <jgross@suse.com> wrote:
>>>> On 12.06.19 10:05, Andrew Cooper wrote:
>>>>> On 28/05/2019 11:32, Juergen Gross wrote:
>>>>>> +    per_cpu(scheduler, cpu) = new_ops;
>>>>>> +    sd->sched_priv = ppriv;
>>>>>> +
>>>>>> +    /*
>>>>>> +     * (Re?)route the lock to the per pCPU lock as /last/ thing. In fact,
>>>>>> +     * if it is free (and it can be) we want that anyone that manages
>>>>>> +     * taking it, finds all the initializations we've done above in place.
>>>>>> +     */
>>>>>> +    smp_mb();
>>>>> A full memory barrier is a massive overhead for what should be
>>>>> smp_wmb().  The matching barrier is actually hidden in the implicit
>>>>> semantics of managing to lock sd->schedule_lock (which is trial and
>>>>> error anyway), but the only thing that matters here is that all other
>>>>> written data is in place first.
>>>>>
>>>> Not that it would really matter for performance (switching cpus between
>>>> cpupools is a _very_ rare operation), I'm fine transforming the barrier
>>>> into smp_wmb().
>>> This again is a change easy enough to make while committing. I'm
>>> recording the above in case I end up being the committer.
>>>
>> I'm fine (i.e., my Rev-by: remains valid) with this being turned into a
>> wmb(), and it being done upon commit (thanks Jan for the offer to do
>> that!).
>>
>> If we do it, however, I think the comment needs to be adjusted too, and
>> the commit message needs to briefly mention this new change we're
>> doing.
>>
>> Maybe, for the comment, add something like:
>>
>> "The smp_wmb() corresponds to the barrier implicit in successfully
>> taking the lock."
> I'm not entirely happy with this one: Taking a lock is an implicit full
> barrier, so cannot alone be used to reason why a write barrier is
> sufficient here.

It is a consequence of our extra magic scheduler locks which protect the
pointer used to locate them, and the ensuing double-step in ???
(seriously - I can't figure out the function names created by the
sched_lock() monstrosity) which take the lock, re-read the lock pointer
and drop on a mismatch.

I've gone with:

+    /*
+     * The data above is protected under new_lock, which may be unlocked.
+     * Another CPU can take new_lock as soon as sd->schedule_lock is visible,
+     * and must observe all prior initialisation.
+     */
+    smp_wmb();

~Andrew
Dario Faggioli June 12, 2019, 1:32 p.m. UTC | #11
On Wed, 2019-06-12 at 12:27 +0100, Andrew Cooper wrote:
> It is a consequence of our extra magic scheduler locks which protect the
> pointer used to locate them, and the ensuing double-step in ???
> (seriously - I can't figure out the function names created by the
> sched_lock() monstrosity) which take the lock, re-read the lock pointer
> and drop on a mismatch.
> 
Just FTR, they're:

vcpu_schedule_lock()
vcpu_schedule_lock_irq()
vcpu_schedule_lock_irqsave()

and:

pcpu_schedule_lock()
pcpu_schedule_lock_irq()
pcpu_schedule_lock_irqsave()

and the corresponding *_schedule_unlock_*() ones, of course. ;-P

Regards
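
For the record, the double-step Andrew refers to looks roughly like this
(a simplified rendering of what the macro-generated helpers in sched-if.h
do, ignoring the irq/irqsave and vcpu variants):

    static inline spinlock_t *pcpu_schedule_lock(unsigned int cpu)
    {
        spinlock_t *lock;

        for ( ; ; )
        {
            lock = per_cpu(schedule_data, cpu).schedule_lock;
            spin_lock(lock);
            /* Pointer unchanged: we hold the lock currently routed to this CPU. */
            if ( likely(lock == per_cpu(schedule_data, cpu).schedule_lock) )
                return lock;
            /* The lock was rerouted underneath us: drop it and retry. */
            spin_unlock(lock);
        }
    }

The re-read after spin_lock() is what lets schedule_cpu_switch() reroute
sd->schedule_lock while holding only the old lock.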

Patch

diff --git a/xen/common/sched_arinc653.c b/xen/common/sched_arinc653.c
index a4c6d00b81..72b988ea5f 100644
--- a/xen/common/sched_arinc653.c
+++ b/xen/common/sched_arinc653.c
@@ -630,7 +630,7 @@  a653sched_pick_cpu(const struct scheduler *ops, struct vcpu *vc)
  * @param pdata     scheduler specific PCPU data (we don't have any)
  * @param vdata     scheduler specific VCPU data of the idle vcpu
  */
-static void
+static spinlock_t *
 a653_switch_sched(struct scheduler *new_ops, unsigned int cpu,
                   void *pdata, void *vdata)
 {
@@ -641,17 +641,7 @@  a653_switch_sched(struct scheduler *new_ops, unsigned int cpu,
 
     idle_vcpu[cpu]->sched_priv = vdata;
 
-    per_cpu(scheduler, cpu) = new_ops;
-    per_cpu(schedule_data, cpu).sched_priv = NULL; /* no pdata */
-
-    /*
-     * (Re?)route the lock to its default location. We actually do not use
-     * it, but if we leave it pointing to where it does now (i.e., the
-     * runqueue lock for this PCPU in the default scheduler), we'd be
-     * causing unnecessary contention on that lock (in cases where it is
-     * shared among multiple PCPUs, like in Credit2 and RTDS).
-     */
-    sd->schedule_lock = &sd->_lock;
+    return &sd->_lock;
 }
 
 /**
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 7b7facbace..621c408ed0 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -631,7 +631,7 @@  csched_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
 }
 
 /* Change the scheduler of cpu to us (Credit). */
-static void
+static spinlock_t *
 csched_switch_sched(struct scheduler *new_ops, unsigned int cpu,
                     void *pdata, void *vdata)
 {
@@ -653,16 +653,7 @@  csched_switch_sched(struct scheduler *new_ops, unsigned int cpu,
     init_pdata(prv, pdata, cpu);
     spin_unlock(&prv->lock);
 
-    per_cpu(scheduler, cpu) = new_ops;
-    per_cpu(schedule_data, cpu).sched_priv = pdata;
-
-    /*
-     * (Re?)route the lock to the per pCPU lock as /last/ thing. In fact,
-     * if it is free (and it can be) we want that anyone that manages
-     * taking it, finds all the initializations we've done above in place.
-     */
-    smp_mb();
-    sd->schedule_lock = &sd->_lock;
+    return &sd->_lock;
 }
 
 #ifndef NDEBUG
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 9c1c3b4e08..8e4381d8a7 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -3855,7 +3855,7 @@  csched2_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
 }
 
 /* Change the scheduler of cpu to us (Credit2). */
-static void
+static spinlock_t *
 csched2_switch_sched(struct scheduler *new_ops, unsigned int cpu,
                      void *pdata, void *vdata)
 {
@@ -3888,18 +3888,9 @@  csched2_switch_sched(struct scheduler *new_ops, unsigned int cpu,
      */
     ASSERT(per_cpu(schedule_data, cpu).schedule_lock != &prv->rqd[rqi].lock);
 
-    per_cpu(scheduler, cpu) = new_ops;
-    per_cpu(schedule_data, cpu).sched_priv = pdata;
-
-    /*
-     * (Re?)route the lock to the per pCPU lock as /last/ thing. In fact,
-     * if it is free (and it can be) we want that anyone that manages
-     * taking it, find all the initializations we've done above in place.
-     */
-    smp_mb();
-    per_cpu(schedule_data, cpu).schedule_lock = &prv->rqd[rqi].lock;
-
     write_unlock(&prv->lock);
+
+    return &prv->rqd[rqi].lock;
 }
 
 static void
diff --git a/xen/common/sched_null.c b/xen/common/sched_null.c
index a59dbb2692..10427b37ab 100644
--- a/xen/common/sched_null.c
+++ b/xen/common/sched_null.c
@@ -381,8 +381,9 @@  static void vcpu_deassign(struct null_private *prv, struct vcpu *v,
 }
 
 /* Change the scheduler of cpu to us (null). */
-static void null_switch_sched(struct scheduler *new_ops, unsigned int cpu,
-                              void *pdata, void *vdata)
+static spinlock_t *null_switch_sched(struct scheduler *new_ops,
+                                     unsigned int cpu,
+                                     void *pdata, void *vdata)
 {
     struct schedule_data *sd = &per_cpu(schedule_data, cpu);
     struct null_private *prv = null_priv(new_ops);
@@ -401,16 +402,7 @@  static void null_switch_sched(struct scheduler *new_ops, unsigned int cpu,
 
     init_pdata(prv, cpu);
 
-    per_cpu(scheduler, cpu) = new_ops;
-    per_cpu(schedule_data, cpu).sched_priv = pdata;
-
-    /*
-     * (Re?)route the lock to the per pCPU lock as /last/ thing. In fact,
-     * if it is free (and it can be) we want that anyone that manages
-     * taking it, finds all the initializations we've done above in place.
-     */
-    smp_mb();
-    sd->schedule_lock = &sd->_lock;
+    return &sd->_lock;
 }
 
 static void null_vcpu_insert(const struct scheduler *ops, struct vcpu *v)
diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
index f1b81f0373..0acfc3d702 100644
--- a/xen/common/sched_rt.c
+++ b/xen/common/sched_rt.c
@@ -729,7 +729,7 @@  rt_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
 }
 
 /* Change the scheduler of cpu to us (RTDS). */
-static void
+static spinlock_t *
 rt_switch_sched(struct scheduler *new_ops, unsigned int cpu,
                 void *pdata, void *vdata)
 {
@@ -761,16 +761,8 @@  rt_switch_sched(struct scheduler *new_ops, unsigned int cpu,
     }
 
     idle_vcpu[cpu]->sched_priv = vdata;
-    per_cpu(scheduler, cpu) = new_ops;
-    per_cpu(schedule_data, cpu).sched_priv = NULL; /* no pdata */
 
-    /*
-     * (Re?)route the lock to the per pCPU lock as /last/ thing. In fact,
-     * if it is free (and it can be) we want that anyone that manages
-     * taking it, find all the initializations we've done above in place.
-     */
-    smp_mb();
-    per_cpu(schedule_data, cpu).schedule_lock = &prv->lock;
+    return &prv->lock;
 }
 
 static void
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 4da970c543..6dc96b3cd4 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -1812,7 +1812,8 @@  int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
     struct scheduler *old_ops = per_cpu(scheduler, cpu);
     struct scheduler *new_ops = (c == NULL) ? &ops : c->sched;
     struct cpupool *old_pool = per_cpu(cpupool, cpu);
-    spinlock_t * old_lock;
+    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
+    spinlock_t *old_lock, *new_lock;
 
     /*
      * pCPUs only move from a valid cpupool to free (i.e., out of any pool),
@@ -1870,8 +1871,19 @@  int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
     old_lock = pcpu_schedule_lock_irq(cpu);
 
     vpriv_old = idle->sched_priv;
-    ppriv_old = per_cpu(schedule_data, cpu).sched_priv;
-    sched_switch_sched(new_ops, cpu, ppriv, vpriv);
+    ppriv_old = sd->sched_priv;
+    new_lock = sched_switch_sched(new_ops, cpu, ppriv, vpriv);
+
+    per_cpu(scheduler, cpu) = new_ops;
+    sd->sched_priv = ppriv;
+
+    /*
+     * (Re?)route the lock to the per pCPU lock as /last/ thing. In fact,
+     * if it is free (and it can be) we want that anyone that manages
+     * taking it, finds all the initializations we've done above in place.
+     */
+    smp_mb();
+    sd->schedule_lock = new_lock;
 
     /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
     spin_unlock_irq(old_lock);
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index b3c3e189d9..b8e2b2e49e 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -153,7 +153,7 @@  struct scheduler {
     /* Idempotent. */
     void         (*free_domdata)   (const struct scheduler *, void *);
 
-    void         (*switch_sched)   (struct scheduler *, unsigned int,
+    spinlock_t * (*switch_sched)   (struct scheduler *, unsigned int,
                                     void *, void *);
 
     /* Activate / deactivate vcpus in a cpu pool */
@@ -195,10 +195,11 @@  static inline void sched_deinit(struct scheduler *s)
     s->deinit(s);
 }
 
-static inline void sched_switch_sched(struct scheduler *s, unsigned int cpu,
-                                      void *pdata, void *vdata)
+static inline spinlock_t *sched_switch_sched(struct scheduler *s,
+                                             unsigned int cpu,
+                                             void *pdata, void *vdata)
 {
-    s->switch_sched(s, cpu, pdata, vdata);
+    return s->switch_sched(s, cpu, pdata, vdata);
 }
 
 static inline void sched_dump_settings(const struct scheduler *s)