
[09/16] xen: sched: close potential races when switching scheduler to CPUs

Message ID 20160318190505.8117.89778.stgit@Solace.station (mailing list archive)
State New, archived
Headers show

Commit Message

Dario Faggioli March 18, 2016, 7:05 p.m. UTC
by using the sched_switch hook that we have introduced in
the various schedulers.

The key is to let the actual switch of scheduler and the
remapping of the scheduler lock for the CPU (if necessary)
happen together (in the same critical section) protected
(at least) by the old scheduler lock for the CPU.

This also means that, in Credit2 and RTDS, we can get rid
of the code that was doing the scheduler lock remapping
in csched2_free_pdata() and rt_free_pdata(), and of their
triggering ASSERT-s.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Meng Xu <mengxu@cis.upenn.edu>
Cc: Tianyang Chen <tiche@seas.upenn.edu>
---
 xen/common/sched_credit.c  |    9 +++++++++
 xen/common/sched_credit2.c |   28 ++++++++++------------------
 xen/common/sched_rt.c      |   13 -------------
 xen/common/schedule.c      |   30 +++++++++++++++++++++---------
 4 files changed, 40 insertions(+), 40 deletions(-)

Comments

Meng Xu March 19, 2016, 2:25 a.m. UTC | #1
On Fri, Mar 18, 2016 at 3:05 PM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> by using the sched_switch hook that we have introduced in
> the various schedulers.
>
> The key is to let the actual switch of scheduler and the
> remapping of the scheduler lock for the CPU (if necessary)
> happen together (in the same critical section) protected
> (at least) by the old scheduler lock for the CPU.
>
> This also means that, in Credit2 and RTDS, we can get rid
> of the code that was doing the scheduler lock remapping
> in csched2_free_pdata() and rt_free_pdata(), and of their
> triggering ASSERT-s.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Meng Xu <mengxu@cis.upenn.edu>
> Cc: Tianyang Chen <tiche@seas.upenn.edu>
> ---
>  xen/common/sched_credit.c  |    9 +++++++++
>  xen/common/sched_credit2.c |   28 ++++++++++------------------
>  xen/common/sched_rt.c      |   13 -------------
>  xen/common/schedule.c      |   30 +++++++++++++++++++++---------

For the sched_rt.c and schedule.c,
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/
George Dunlap March 23, 2016, 7:05 p.m. UTC | #2
On 18/03/16 19:05, Dario Faggioli wrote:
> by using the sched_switch hook that we have introduced in
> the various schedulers.
> 
> The key is to let the actual switch of scheduler and the
> remapping of the scheduler lock for the CPU (if necessary)
> happen together (in the same critical section) protected
> (at least) by the old scheduler lock for the CPU.
> 
> This also means that, in Credit2 and RTDS, we can get rid
> of the code that was doing the scheduler lock remapping
> in csched2_free_pdata() and rt_free_pdata(), and of their
> triggering ASSERT-s.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Similar to my comment before -- in my own tree I squashed patches 6-9
into a single commit and found it much easier to review. :-)

One important question...

> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index 1adc0e2..29582a6 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -1617,7 +1617,6 @@ void __init scheduler_init(void)
>  int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>  {
>      struct vcpu *idle;
> -    spinlock_t *lock;
>      void *ppriv, *ppriv_old, *vpriv, *vpriv_old;
>      struct scheduler *old_ops = per_cpu(scheduler, cpu);
>      struct scheduler *new_ops = (c == NULL) ? &ops : c->sched;
> @@ -1640,11 +1639,21 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>      if ( old_ops == new_ops )
>          goto out;
>  
> +    /*
> +     * To setup the cpu for the new scheduler we need:
> +     *  - a valid instance of per-CPU scheduler specific data, as it is
> +     *    allocated by SCHED_OP(alloc_pdata). Note that we do not want to
> +     *    initialize it yet (i.e., we are not calling SCHED_OP(init_pdata)).
> +     *    That will be done by the target scheduler, in SCHED_OP(switch_sched),
> +     *    in proper ordering and with locking.
> +     *  - a valid instance of per-vCPU scheduler specific data, for the idle
> +     *    vCPU of cpu. That is what the target scheduler will use for the
> +     *    sched_priv field of the per-vCPU info of the idle domain.
> +     */
>      idle = idle_vcpu[cpu];
>      ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);
>      if ( IS_ERR(ppriv) )
>          return PTR_ERR(ppriv);
> -    SCHED_OP(new_ops, init_pdata, ppriv, cpu);
>      vpriv = SCHED_OP(new_ops, alloc_vdata, idle, idle->domain->sched_priv);
>      if ( vpriv == NULL )
>      {
> @@ -1652,17 +1661,20 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>          return -ENOMEM;
>      }
>  
> -    lock = pcpu_schedule_lock_irq(cpu);
> -
>      SCHED_OP(old_ops, tick_suspend, cpu);
> +
> +    /*
> +     * The actual switch, including (if necessary) the rerouting of the
> +     * scheduler lock to whatever new_ops prefers,  needs to happen in one
> +     * critical section, protected by old_ops' lock, or races are possible.
> +     * Since each scheduler has its own contraints and locking scheme, do
> +     * that inside specific scheduler code, rather than here.
> +     */
>      vpriv_old = idle->sched_priv;
> -    idle->sched_priv = vpriv;
> -    per_cpu(scheduler, cpu) = new_ops;
>      ppriv_old = per_cpu(schedule_data, cpu).sched_priv;
> -    per_cpu(schedule_data, cpu).sched_priv = ppriv;
> -    SCHED_OP(new_ops, tick_resume, cpu);
> +    SCHED_OP(new_ops, switch_sched, cpu, ppriv, vpriv);

Is it really safe to read sched_priv without the lock held?

 -George
George Dunlap March 24, 2016, 12:14 p.m. UTC | #3
On 18/03/16 19:05, Dario Faggioli wrote:
> by using the sched_switch hook that we have introduced in
> the various schedulers.
> 
> The key is to let the actual switch of scheduler and the
> remapping of the scheduler lock for the CPU (if necessary)
> happen together (in the same critical section) protected
> (at least) by the old scheduler lock for the CPU.

Thanks for trying to sort this out -- I've been looking at this since
yesterday afternoon and it certainly makes my head hurt. :-)

It looks like you want to do the locking inside the sched_switch()
callback, rather than outside of it, so that you can get the locking
order right (global private before per-cpu scheduler lock).  Otherwise
you could just have schedule_cpu_switch grab and release the lock, and
let the sched_switch() callback set the lock as needed (knowing that the
correct lock is already held and will be released).

But the ordering between prv->lock and the scheduler lock only needs to
be between the prv lock and scheduler lock *of a specific instance* of
the credit2 scheduler -- i.e., between prv->lock and prv->rqd[].lock.

And, critically, if we're calling sched_switch, then we already know
that the current pcpu lock is *not* one of the prv->rqd[].lock's because
we check that at the top of schedule_cpu_switch().

So I think there should be no problem with:
1. Grabbing the pcpu schedule lock in schedule_cpu_switch()
2. Grabbing prv->lock in csched2_switch_sched()
3. Setting the per_cpu schedule lock as the very last thing in
csched2_switch_sched()
4. Releasing the (old) pcpu schedule lock in schedule_cpu_switch().

What do you think?
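
In pseudo-code, what I have in mind is roughly the following.  This is
just a sketch to illustrate the ordering -- the names and details are
made up from what's already in this patch, not necessarily what the
final code has to look like:

 /* schedule_cpu_switch(): steps 1 and 4 */
 lock = pcpu_schedule_lock_irq(cpu);                     /* 1 */
 ...
 SCHED_OP(new_ops, switch_sched, cpu, ppriv, vpriv);
 ...
 /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
 spin_unlock_irq(lock);                                  /* 4 */

 /* csched2_switch_sched(): steps 2 and 3 */
 static void
 csched2_switch_sched(struct scheduler *new_ops, unsigned int cpu,
                      void *pdata, void *vdata)
 {
     struct csched2_private *prv = CSCHED2_PRIV(new_ops);
     unsigned int rqi;

     spin_lock(&prv->lock);                  /* 2 (IRQs already off) */
     rqi = init_pdata(prv, cpu);

     idle_vcpu[cpu]->sched_priv = vdata;
     per_cpu(scheduler, cpu) = new_ops;
     per_cpu(schedule_data, cpu).sched_priv = pdata;

     /* 3: reroute the scheduler lock as the very last thing */
     per_cpu(schedule_data, cpu).schedule_lock = &prv->rqd[rqi].lock;

     spin_unlock(&prv->lock);
 }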

That would allow us to read ppriv_old and vpriv_old with the
schedule_lock held.

Unfortunately I can't off the top of my head think of a good assertion
to put in at #2 to assert that the per-pcpu lock is *not* one of the
runqueue locks in prv, because we don't yet know which runqueue this cpu
will be assigned to.  But we could check when we actually do the lock
assignment to make sure that it's not already equal.  That way we'll
either deadlock or ASSERT (which is not as good as always ASSERTing, but
is better than either deadlocking or working fine).

As an aside -- it seems to me that as soon as we change the scheduler
lock, there's a risk that something else may come along and try to grab
it / access the data.  Does that mean we really ought to use memory
barriers to make sure that the lock is written only after all changes to
the scheduler data have been appropriately made?
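
Something like this, I mean (purely illustrative):

     per_cpu(schedule_data, cpu).sched_priv = pdata;
     ...
     /* Make sure the updates above are visible before anyone can see
      * (and take) the new lock. */
     smp_wmb();
     per_cpu(schedule_data, cpu).schedule_lock = &prv->rqd[rqi].lock;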

> This also means that, in Credit2 and RTDS, we can get rid
> of the code that was doing the scheduler lock remapping
> in csched2_free_pdata() and rt_free_pdata(), and of their
> triggering ASSERT-s.

Right -- so to put it a different way, *all* schedulers must now set the
locking scheme they wish to use, even if they want to use the default
per-cpu locks.  I think that means we have to do that for arinc653 too,
right?

At first I thought we could look at having schedule_cpu_switch() always
reset the lock before calling the switch_sched() callback; but if my
comment about memory barriers is accurate, then that won't work either.
 In any case, there are only 4 schedulers, so it's not that hard to just
have them all set the locking scheme they want.

 -George
Dario Faggioli April 5, 2016, 4:26 p.m. UTC | #4
On Wed, 2016-03-23 at 19:05 +0000, George Dunlap wrote:
> On 18/03/16 19:05, Dario Faggioli wrote:
> > 
> > by using the sched_switch hook that we have introduced in
> > the various schedulers.
> > 
> > The key is to let the actual switch of scheduler and the
> > remapping of the scheduler lock for the CPU (if necessary)
> > happen together (in the same critical section) protected
> > (at least) by the old scheduler lock for the CPU.
> > 
> > This also means that, in Credit2 and RTDS, we can get rid
> > of the code that was doing the scheduler lock remapping
> > in csched2_free_pdata() and rt_free_pdata(), and of their
> > triggering ASSERT-s.
> > 
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Similar to my comment before -- in my own tree I squashed patches 6-9
> into a single commit and found it much easier to review. :-)
> 
I understand your point.

I'll consider doing something like this for v2 (that I'm just finishing
putting together), but I'm not sure I like it.

For instance, although the issue has the same roots and similar
consequences for all schedulers, the actual race is different between
Credit1 and Credit2 (RTDS is the same as Credit2), and having distinct
patches for each scheduler allows me to describe both situations in
detail, in their respective changelogs, without the changelogs
themselves becoming too long (they're actually quite long already!!).

> One important question...
> > 
> > --- a/xen/common/schedule.c
> > +++ b/xen/common/schedule.c
> > 
> > @@ -1652,17 +1661,20 @@ int schedule_cpu_switch(unsigned int cpu,
> > struct cpupool *c)
> >          return -ENOMEM;
> >      }
> >  
> > -    lock = pcpu_schedule_lock_irq(cpu);
> > -
> >      SCHED_OP(old_ops, tick_suspend, cpu);
> > +
> > +    /*
> > +     * The actual switch, including (if necessary) the rerouting
> > of the
> > +     * scheduler lock to whatever new_ops prefers,  needs to
> > happen in one
> > +     * critical section, protected by old_ops' lock, or races are
> > possible.
> > +     * Since each scheduler has its own contraints and locking
> > scheme, do
> > +     * that inside specific scheduler code, rather than here.
> > +     */
> >      vpriv_old = idle->sched_priv;
> > -    idle->sched_priv = vpriv;
> > -    per_cpu(scheduler, cpu) = new_ops;
> >      ppriv_old = per_cpu(schedule_data, cpu).sched_priv;
> > -    per_cpu(schedule_data, cpu).sched_priv = ppriv;
> > -    SCHED_OP(new_ops, tick_resume, cpu);
> > +    SCHED_OP(new_ops, switch_sched, cpu, ppriv, vpriv);
> Is it really safe to read sched_priv without the lock held?
> 
So, you put down a lot more reasoning on this issue in another email,
and I'll reply in more length to that one.

But just about this specific thing. We're in schedule_cpu_switch(), and
schedule_cpu_switch() is indeed the only function that changes the
content of sd->sched_priv, when the system is _live_. It reads the
old pointer, stashes it, allocates the new one, assigns it, and frees the
old one. It's therefore only because of this function that a race can
happen.

In fact, the only other situation where sched_priv changes is during
cpu bringup (CPU_UP_PREPARE phase), or teardown. But those cases are
not of much concern (and, in fact, there's no locking in there,
independently of this series).

Now, schedule_cpu_switch is called by:

1 cpupool.c  cpupool_assign_cpu_locked    268 ret = schedule_cpu_switch(cpu, c);
2 cpupool.c  cpupool_unassign_cpu_helper  325 ret = schedule_cpu_switch(cpu, NULL);

to move the cpu into or out of a cpupool, and in both cases we
have already taken the cpupool_lock spinlock when calling it.
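
I.e., roughly speaking (simplified, not the actual cpupool.c code):

 spin_lock(&cpupool_lock);
 ...
 ret = schedule_cpu_switch(cpu, c);  /* or (cpu, NULL), when unassigning */
 ...
 spin_unlock(&cpupool_lock);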

So, yes, it looks to me that sched_priv is safe to be manipulated like
the patch is doing... Am I overlooking something?

Thanks and Regards,
Dario
Dario Faggioli April 5, 2016, 5:37 p.m. UTC | #5
On Thu, 2016-03-24 at 12:14 +0000, George Dunlap wrote:
> On 18/03/16 19:05, Dario Faggioli wrote:
> > 
> > by using the sched_switch hook that we have introduced in
> > the various schedulers.
> > 
> > The key is to let the actual switch of scheduler and the
> > remapping of the scheduler lock for the CPU (if necessary)
> > happen together (in the same critical section) protected
> > (at least) by the old scheduler lock for the CPU.
> Thanks for trying to sort this out -- I've been looking this since
> yesterday afternoon and it certainly makes my head hurt. :-)
> 
> It looks like you want to do the locking inside the sched_switch()
> callback, rather than outside of it, so that you can get the locking
> order right (global private before per-cpu scheduler lock).  
>
Yes, especially considering that such locking order varies with the
scheduler. In fact, in Credit1, it's per-cpu first, global second; in
Credit2, it's the other way round: global first, per-runq second. In
RTDS there is only one global lock and in ARINC only the per-cpu locks.

> Otherwise
> you could just have schedule_cpu_switch grab and release the lock,
> and
> let the sched_switch() callback set the lock as needed (knowing that
> the
> correct lock is already held and will be released).
> 
Yeah, that's after all what is being done even right now. Of course,
right now it's buggy but, even though it can probably be made to work in
the way you suggest, I thought bringing anything that has to do with
the locking order required by a specific scheduler _inside_ the
specific scheduler code would be a good thing.

I actually still think that, but I recognize the ugliness of accessing
ppriv_old and vpriv_old outside of scheduler locks (although, as I said
in the other email, it should be safe, as switching scheduler is
serialized by cpupool_lock anyway... perhaps I should put this in a
comment), not to mention how ugly the interface provided by
sched_switch() is (talking about the last two parameters :-/).
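
For reference, given how it is invoked, i.e.,
SCHED_OP(new_ops, switch_sched, cpu, ppriv, vpriv), the hook prototype
looks more or less like this (pdata and vdata being the just allocated,
but not yet initialized, per-CPU and idle vCPU scheduler data):

 void (*switch_sched)(struct scheduler *new_ops, unsigned int cpu,
                      void *pdata, void *vdata);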

> But the ordering between prv->lock and the scheduler lock only needs
> to
> be between the prv lock and scheduler lock *of a specific instance*
> of
> the credit2 scheduler -- i.e., between prv->lock and prv->rqd[].lock.
> 
True...

> And, critically, if we're calling sched_switch, then we already know
> that the current pcpu lock is *not* one of the prv->rqd[].lock's
> because
> we check that at the top of schedule_cpu_switch().
> 
... True again...

> So I think there should be no problem with:
> 1. Grabbing the pcpu schedule lock in schedule_cpu_switch()
> 2. Grabbing prv->lock in csched2_switch_sched()
> 3. Setting the per_cpu schedule lock as the very last thing in
> csched2_switch_sched()
> 4. Releasing the (old) pcpu schedule lock in schedule_cpu_switch().
> 
> What do you think?
> 
I think it should work. We'll be doing the scheduler lock manipulation
protected by the old, and wrong (as it belongs to another scheduler),
per-cpu/runq lock, and by the correct global private lock. It would look
like the ordering between the two locks is the wrong one for Credit2,
but it is not, because the per-runq lock involved is the other
scheduler's one.

Tricky, but everything is in here! :-/

I'm not sure whether it's better or worse wrt what I have. I'm trying
to code it up, to see how it looks (and if it allows me to get rid
of the ugliness in sched_switch()).

> That would allow us to read ppriv_old and vpriv_old with the
> schedule_lock held.
> 
Also good. :-)

> Unfortunately I can't off the top of my head think of a good
> assertion
> to put in at #2 to assert that the per-pcpu lock is *not* one of
> runqueue locks in prv, because we don't yet know which runqueue this
> cpu
> will be assigned to.  But we could check when we actually do the lock
> assignment to make sure that it's not already equal.  That way we'll
> either deadlock or ASSERT (which is not as good as always ASSERTing,
> but
> is better than either deadlocking or working fine).
> 
Yep, I don't think we can do much better than that.

> As an aside -- it seems to me that as soon as we change the scheduler
> lock, there's a risk that something else may come along and try to
> grab
> it / access the data.  Does that mean we really ought to use memory
> barriers to make sure that the lock is written only after all changes
> to
> the scheduler data have been appropriately made?
> 
Yes, if it were only for this code, I think you're right, barriers
would be necessary. I once again think this is actually safe, because
it's serialized elsewhere, but thinking more about it, I can well add
both barriers (and a comment).

> > 
> > This also means that, in Credit2 and RTDS, we can get rid
> > of the code that was doing the scheduler lock remapping
> > in csched2_free_pdata() and rt_free_pdata(), and of their
> > triggering ASSERT-s.
> Right -- so to put it a different way, *all* schedulers must now set
> the
> locking scheme they wish to use, even if they want to use the default
> per-cpu locks.  
>
Exactly.

> I think that means we have to do that for arinc653 too,
> right?
> 
Mmm... right, I'll have a look at that.

Thanks for looking at this, and sorry for the late reply.

Regards,
Dario
Dario Faggioli April 6, 2016, 3:51 p.m. UTC | #6
On Tue, 2016-04-05 at 18:26 +0200, Dario Faggioli wrote:
> On Wed, 2016-03-23 at 19:05 +0000, George Dunlap wrote:
> > On 18/03/16 19:05, Dario Faggioli wrote:
> > > This also means that, in Credit2 and RTDS, we can get rid
> > > of the code that was doing the scheduler lock remapping
> > > in csched2_free_pdata() and rt_free_pdata(), and of their
> > > triggering ASSERT-s.
> > > 
> > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > Similar to my comment before -- in my own tree I squashed patches
> > 6-9
> > into a single commit and found it much easier to review. :-)
> > 
> I understand your point.
> 
> I'll consider doing something like this for v2 (that I'm just
> finishing
> putting together), but I'm not sure I like it.
> 
BTW, I changed my mind and, in the patch series I'm about to send, I
did as you suggested and squashed these patches into one. :-)

The changelog is indeed rather long, but still fine, IMO, and there's a
lot less duplication, both in `git log' and in code (if one looks at it
'patch after patch', rather than just the final result).

Thanks and Regards,
Dario
Dario Faggioli April 6, 2016, 4:21 p.m. UTC | #7
On Tue, 2016-04-05 at 19:37 +0200, Dario Faggioli wrote:
> On Thu, 2016-03-24 at 12:14 +0000, George Dunlap wrote:
> > On 18/03/16 19:05, Dario Faggioli wrote:
> > So I think there should be no problem with:
> > 1. Grabbing the pcpu schedule lock in schedule_cpu_switch()
> > 2. Grabbing prv->lock in csched2_switch_sched()
> > 3. Setting the per_cpu schedule lock as the very last thing in
> > csched2_switch_sched()
> > 4. Releasing the (old) pcpu schedule lock in schedule_cpu_switch().
> > 
> > What do you think?
> > 
> I think it should work. We'll be doing the scheduler lock
> manipulation
> protected by the old and wrong (as it's the one of another scheduler)
> per-cpu/runq lock, and the correct global private lock. It would look
> like the ordering between the two locks is the wrong one, in Credit2,
> but it's not because of the fact that the per-runq lock is the other
> scheduler's one.
> 
> Tricky, but everything is in here! :-/
> 
I've done it as you suggest above.

The new .switch_sched hook is still there, and still does look the
same. But I do indeed like the final look of the code better, and it
appears to be working ok.

Have a look. ;-)

> > As an aside -- it seems to me that as soon as we change the
> > scheduler
> > lock, there's a risk that something else may come along and try to
> > grab
> > it / access the data.  Does that mean we really ought to use memory
> > barriers to make sure that the lock is written only after all
> > changes
> > to
> > the scheduler data have been appropriately made?
> > 
> Yes, if it were only for this code, I think you're right, barriers
> would be necessary. I once again think this is actually safe, because
> it's serialized elsewhere, but thinking more about it, I can well add
> both barriers (and a comment).
> 
And I've added smp_mb()-s too.

> > > This also means that, in Credit2 and RTDS, we can get rid
> > > of the code that was doing the scheduler lock remapping
> > > in csched2_free_pdata() and rt_free_pdata(), and of their
> > > triggering ASSERT-s.
> > Right -- so to put it a different way, *all* schedulers must now
> > set
> > the
> > locking scheme they wish to use, even if they want to use the
> > default
> > per-cpu locks.  
> > 
> Exactly.
> 
> > 
> > I think that means we have to do that for arinc653 too,
> > right?
> > 
> Mmm... right, I'll have a look at that.
> 
And, finally, I did have a look at this too, and I actually don't think
ARINC needs any of this.

In fact, ARINC brings the idea of "doing its own locking" much further
than the other schedulers we have. It has its own lock and uses it
in such a way that it doesn't even care what
{v,p}cpu_schedule_lock() and friends point to.

As an example, check a653sched_do_schedule(). It's called from
schedule(), right after taking the runqueue lock,
with pcpu_schedule_lock_irq(), and yet it does this:

 spin_lock_irqsave(&sched_priv->lock, flags);

So, I'd actually better _not_ add anything to this series that re-maps
sd->schedule_lock to point to its sched_priv->lock, or we'd deadlock!
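
In fact, with such a remapping in place, the path would become something
like this (just sketching it):

 schedule()
   pcpu_schedule_lock_irq(cpu);                   /* now takes sched_priv->lock */
   ...
   a653sched_do_schedule()
     spin_lock_irqsave(&sched_priv->lock, flags); /* same lock again: deadlock */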

I'm not sure the design behind all this is the best possible one, but
that's a different issue, to be dealt with in another series at
another time. :-)

In any case, I've added Robert and Josh.

Regards,
Dario

Patch

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 929ba9c..903a704 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -577,6 +577,15 @@  csched_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
 {
     unsigned long flags;
     struct csched_private *prv = CSCHED_PRIV(ops);
+    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
+
+    /*
+     * This is called either during boot, resume or hotplug, in
+     * case Credit1 is the scheduler chosen at boot. In such cases, the
+     * scheduler lock for cpu is already pointing to the default per-cpu
+     * spinlock, as Credit1 needs it, so there is no remapping to be done.
+     */
+    ASSERT(sd->schedule_lock == &sd->_lock && !spin_is_locked(&sd->_lock));
 
     spin_lock_irqsave(&prv->lock, flags);
     init_pdata(prv, pdata, cpu);
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 25d8e85..64fb028 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -1974,7 +1974,6 @@  init_pdata(struct csched2_private *prv, unsigned int cpu)
 {
     unsigned rqi;
     struct csched2_runqueue_data *rqd;
-    spinlock_t *old_lock;
 
     ASSERT(spin_is_locked(&prv->lock));
     ASSERT(!cpumask_test_cpu(cpu, &prv->initialized));
@@ -2005,21 +2004,11 @@  init_pdata(struct csched2_private *prv, unsigned int cpu)
         activate_runqueue(prv, rqi);
     }
     
-    /* IRQs already disabled */
-    old_lock = pcpu_schedule_lock(cpu);
-
-    /* Move spinlock to new runq lock.  */
-    per_cpu(schedule_data, cpu).schedule_lock = &rqd->lock;
-
     /* Set the runqueue map */
     prv->runq_map[cpu] = rqi;
     
     cpumask_set_cpu(cpu, &rqd->idle);
     cpumask_set_cpu(cpu, &rqd->active);
-
-    /* _Not_ pcpu_schedule_unlock(): per_cpu().schedule_lock changed! */
-    spin_unlock(old_lock);
-
     cpumask_set_cpu(cpu, &prv->initialized);
 
     return rqi;
@@ -2029,10 +2018,19 @@  static void
 csched2_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
 {
     struct csched2_private *prv = CSCHED2_PRIV(ops);
+    spinlock_t *old_lock;
     unsigned long flags;
+    unsigned rqi;
 
     spin_lock_irqsave(&prv->lock, flags);
-    init_pdata(prv, cpu);
+    old_lock = pcpu_schedule_lock(cpu);
+
+    rqi = init_pdata(prv, cpu);
+    /* Move the scheduler lock to the new runq lock. */
+    per_cpu(schedule_data, cpu).schedule_lock = &prv->rqd[rqi].lock;
+
+    /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
+    spin_unlock(old_lock);
     spin_unlock_irqrestore(&prv->lock, flags);
 }
 
@@ -2079,7 +2077,6 @@  csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
     unsigned long flags;
     struct csched2_private *prv = CSCHED2_PRIV(ops);
     struct csched2_runqueue_data *rqd;
-    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
     int rqi;
 
     spin_lock_irqsave(&prv->lock, flags);
@@ -2107,11 +2104,6 @@  csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
         deactivate_runqueue(prv, rqi);
     }
 
-    /* Move spinlock to the original lock.  */
-    ASSERT(sd->schedule_lock == &rqd->lock);
-    ASSERT(!spin_is_locked(&sd->_lock));
-    sd->schedule_lock = &sd->_lock;
-
     spin_unlock(&rqd->lock);
 
     cpumask_clear_cpu(cpu, &prv->initialized);
diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
index 92be248..0564b1d 100644
--- a/xen/common/sched_rt.c
+++ b/xen/common/sched_rt.c
@@ -718,19 +718,6 @@  rt_alloc_pdata(const struct scheduler *ops, int cpu)
 static void
 rt_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 {
-    struct rt_private *prv = rt_priv(ops);
-    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
-    unsigned long flags;
-
-    spin_lock_irqsave(&prv->lock, flags);
-
-    /* Move spinlock back to the default lock */
-    ASSERT(sd->schedule_lock == &prv->lock);
-    ASSERT(!spin_is_locked(&sd->_lock));
-    sd->schedule_lock = &sd->_lock;
-
-    spin_unlock_irqrestore(&prv->lock, flags);
-
     free_cpumask_var(_cpumask_scratch[cpu]);
 }
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 1adc0e2..29582a6 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -1617,7 +1617,6 @@  void __init scheduler_init(void)
 int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
 {
     struct vcpu *idle;
-    spinlock_t *lock;
     void *ppriv, *ppriv_old, *vpriv, *vpriv_old;
     struct scheduler *old_ops = per_cpu(scheduler, cpu);
     struct scheduler *new_ops = (c == NULL) ? &ops : c->sched;
@@ -1640,11 +1639,21 @@  int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
     if ( old_ops == new_ops )
         goto out;
 
+    /*
+     * To setup the cpu for the new scheduler we need:
+     *  - a valid instance of per-CPU scheduler specific data, as it is
+     *    allocated by SCHED_OP(alloc_pdata). Note that we do not want to
+     *    initialize it yet (i.e., we are not calling SCHED_OP(init_pdata)).
+     *    That will be done by the target scheduler, in SCHED_OP(switch_sched),
+     *    in proper ordering and with locking.
+     *  - a valid instance of per-vCPU scheduler specific data, for the idle
+     *    vCPU of cpu. That is what the target scheduler will use for the
+     *    sched_priv field of the per-vCPU info of the idle domain.
+     */
     idle = idle_vcpu[cpu];
     ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);
     if ( IS_ERR(ppriv) )
         return PTR_ERR(ppriv);
-    SCHED_OP(new_ops, init_pdata, ppriv, cpu);
     vpriv = SCHED_OP(new_ops, alloc_vdata, idle, idle->domain->sched_priv);
     if ( vpriv == NULL )
     {
@@ -1652,17 +1661,20 @@  int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
         return -ENOMEM;
     }
 
-    lock = pcpu_schedule_lock_irq(cpu);
-
     SCHED_OP(old_ops, tick_suspend, cpu);
+
+    /*
+     * The actual switch, including (if necessary) the rerouting of the
+     * scheduler lock to whatever new_ops prefers, needs to happen in one
+     * critical section, protected by old_ops' lock, or races are possible.
+     * Since each scheduler has its own constraints and locking scheme, do
+     * that inside specific scheduler code, rather than here.
+     */
     vpriv_old = idle->sched_priv;
-    idle->sched_priv = vpriv;
-    per_cpu(scheduler, cpu) = new_ops;
     ppriv_old = per_cpu(schedule_data, cpu).sched_priv;
-    per_cpu(schedule_data, cpu).sched_priv = ppriv;
-    SCHED_OP(new_ops, tick_resume, cpu);
+    SCHED_OP(new_ops, switch_sched, cpu, ppriv, vpriv);
 
-    pcpu_schedule_unlock_irq(lock, cpu);
+    SCHED_OP(new_ops, tick_resume, cpu);
 
     SCHED_OP(old_ops, free_vdata, vpriv_old);
     SCHED_OP(old_ops, free_pdata, ppriv_old, cpu);