
[v3,04/11] xen: sched: close potential races when switching scheduler to CPUs

Message ID 20160408012345.10762.40393.stgit@Solace.fritz.box (mailing list archive)
State New, archived

Commit Message

Dario Faggioli April 8, 2016, 1:23 a.m. UTC
In short, the point is making sure that the actual switch
of scheduler and the remapping of the scheduler's runqueue
lock occur in the same critical section, protected by the
"old" scheduler's lock (and not, e.g., in the free_pdata
hook, as it is now for Credit2 and RTDS).

Not doing so is (at least) racy. In fact, for instance,
if we switch cpu X from Credit2 to Credit, we do:

 schedule_cpu_switch(x, csched2 --> csched):
   //scheduler[x] is csched2
   //schedule_lock[x] is csched2_lock
   csched_alloc_pdata(x)
   csched_init_pdata(x)
   pcpu_schedule_lock(x) ----> takes csched2_lock
   scheduler[X] = csched
   pcpu_schedule_unlock(x) --> unlocks csched2_lock
   [1]
   csched2_free_pdata(x)
     pcpu_schedule_lock(x) --> takes csched2_lock
     schedule_lock[x] = csched_lock
     spin_unlock(csched2_lock)

While, if we switch cpu X from Credit to Credit2, we do:

 schedule_cpu_switch(X, csched --> csched2):
   //scheduler[x] is csched
   //schedule_lock[x] is csched_lock
   csched2_alloc_pdata(x)
   csched2_init_pdata(x)
     pcpu_schedule_lock(x) --> takes csched_lock
     schedule_lock[x] = csched2_lock
     spin_unlock(csched_lock)
   [2]
   pcpu_schedule_lock(x) ----> takes csched2_lock
   scheduler[X] = csched2
   pcpu_schedule_unlock(x) --> unlocks csched2_lock
   csched_free_pdata(x)

And if we switch cpu X from RTDS to Credit2, we do:

 schedule_cpu_switch(X, RTDS --> csched2):
   //scheduler[x] is rtds
   //schedule_lock[x] is rtds_lock
   csched2_alloc_pdata(x)
   csched2_init_pdata(x)
     pcpu_schedule_lock(x) --> takes rtds_lock
     schedule_lock[x] = csched2_lock
     spin_unlock(rtds_lock)
   pcpu_schedule_lock(x) ----> takes csched2_lock
   scheduler[x] = csched2
   pcpu_schedule_unlock(x) --> unlocks csched2_lock
   rtds_free_pdata(x)
     spin_lock(rtds_lock)
     ASSERT(schedule_lock[x] == rtds_lock) [3]
     schedule_lock[x] = DEFAULT_SCHEDULE_LOCK [4]
     spin_unlock(rtds_lock)

So, the first problem is that, if anything related to
scheduling, and involving the CPU, happens at [1] or [2], we:
 - take csched2_lock,
 - operate on Credit1 functions and data structures,
which is no good!
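
Concretely, think of the scheduler softirq firing on x inside
window [1]; sketching (with details elided) what schedule(), in
xen/common/schedule.c, would do:

  schedule():                        //on cpu x
    lock = pcpu_schedule_lock(x) ----> takes csched2_lock
    sched = scheduler[x]               //csched, i.e., Credit1
    sched->do_schedule(sched, ...)     //Credit1 code and data, but
                                       //serialized by csched2_lock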

The second problem is that the ASSERT at [3] triggers, and
the third is that, at [4], we screw up the lock remapping we've
done for ourselves in csched2_init_pdata()!

The first problem arises because there is a window during
which the lock is already the new one, but the scheduler is
still the old one. The other two arise because we let schedulers
mess with the lock (re)mapping done by others.
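
As a side note on why remapping schedule_lock on the fly can be
safe at all: the generic pcpu_schedule_lock*() helpers use a
lock-and-recheck idiom, so anyone spinning on the old lock will,
once they finally get it, notice that the mapping has changed and
retry on the new lock. A simplified sketch of that idiom (modelled
on the sched_lock() helpers in xen/include/xen/sched-if.h; the
real ones also come in _irq/_irqsave flavours) is:

 static inline spinlock_t *pcpu_schedule_lock(unsigned int cpu)
 {
     for ( ; ; )
     {
         /* Sample the (possibly stale) mapping and take that lock. */
         spinlock_t *lock = per_cpu(schedule_data, cpu).schedule_lock;

         spin_lock(lock);

         /* Nobody remapped the lock while we were spinning: done. */
         if ( likely(lock == per_cpu(schedule_data, cpu).schedule_lock) )
             return lock;

         /* The lock was remapped under our feet: drop it and retry. */
         spin_unlock(lock);
     }
 }

This is what the comment added to schedule_cpu_switch() below
relies upon.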

This patch, therefore, introduces a new hook in the scheduler
interface, called switch_sched, meant to be used when
switching the scheduler on a CPU, and implements it for the
various schedulers, so that things are done in the proper
order and under the protection of the best suited (set of)
lock(s). The hook is necessary (as compared to keeping
things in generic code) because different schedulers may
have different locking schemes.
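
All the implementations of the hook added below follow the same
basic shape; here is a condensed sketch (sched_FOO_switch_sched is
a placeholder name; which private lock is taken, whether there is
any pdata to initialize, and where the scheduler lock ends up
pointing all differ from scheduler to scheduler, see the actual
hunks):

 static void
 sched_FOO_switch_sched(struct scheduler *new_ops, unsigned int cpu,
                        void *pdata, void *vdata)
 {
     struct schedule_data *sd = &per_cpu(schedule_data, cpu);

     /* Hand the already allocated idle vCPU data to the new scheduler. */
     idle_vcpu[cpu]->sched_priv = vdata;

     /*
      * Initialize the per-CPU data here (i.e., what the init_pdata
      * hook would otherwise have done), under the scheduler's own
      * private lock if it has one. The old scheduler's runqueue
      * lock is already held, taken in schedule_cpu_switch().
      */

     per_cpu(scheduler, cpu) = new_ops;
     sd->sched_priv = pdata;  /* or NULL, for schedulers without pdata */

     /*
      * Reroute the scheduler lock as the very last thing, with a
      * barrier in between, so that whoever manages to take the new
      * lock finds all of the above already in place.
      */
     smp_mb();
     sd->schedule_lock = &sd->_lock;  /* or the new runqueue's lock */
 }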

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Josh Whitehead <josh.whitehead@dornerworks.com>
Cc: Robert VanVossen <robert.vanvossen@dornerworks.com>
Cc: Meng Xu <mengxu@cis.upenn.edu>
Cc: Tianyang Chen <tiche@seas.upenn.edu>
---
Changes from v2:
 - the hook is implemented in ARINC653 too, as necessary and
   as requested during review.

Changes from v1:

New patch, basically coming from squashing what were
4 patches in v1. In any case, with respect to those 4
patches:
 - the runqueue lock is back to being taken in schedule_cpu_switch(),
   as suggested during review;
 - add barriers for making sure all initialization is done
   when the new lock is assigned, as suggested during review;
 - add comments and ASSERT-s about how and why the adopted
   locking scheme is safe, as suggested during review.
---
 xen/common/sched_arinc653.c |   34 ++++++++++++++++++
 xen/common/sched_credit.c   |   44 +++++++++++++++++++++++
 xen/common/sched_credit2.c  |   81 ++++++++++++++++++++++++++++++++-----------
 xen/common/sched_rt.c       |   45 +++++++++++++++++-------
 xen/common/schedule.c       |   41 +++++++++++++++++-----
 xen/include/xen/sched-if.h  |    3 ++
 6 files changed, 206 insertions(+), 42 deletions(-)

Comments

George Dunlap April 8, 2016, 12:52 p.m. UTC | #1
On 08/04/16 02:23, Dario Faggioli wrote:
> In short, the point is making sure that the actual switch
> of scheduler and the remapping of the scheduler's runqueue
> lock occur in the same critical section, protected by the
> "old" scheduler's lock (and not, e.g., in the free_pdata
> hook, as it is now for Credit2 and RTDS).
> 
> Not doing so is (at least) racy. In fact, for instance,
> if we switch cpu X from Credit2 to Credit, we do:
> 
>  schedule_cpu_switch(x, csched2 --> csched):
>    //scheduler[x] is csched2
>    //schedule_lock[x] is csched2_lock
>    csched_alloc_pdata(x)
>    csched_init_pdata(x)
>    pcpu_schedule_lock(x) ----> takes csched2_lock
>    scheduler[X] = csched
>    pcpu_schedule_unlock(x) --> unlocks csched2_lock
>    [1]
>    csched2_free_pdata(x)
>      pcpu_schedule_lock(x) --> takes csched2_lock
>      schedule_lock[x] = csched_lock
>      spin_unlock(csched2_lock)
> 
> While, if we switch cpu X from Credit to Credit2, we do:
> 
>  schedule_cpu_switch(X, csched --> csched2):
>    //scheduler[x] is csched
>    //schedule_lock[x] is csched_lock
>    csched2_alloc_pdata(x)
>    csched2_init_pdata(x)
>      pcpu_schedule_lock(x) --> takes csched_lock
>      schedule_lock[x] = csched2_lock
>      spin_unlock(csched_lock)
>    [2]
>    pcpu_schedule_lock(x) ----> takes csched2_lock
>    scheduler[X] = csched2
>    pcpu_schedule_unlock(x) --> unlocks csched2_lock
>    csched_free_pdata(x)
> 
> And if we switch cpu X from RTDS to Credit2, we do:
> 
>  schedule_cpu_switch(X, RTDS --> csched2):
>    //scheduler[x] is rtds
>    //schedule_lock[x] is rtds_lock
>    csched2_alloc_pdata(x)
>    csched2_init_pdata(x)
>      pcpu_schedule_lock(x) --> takes rtds_lock
>      schedule_lock[x] = csched2_lock
>      spin_unlock(rtds_lock)
>    pcpu_schedule_lock(x) ----> takes csched2_lock
>    scheduler[x] = csched2
>    pcpu_schedule_unlock(x) --> unlocks csched2_lock
>    rtds_free_pdata(x)
>      spin_lock(rtds_lock)
>      ASSERT(schedule_lock[x] == rtds_lock) [3]
>      schedule_lock[x] = DEFAULT_SCHEDULE_LOCK [4]
>      spin_unlock(rtds_lock)
> 
> So, the first problem is that, if anything related to
> scheduling, and involving the CPU, happens at [1] or [2], we:
>  - take csched2_lock,
>  - operate on Credit1 functions and data structures,
> which is no good!
> 
> The second problem is that the ASSERT at [3] triggers, and
> the third is that, at [4], we screw up the lock remapping we've
> done for ourselves in csched2_init_pdata()!
> 
> The first problem arises because there is a window during
> which the lock is already the new one, but the scheduler is
> still the old one. The other two arise because we let schedulers
> mess with the lock (re)mapping done by others.
> 
> This patch, therefore, introduces a new hook in the scheduler
> interface, called switch_sched, meant to be used when
> switching the scheduler on a CPU, and implements it for the
> various schedulers, so that things are done in the proper
> order and under the protection of the best suited (set of)
> lock(s). The hook is necessary (as compared to keeping
> things in generic code) because different schedulers may
> have different locking schemes.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Thanks!

Reviewed-by: George Dunlap <george.dunlap@citrix.com>


George Dunlap April 8, 2016, 1 p.m. UTC | #2
On 08/04/16 13:52, George Dunlap wrote:
> On 08/04/16 02:23, Dario Faggioli wrote:
>> In short, the point is making sure that the actual switch
>> of scheduler and the remapping of the scheduler's runqueue
>> lock occur in the same critical section, protected by the
>> "old" scheduler's lock (and not, e.g., in the free_pdata
>> hook, as it is now for Credit2 and RTDS).
>>
>> [...]
>>
>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> Thanks!
> 
> Reviewed-by: George Dunlap <george.dunlap@citrix.com>

Committers:

Hopefully the arinc653 maintainers will get an opportunity to take a
look at this before the hard freeze today.  But if not, given the
timing, and the fact that the patch is really more to do with the
interface to the scheduling system as a whole rather than internal
algorithms of the arinc scheduler, I think it's probably OK to take the
liberty of checking it in even without an Ack (as long as there's no
Nack).  We can always revert / amend it later if there are objections.

 -George
Dario Faggioli April 8, 2016, 1:11 p.m. UTC | #3
On Fri, 2016-04-08 at 14:00 +0100, George Dunlap wrote:
> On 08/04/16 13:52, George Dunlap wrote:
> > On 08/04/16 02:23, Dario Faggioli wrote:
> > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > Thanks!
> > 
> > Reviewed-by: George Dunlap <george.dunlap@citrix.com>
> Committers:
> 
> Hopefully the arinc653 maintainers will get an opportunity to take a
> look at this before the hard freeze today.  But if not, given the
> timing, and the fact that the patch is really more to do with the
> interface to the scheduling system as a whole rather than internal
> algorithms of the arinc scheduler, I think it's probably OK to take
> the
> liberty of checking it in even without an Ack (as long as there's no
> Nack).  We can always revert / amend it later if there are
> objections.
> 
Thanks George,

FWIW, I do agree, and I'm up for fixing any issue that could be raised,
or any bug that could surface in ARINC code because of this, super
quickly, during the rc period.

Now, a much easier but I guess technically relevant question: assuming
that you (George) also like and Ack patch 8, should I (and this is for
committers) resend the series, or can you guys fetch it from the git
branch and/or list (with the correct version of patch 8 being the one
attached to the reply to Juergen's further comments)?

Thanks again and Regards,
Dario
Robert VanVossen April 8, 2016, 2 p.m. UTC | #4
On 4/8/2016 9:11 AM, Dario Faggioli wrote:
> On Fri, 2016-04-08 at 14:00 +0100, George Dunlap wrote:
>> On 08/04/16 13:52, George Dunlap wrote:
>>> On 08/04/16 02:23, Dario Faggioli wrote:
>>>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>>> Thanks!
>>>
>>> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
>> Committers:
>>
>> Hopefully the arinc653 maintainers will get an opportunity to take a
>> look at this before the hard freeze today.  But if not, given the
>> timing, and the fact that the patch is really more to do with the
>> interface to the scheduling system as a whole rather than internal
>> algorithms of the arinc scheduler, I think it's probably OK to take
>> the
>> liberty of checking it in even without an Ack (as long as there's no
>> Nack).  We can always revert / amend it later if there are
>> objections.

Acked-by: Robert VanVossen <robert.vanvossen@dornerworks.com>

Luckily, I had some time today.

Thanks,
Robbie VanVossen
Konrad Rzeszutek Wilk April 11, 2016, 2:43 p.m. UTC | #5
On Fri, Apr 08, 2016 at 03:11:23PM +0200, Dario Faggioli wrote:
> On Fri, 2016-04-08 at 14:00 +0100, George Dunlap wrote:
> > On 08/04/16 13:52, George Dunlap wrote:
> > > On 08/04/16 02:23, Dario Faggioli wrote:
> > > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > > Thanks!
> > > 
> > > Reviewed-by: George Dunlap <george.dunlap@citrix.com>
> > Committers:
> > 
> > Hopefully the arinc653 maintainers will get an opportunity to take a
> > look at this before the hard freeze today.  But if not, given the
> > timing, and the fact that the patch is really more to do with the
> > interface to the scheduling system as a whole rather than internal
> > algorithms of the arinc scheduler, I think it's probably OK to take
> > the
> > liberty of checking it in even without an Ack (as long as there's no
> > Nack).  We can always revert / amend it later if there are
> > objections.
> > 
> Thanks George,
> 
> FWIW, I do agree, and I'm up for fixing any issue that could be raised,
> or any bug that could surface in ARINC code because of this, super
> quickly, during the rc period.
> 
> Now, a much easier but I guess technically relevant question: assuming
> that you (George) also like and Ack patch 8, should I (and this is for
> committers) resend the series, or can you guys fetch it from the git
> branch and/or list (with the correct version of patch 8 being the one
> attached to the reply to Juergen's further comments)?

It is easier (at least for me) if I do less work.

That means if the patches have all the Acked-by and such without me having
to do it - then that is far easier.

And it is far easier (probably?) for you to create a branch called 'for-staging'
which has the patches that have been acked and are stripped of the ---.

So I can just do 'git pull' and all of them are in.

But other committers may prefer doing it differently (via git am or such)
in which case resending the series may be simpler?

But then today it looks like it's just me and Ian doing commits - so
if you can get Wei's OK, just poke me on IRC with the branch name
and I will slurp it up.

Patch

diff --git a/xen/common/sched_arinc653.c b/xen/common/sched_arinc653.c
index b79fcdf..ebd2090 100644
--- a/xen/common/sched_arinc653.c
+++ b/xen/common/sched_arinc653.c
@@ -652,6 +652,38 @@  a653sched_pick_cpu(const struct scheduler *ops, struct vcpu *vc)
 }
 
 /**
+ * Xen scheduler callback to change the scheduler of a cpu
+ *
+ * @param new_ops   Pointer to this instance of the scheduler structure
+ * @param cpu       The cpu that is changing scheduler
+ * @param pdata     scheduler specific PCPU data (we don't have any)
+ * @param vdata     scheduler specific VCPU data of the idle vcpu
+ */
+static void
+a653_switch_sched(struct scheduler *new_ops, unsigned int cpu,
+                  void *pdata, void *vdata)
+{
+    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
+    arinc653_vcpu_t *svc = vdata;
+
+    ASSERT(!pdata && svc && is_idle_vcpu(svc->vc));
+
+    idle_vcpu[cpu]->sched_priv = vdata;
+
+    per_cpu(scheduler, cpu) = new_ops;
+    per_cpu(schedule_data, cpu).sched_priv = NULL; /* no pdata */
+
+    /*
+     * (Re?)route the lock to its default location. We actually do not use
+     * it, but if we leave it pointing to where it does now (i.e., the
+     * runqueue lock for this PCPU in the default scheduler), we'd be
+     * causing unnecessary contention on that lock (in cases where it is
+     * shared among multiple PCPUs, like in Credit2 and RTDS).
+     */
+    sd->schedule_lock = &sd->_lock;
+}
+
+/**
  * Xen scheduler callback function to perform a global (not domain-specific)
  * adjustment. It is used by the ARINC 653 scheduler to put in place a new
  * ARINC 653 schedule or to retrieve the schedule currently in place.
@@ -727,6 +759,8 @@  static const struct scheduler sched_arinc653_def = {
 
     .pick_cpu       = a653sched_pick_cpu,
 
+    .switch_sched   = a653_switch_sched,
+
     .adjust         = NULL,
     .adjust_global  = a653sched_adjust_global,
 
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 96a245d..490f10b 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -578,12 +578,55 @@  csched_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
 {
     unsigned long flags;
     struct csched_private *prv = CSCHED_PRIV(ops);
+    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
+
+    /*
+     * This is called either during boot, resume or hotplug, in
+     * case Credit1 is the scheduler chosen at boot. In such cases, the
+     * scheduler lock for cpu is already pointing to the default per-cpu
+     * spinlock, as Credit1 needs it, so there is no remapping to be done.
+     */
+    ASSERT(sd->schedule_lock == &sd->_lock && !spin_is_locked(&sd->_lock));
 
     spin_lock_irqsave(&prv->lock, flags);
     init_pdata(prv, pdata, cpu);
     spin_unlock_irqrestore(&prv->lock, flags);
 }
 
+/* Change the scheduler of cpu to us (Credit). */
+static void
+csched_switch_sched(struct scheduler *new_ops, unsigned int cpu,
+                    void *pdata, void *vdata)
+{
+    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
+    struct csched_private *prv = CSCHED_PRIV(new_ops);
+    struct csched_vcpu *svc = vdata;
+
+    ASSERT(svc && is_idle_vcpu(svc->vcpu));
+
+    idle_vcpu[cpu]->sched_priv = vdata;
+
+    /*
+     * We are holding the runqueue lock already (it's been taken in
+     * schedule_cpu_switch()). It actually may or may not be the 'right'
+     * one for this cpu, but that is ok for preventing races.
+     */
+    spin_lock(&prv->lock);
+    init_pdata(prv, pdata, cpu);
+    spin_unlock(&prv->lock);
+
+    per_cpu(scheduler, cpu) = new_ops;
+    per_cpu(schedule_data, cpu).sched_priv = pdata;
+
+    /*
+     * (Re?)route the lock to the per pCPU lock as /last/ thing. In fact,
+     * if it is free (and it can be) we want that anyone that manages
+     * taking it, finds all the initializations we've done above in place.
+     */
+    smp_mb();
+    sd->schedule_lock = &sd->_lock;
+}
+
 #ifndef NDEBUG
 static inline void
 __csched_vcpu_check(struct vcpu *vc)
@@ -2067,6 +2110,7 @@  static const struct scheduler sched_credit_def = {
     .alloc_pdata    = csched_alloc_pdata,
     .init_pdata     = csched_init_pdata,
     .free_pdata     = csched_free_pdata,
+    .switch_sched   = csched_switch_sched,
     .alloc_domdata  = csched_alloc_domdata,
     .free_domdata   = csched_free_domdata,
 
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 8989eea..60c6f5b 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -1971,12 +1971,12 @@  static void deactivate_runqueue(struct csched2_private *prv, int rqi)
     cpumask_clear_cpu(rqi, &prv->active_queues);
 }
 
-static void
+/* Returns the ID of the runqueue the cpu is assigned to. */
+static unsigned
 init_pdata(struct csched2_private *prv, unsigned int cpu)
 {
     unsigned rqi;
     struct csched2_runqueue_data *rqd;
-    spinlock_t *old_lock;
 
     ASSERT(spin_is_locked(&prv->lock));
     ASSERT(!cpumask_test_cpu(cpu, &prv->initialized));
@@ -2007,44 +2007,89 @@  init_pdata(struct csched2_private *prv, unsigned int cpu)
         activate_runqueue(prv, rqi);
     }
     
-    /* IRQs already disabled */
-    old_lock = pcpu_schedule_lock(cpu);
-
-    /* Move spinlock to new runq lock.  */
-    per_cpu(schedule_data, cpu).schedule_lock = &rqd->lock;
-
     /* Set the runqueue map */
     prv->runq_map[cpu] = rqi;
     
     cpumask_set_cpu(cpu, &rqd->idle);
     cpumask_set_cpu(cpu, &rqd->active);
-
-    /* _Not_ pcpu_schedule_unlock(): per_cpu().schedule_lock changed! */
-    spin_unlock(old_lock);
-
     cpumask_set_cpu(cpu, &prv->initialized);
 
-    return;
+    return rqi;
 }
 
 static void
 csched2_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
 {
     struct csched2_private *prv = CSCHED2_PRIV(ops);
+    spinlock_t *old_lock;
     unsigned long flags;
+    unsigned rqi;
 
     spin_lock_irqsave(&prv->lock, flags);
-    init_pdata(prv, cpu);
+    old_lock = pcpu_schedule_lock(cpu);
+
+    rqi = init_pdata(prv, cpu);
+    /* Move the scheduler lock to the new runq lock. */
+    per_cpu(schedule_data, cpu).schedule_lock = &prv->rqd[rqi].lock;
+
+    /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
+    spin_unlock(old_lock);
     spin_unlock_irqrestore(&prv->lock, flags);
 }
 
+/* Change the scheduler of cpu to us (Credit2). */
+static void
+csched2_switch_sched(struct scheduler *new_ops, unsigned int cpu,
+                     void *pdata, void *vdata)
+{
+    struct csched2_private *prv = CSCHED2_PRIV(new_ops);
+    struct csched2_vcpu *svc = vdata;
+    unsigned rqi;
+
+    ASSERT(!pdata && svc && is_idle_vcpu(svc->vcpu));
+
+    /*
+     * We own one runqueue lock already (from schedule_cpu_switch()). This
+     * looks like it violates this scheduler's locking rules, but it does
+     * not, as what we own is the lock of another scheduler, that hence has
+     * no particular (ordering) relationship with our private global lock.
+     * And owning exactly that one (the lock of the old scheduler of this
+     * cpu) is what is necessary to prevent races.
+     */
+    spin_lock_irq(&prv->lock);
+
+    idle_vcpu[cpu]->sched_priv = vdata;
+
+    rqi = init_pdata(prv, cpu);
+
+    /*
+     * Now that we know what runqueue we'll go in, double check what's said
+     * above: the lock we already hold is not the one of this runqueue of
+     * this scheduler, and so it's safe to have taken it /before/ our
+     * private global lock.
+     */
+    ASSERT(per_cpu(schedule_data, cpu).schedule_lock != &prv->rqd[rqi].lock);
+
+    per_cpu(scheduler, cpu) = new_ops;
+    per_cpu(schedule_data, cpu).sched_priv = NULL; /* no pdata */
+
+    /*
+     * (Re?)route the lock to the per pCPU lock as /last/ thing. In fact,
+     * if it is free (and it can be) we want that anyone that manages
+     * taking it, find all the initializations we've done above in place.
+     */
+    smp_mb();
+    per_cpu(schedule_data, cpu).schedule_lock = &prv->rqd[rqi].lock;
+
+    spin_unlock_irq(&prv->lock);
+}
+
 static void
 csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 {
     unsigned long flags;
     struct csched2_private *prv = CSCHED2_PRIV(ops);
     struct csched2_runqueue_data *rqd;
-    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
     int rqi;
 
     spin_lock_irqsave(&prv->lock, flags);
@@ -2072,11 +2117,6 @@  csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
         deactivate_runqueue(prv, rqi);
     }
 
-    /* Move spinlock to the original lock.  */
-    ASSERT(sd->schedule_lock == &rqd->lock);
-    ASSERT(!spin_is_locked(&sd->_lock));
-    sd->schedule_lock = &sd->_lock;
-
     spin_unlock(&rqd->lock);
 
     cpumask_clear_cpu(cpu, &prv->initialized);
@@ -2170,6 +2210,7 @@  static const struct scheduler sched_credit2_def = {
     .free_vdata     = csched2_free_vdata,
     .init_pdata     = csched2_init_pdata,
     .free_pdata     = csched2_free_pdata,
+    .switch_sched   = csched2_switch_sched,
     .alloc_domdata  = csched2_alloc_domdata,
     .free_domdata   = csched2_free_domdata,
 };
diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
index b96bd93..3bb8c71 100644
--- a/xen/common/sched_rt.c
+++ b/xen/common/sched_rt.c
@@ -682,6 +682,37 @@  rt_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
     spin_unlock_irqrestore(old_lock, flags);
 }
 
+/* Change the scheduler of cpu to us (RTDS). */
+static void
+rt_switch_sched(struct scheduler *new_ops, unsigned int cpu,
+                void *pdata, void *vdata)
+{
+    struct rt_private *prv = rt_priv(new_ops);
+    struct rt_vcpu *svc = vdata;
+
+    ASSERT(!pdata && svc && is_idle_vcpu(svc->vcpu));
+
+    /*
+     * We are holding the runqueue lock already (it's been taken in
+     * schedule_cpu_switch()). It's actually the runqueue lock of
+     * another scheduler, but that is how things need to be, for
+     * preventing races.
+     */
+    ASSERT(per_cpu(schedule_data, cpu).schedule_lock != &prv->lock);
+
+    idle_vcpu[cpu]->sched_priv = vdata;
+    per_cpu(scheduler, cpu) = new_ops;
+    per_cpu(schedule_data, cpu).sched_priv = NULL; /* no pdata */
+
+    /*
+     * (Re?)route the lock to the per pCPU lock as /last/ thing. In fact,
+     * if it is free (and it can be) we want that anyone that manages
+     * taking it, find all the initializations we've done above in place.
+     */
+    smp_mb();
+    per_cpu(schedule_data, cpu).schedule_lock = &prv->lock;
+}
+
 static void *
 rt_alloc_pdata(const struct scheduler *ops, int cpu)
 {
@@ -707,19 +738,6 @@  rt_alloc_pdata(const struct scheduler *ops, int cpu)
 static void
 rt_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 {
-    struct rt_private *prv = rt_priv(ops);
-    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
-    unsigned long flags;
-
-    spin_lock_irqsave(&prv->lock, flags);
-
-    /* Move spinlock back to the default lock */
-    ASSERT(sd->schedule_lock == &prv->lock);
-    ASSERT(!spin_is_locked(&sd->_lock));
-    sd->schedule_lock = &sd->_lock;
-
-    spin_unlock_irqrestore(&prv->lock, flags);
-
     free_cpumask_var(_cpumask_scratch[cpu]);
 }
 
@@ -1468,6 +1486,7 @@  static const struct scheduler sched_rtds_def = {
     .alloc_pdata    = rt_alloc_pdata,
     .free_pdata     = rt_free_pdata,
     .init_pdata     = rt_init_pdata,
+    .switch_sched   = rt_switch_sched,
     .alloc_domdata  = rt_alloc_domdata,
     .free_domdata   = rt_free_domdata,
     .init_domain    = rt_dom_init,
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 1941613..5559aa1 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -1635,11 +1635,11 @@  void __init scheduler_init(void)
 int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
 {
     struct vcpu *idle;
-    spinlock_t *lock;
     void *ppriv, *ppriv_old, *vpriv, *vpriv_old;
     struct scheduler *old_ops = per_cpu(scheduler, cpu);
     struct scheduler *new_ops = (c == NULL) ? &ops : c->sched;
     struct cpupool *old_pool = per_cpu(cpupool, cpu);
+    spinlock_t * old_lock;
 
     /*
      * pCPUs only move from a valid cpupool to free (i.e., out of any pool),
@@ -1658,11 +1658,21 @@  int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
     if ( old_ops == new_ops )
         goto out;
 
+    /*
+     * To setup the cpu for the new scheduler we need:
+     *  - a valid instance of per-CPU scheduler specific data, as it is
+     *    allocated by SCHED_OP(alloc_pdata). Note that we do not want to
+     *    initialize it yet (i.e., we are not calling SCHED_OP(init_pdata)).
+     *    That will be done by the target scheduler, in SCHED_OP(switch_sched),
+     *    in proper ordering and with locking.
+     *  - a valid instance of per-vCPU scheduler specific data, for the idle
+     *    vCPU of cpu. That is what the target scheduler will use for the
+     *    sched_priv field of the per-vCPU info of the idle domain.
+     */
     idle = idle_vcpu[cpu];
     ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);
     if ( IS_ERR(ppriv) )
         return PTR_ERR(ppriv);
-    SCHED_OP(new_ops, init_pdata, ppriv, cpu);
     vpriv = SCHED_OP(new_ops, alloc_vdata, idle, idle->domain->sched_priv);
     if ( vpriv == NULL )
     {
@@ -1670,17 +1680,30 @@  int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
         return -ENOMEM;
     }
 
-    lock = pcpu_schedule_lock_irq(cpu);
-
     SCHED_OP(old_ops, tick_suspend, cpu);
+
+    /*
+     * The actual switch, including (if necessary) the rerouting of the
+     * scheduler lock to whatever new_ops prefers,  needs to happen in one
+     * critical section, protected by old_ops' lock, or races are possible.
+     * It is, in fact, the lock of another scheduler that we are taking (the
+     * scheduler of the cpupool that cpu still belongs to). But that is ok
+     * as, anyone trying to schedule on this cpu will spin until when we
+     * release that lock (bottom of this function). When he'll get the lock
+     * --thanks to the loop inside *_schedule_lock() functions-- he'll notice
+     * that the lock itself changed, and retry acquiring the new one (which
+     * will be the correct, remapped one, at that point).
+     */
+    old_lock = pcpu_schedule_lock(cpu);
+
     vpriv_old = idle->sched_priv;
-    idle->sched_priv = vpriv;
-    per_cpu(scheduler, cpu) = new_ops;
     ppriv_old = per_cpu(schedule_data, cpu).sched_priv;
-    per_cpu(schedule_data, cpu).sched_priv = ppriv;
-    SCHED_OP(new_ops, tick_resume, cpu);
+    SCHED_OP(new_ops, switch_sched, cpu, ppriv, vpriv);
 
-    pcpu_schedule_unlock_irq(lock, cpu);
+    /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
+    spin_unlock_irq(old_lock);
+
+    SCHED_OP(new_ops, tick_resume, cpu);
 
     SCHED_OP(old_ops, free_vdata, vpriv_old);
     SCHED_OP(old_ops, free_pdata, ppriv_old, cpu);
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 70c08c6..9cebe41 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -137,6 +137,9 @@  struct scheduler {
     void         (*free_domdata)   (const struct scheduler *, void *);
     void *       (*alloc_domdata)  (const struct scheduler *, struct domain *);
 
+    void         (*switch_sched)   (struct scheduler *, unsigned int,
+                                    void *, void *);
+
     int          (*init_domain)    (const struct scheduler *, struct domain *);
     void         (*destroy_domain) (const struct scheduler *, struct domain *);