Message ID | 20160318190505.8117.89778.stgit@Solace.station (mailing list archive) |
---|---|
State | New, archived |
On Fri, Mar 18, 2016 at 3:05 PM, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> by using the sched_switch hook that we have introduced in
> the various schedulers.
>
> The key is to let the actual switch of scheduler and the
> remapping of the scheduler lock for the CPU (if necessary)
> happen together (in the same critical section) protected
> (at least) by the old scheduler lock for the CPU.
>
> This also means that, in Credit2 and RTDS, we can get rid
> of the code that was doing the scheduler lock remapping
> in csched2_free_pdata() and rt_free_pdata(), and of their
> triggering ASSERT-s.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Meng Xu <mengxu@cis.upenn.edu>
> Cc: Tianyang Chen <tiche@seas.upenn.edu>
> ---
>  xen/common/sched_credit.c  |  9 +++++++++
>  xen/common/sched_credit2.c | 28 ++++++++++------------------
>  xen/common/sched_rt.c      | 13 -------------
>  xen/common/schedule.c      | 30 +++++++++++++++++++++---------

For the sched_rt.c and schedule.c,

Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/
On 18/03/16 19:05, Dario Faggioli wrote:
> by using the sched_switch hook that we have introduced in
> the various schedulers.
>
> The key is to let the actual switch of scheduler and the
> remapping of the scheduler lock for the CPU (if necessary)
> happen together (in the same critical section) protected
> (at least) by the old scheduler lock for the CPU.
>
> This also means that, in Credit2 and RTDS, we can get rid
> of the code that was doing the scheduler lock remapping
> in csched2_free_pdata() and rt_free_pdata(), and of their
> triggering ASSERT-s.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Similar to my comment before -- in my own tree I squashed patches 6-9
into a single commit and found it much easier to review. :-)

One important question...

> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index 1adc0e2..29582a6 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -1617,7 +1617,6 @@ void __init scheduler_init(void)
>  int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>  {
>      struct vcpu *idle;
> -    spinlock_t *lock;
>      void *ppriv, *ppriv_old, *vpriv, *vpriv_old;
>      struct scheduler *old_ops = per_cpu(scheduler, cpu);
>      struct scheduler *new_ops = (c == NULL) ? &ops : c->sched;
> @@ -1640,11 +1639,21 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>      if ( old_ops == new_ops )
>          goto out;
>
> +    /*
> +     * To setup the cpu for the new scheduler we need:
> +     *  - a valid instance of per-CPU scheduler specific data, as it is
> +     *    allocated by SCHED_OP(alloc_pdata). Note that we do not want to
> +     *    initialize it yet (i.e., we are not calling SCHED_OP(init_pdata)).
> +     *    That will be done by the target scheduler, in SCHED_OP(switch_sched),
> +     *    in proper ordering and with locking.
> +     *  - a valid instance of per-vCPU scheduler specific data, for the idle
> +     *    vCPU of cpu. That is what the target scheduler will use for the
> +     *    sched_priv field of the per-vCPU info of the idle domain.
> +     */
>      idle = idle_vcpu[cpu];
>      ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);
>      if ( IS_ERR(ppriv) )
>          return PTR_ERR(ppriv);
> -    SCHED_OP(new_ops, init_pdata, ppriv, cpu);
>      vpriv = SCHED_OP(new_ops, alloc_vdata, idle, idle->domain->sched_priv);
>      if ( vpriv == NULL )
>      {
> @@ -1652,17 +1661,20 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>          return -ENOMEM;
>      }
>
> -    lock = pcpu_schedule_lock_irq(cpu);
> -
>      SCHED_OP(old_ops, tick_suspend, cpu);
> +
> +    /*
> +     * The actual switch, including (if necessary) the rerouting of the
> +     * scheduler lock to whatever new_ops prefers, needs to happen in one
> +     * critical section, protected by old_ops' lock, or races are possible.
> +     * Since each scheduler has its own contraints and locking scheme, do
> +     * that inside specific scheduler code, rather than here.
> +     */
>      vpriv_old = idle->sched_priv;
> -    idle->sched_priv = vpriv;
> -    per_cpu(scheduler, cpu) = new_ops;
>      ppriv_old = per_cpu(schedule_data, cpu).sched_priv;
> -    per_cpu(schedule_data, cpu).sched_priv = ppriv;
> -    SCHED_OP(new_ops, tick_resume, cpu);
> +    SCHED_OP(new_ops, switch_sched, cpu, ppriv, vpriv);

Is it really safe to read sched_priv without the lock held?

 -George
On 18/03/16 19:05, Dario Faggioli wrote:
> by using the sched_switch hook that we have introduced in
> the various schedulers.
>
> The key is to let the actual switch of scheduler and the
> remapping of the scheduler lock for the CPU (if necessary)
> happen together (in the same critical section) protected
> (at least) by the old scheduler lock for the CPU.

Thanks for trying to sort this out -- I've been looking at this since
yesterday afternoon and it certainly makes my head hurt. :-)

It looks like you want to do the locking inside the sched_switch()
callback, rather than outside of it, so that you can get the locking
order right (global private before per-cpu scheduler lock). Otherwise
you could just have schedule_cpu_switch grab and release the lock, and
let the sched_switch() callback set the lock as needed (knowing that
the correct lock is already held and will be released).

But the ordering between prv->lock and the scheduler lock only needs to
be between the prv lock and the scheduler lock *of a specific instance*
of the credit2 scheduler -- i.e., between prv->lock and prv->rqd[].lock.

And, critically, if we're calling sched_switch, then we already know
that the current pcpu lock is *not* one of the prv->rqd[].lock's,
because we check that at the top of schedule_cpu_switch().

So I think there should be no problem with:
1. Grabbing the pcpu schedule lock in schedule_cpu_switch()
2. Grabbing prv->lock in csched2_switch_sched()
3. Setting the per_cpu schedule lock as the very last thing in
   csched2_switch_sched()
4. Releasing the (old) pcpu schedule lock in schedule_cpu_switch().

What do you think? That would allow us to read ppriv_old and vpriv_old
with the schedule_lock held.

Unfortunately I can't off the top of my head think of a good assertion
to put in at #2 to assert that the per-pcpu lock is *not* one of the
runqueue locks in prv, because we don't yet know which runqueue this
cpu will be assigned to. But we could check when we actually do the
lock assignment to make sure that it's not already equal. That way
we'll either deadlock or ASSERT (which is not as good as always
ASSERTing, but is better than either deadlocking or working fine).

As an aside -- it seems to me that as soon as we change the scheduler
lock, there's a risk that something else may come along and try to grab
it / access the data. Does that mean we really ought to use memory
barriers to make sure that the lock is written only after all changes
to the scheduler data have been appropriately made?

> This also means that, in Credit2 and RTDS, we can get rid
> of the code that was doing the scheduler lock remapping
> in csched2_free_pdata() and rt_free_pdata(), and of their
> triggering ASSERT-s.

Right -- so to put it a different way, *all* schedulers must now set
the locking scheme they wish to use, even if they want to use the
default per-cpu locks. I think that means we have to do that for
arinc653 too, right?

At first I thought we could look at having schedule_cpu_switch() always
reset the lock before calling the switch_sched() callback; but if my
comment about memory barriers is accurate, then that won't work either.
In any case, there are only 4 schedulers, so it's not that hard to just
have them all set the locking scheme they want.

 -George
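To make the four-step ordering George proposes easier to follow, here is a small, stand-alone mock of it in C. This is not Xen code and not the eventual patch: the types, the pthread spinlocks, the runqueue-assignment logic and all the names below are simplified stand-ins; only the order of operations reflects the steps listed in the message above.

```c
/*
 * Mock of the four-step ordering discussed in the thread:
 *   1. generic code takes the (old) per-CPU schedule lock;
 *   2. the scheduler takes its global private lock;
 *   3. the per-CPU schedule_lock pointer is re-routed as the very last step;
 *   4. generic code releases the lock it took in step 1, not whatever the
 *      pointer happens to point at by then.
 */
#include <assert.h>
#include <pthread.h>

#define NR_CPUS  4
#define NR_RUNQS 2

struct schedule_data {
    pthread_spinlock_t *schedule_lock;  /* whatever lock the current scheduler wants */
    pthread_spinlock_t  _lock;          /* default per-CPU lock */
};

struct csched2_private {
    pthread_spinlock_t lock;            /* global private lock ("prv->lock") */
    pthread_spinlock_t rqd_lock[NR_RUNQS];
    int runq_map[NR_CPUS];
};

static struct schedule_data per_cpu_sd[NR_CPUS];

/* Steps 2 and 3: scheduler-specific, called with the old per-CPU lock held. */
static void csched2_switch_sched(struct csched2_private *prv, int cpu)
{
    pthread_spin_lock(&prv->lock);                 /* 2. global private lock */

    int rqi = cpu % NR_RUNQS;                      /* pick a runqueue (simplified) */
    prv->runq_map[cpu] = rqi;

    /* The check discussed above: the new lock must not already be in use,
     * or we would have self-deadlocked before even getting here. */
    assert(per_cpu_sd[cpu].schedule_lock != &prv->rqd_lock[rqi]);

    per_cpu_sd[cpu].schedule_lock = &prv->rqd_lock[rqi];  /* 3. very last thing */

    pthread_spin_unlock(&prv->lock);
}

/* Steps 1 and 4: generic code. */
static void schedule_cpu_switch(struct csched2_private *new_prv, int cpu)
{
    pthread_spinlock_t *old_lock = per_cpu_sd[cpu].schedule_lock;

    pthread_spin_lock(old_lock);                   /* 1. old scheduler's lock */
    csched2_switch_sched(new_prv, cpu);            /* may re-route schedule_lock */
    pthread_spin_unlock(old_lock);                 /* 4. release what we took */
}

int main(void)
{
    struct csched2_private prv = { 0 };

    pthread_spin_init(&prv.lock, PTHREAD_PROCESS_PRIVATE);
    for ( int i = 0; i < NR_RUNQS; i++ )
        pthread_spin_init(&prv.rqd_lock[i], PTHREAD_PROCESS_PRIVATE);
    for ( int i = 0; i < NR_CPUS; i++ )
    {
        pthread_spin_init(&per_cpu_sd[i]._lock, PTHREAD_PROCESS_PRIVATE);
        per_cpu_sd[i].schedule_lock = &per_cpu_sd[i]._lock;  /* default mapping */
    }

    schedule_cpu_switch(&prv, 0);   /* move CPU 0 to the mock "credit2" instance */
    return 0;
}
```

The point of the shape is that the caller remembers which lock it took, so re-routing the pointer inside the callback cannot confuse the unlock in step 4.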
On Wed, 2016-03-23 at 19:05 +0000, George Dunlap wrote:
> On 18/03/16 19:05, Dario Faggioli wrote:
> > 
> > by using the sched_switch hook that we have introduced in
> > the various schedulers.
> > 
> > The key is to let the actual switch of scheduler and the
> > remapping of the scheduler lock for the CPU (if necessary)
> > happen together (in the same critical section) protected
> > (at least) by the old scheduler lock for the CPU.
> > 
> > This also means that, in Credit2 and RTDS, we can get rid
> > of the code that was doing the scheduler lock remapping
> > in csched2_free_pdata() and rt_free_pdata(), and of their
> > triggering ASSERT-s.
> > 
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Similar to my comment before -- in my own tree I squashed patches 6-9
> into a single commit and found it much easier to review. :-)
> 
I understand your point.

I'll consider doing something like this for v2 (that I'm just finishing
putting together), but I'm not sure I like it.

For instance, although the issue has the same roots and similar
consequences for all schedulers, the actual race is different between
Credit1 and Credit2 (RTDS is the same as Credit2), and having distinct
patches for each scheduler allows me to describe both situations in
detail, in their respective changelogs, without the changelogs
themselves becoming too long (they're actually quite long already!!).

> One important question...
> 
> > 
> > --- a/xen/common/schedule.c
> > +++ b/xen/common/schedule.c
> > 
> > @@ -1652,17 +1661,20 @@ int schedule_cpu_switch(unsigned int cpu,
> > struct cpupool *c)
> >          return -ENOMEM;
> >      }
> >  
> > -    lock = pcpu_schedule_lock_irq(cpu);
> > -
> >      SCHED_OP(old_ops, tick_suspend, cpu);
> > +
> > +    /*
> > +     * The actual switch, including (if necessary) the rerouting
> > of the
> > +     * scheduler lock to whatever new_ops prefers, needs to
> > happen in one
> > +     * critical section, protected by old_ops' lock, or races are
> > possible.
> > +     * Since each scheduler has its own contraints and locking
> > scheme, do
> > +     * that inside specific scheduler code, rather than here.
> > +     */
> >      vpriv_old = idle->sched_priv;
> > -    idle->sched_priv = vpriv;
> > -    per_cpu(scheduler, cpu) = new_ops;
> >      ppriv_old = per_cpu(schedule_data, cpu).sched_priv;
> > -    per_cpu(schedule_data, cpu).sched_priv = ppriv;
> > -    SCHED_OP(new_ops, tick_resume, cpu);
> > +    SCHED_OP(new_ops, switch_sched, cpu, ppriv, vpriv);
> Is it really safe to read sched_priv without the lock held?
> 
So, you put down a lot more reasoning on this issue in another email,
and I'll reply at more length to that one. But just about this specific
thing.

We're in schedule_cpu_switch(), and schedule_cpu_switch() is indeed the
only function that changes the content of sd->sched_priv, when the
system is _live_. It reads the old pointer, stashes it, allocates the
new one, assigns it, and frees the old one. It's therefore only because
of this function that a race can happen.

In fact, the only other situation where sched_priv changes is during
cpu bringup (CPU_UP_PREPARE phase), or teardown. But those cases are
not of much concern (and, in fact, there's no locking in there,
independently from this series).

Now, schedule_cpu_switch is called by:

 1 cpupool.c cpupool_assign_cpu_locked   268 ret = schedule_cpu_switch(cpu, c);
 2 cpupool.c cpupool_unassign_cpu_helper 325 ret = schedule_cpu_switch(cpu, NULL);

to move the cpu into or out of a cpupool, and in both cases, we have
taken the cpupool_lock spinlock already when calling it.

So, yes, it looks to me that it is safe to manipulate sched_priv the
way the patch does... Am I overlooking something?

Thanks and Regards,
Dario
On Thu, 2016-03-24 at 12:14 +0000, George Dunlap wrote:
> On 18/03/16 19:05, Dario Faggioli wrote:
> > 
> > by using the sched_switch hook that we have introduced in
> > the various schedulers.
> > 
> > The key is to let the actual switch of scheduler and the
> > remapping of the scheduler lock for the CPU (if necessary)
> > happen together (in the same critical section) protected
> > (at least) by the old scheduler lock for the CPU.
> Thanks for trying to sort this out -- I've been looking at this since
> yesterday afternoon and it certainly makes my head hurt. :-)
> 
> It looks like you want to do the locking inside the sched_switch()
> callback, rather than outside of it, so that you can get the locking
> order right (global private before per-cpu scheduler lock).
> 
Yes, especially considering that such locking order varies with the
scheduler. In fact, in Credit1, it's per-cpu first, global second; in
Credit2, it's the other way round: global first, per-runq second. In
RTDS there is only one global lock and in ARINC only the per-cpu locks.

> Otherwise
> you could just have schedule_cpu_switch grab and release the lock,
> and
> let the sched_switch() callback set the lock as needed (knowing that
> the
> correct lock is already held and will be released).
> 
Yeah, after all, that's what is being done even right now. Of course,
right now it's buggy, but even if it probably can be made to work, in
the way you suggest, I thought bringing anything that has to do with
the locking order required by a specific scheduler _inside_ the
specific scheduler code would be a good thing.

I actually still think that, but I recognize the ugliness of accessing
ppriv_old and vpriv_old outside of scheduler locks (although, as I said
in the other email, it should be safe, as switching scheduler is
serialized by cpupool_lock anyway... perhaps I should put this in a
comment), not to mention how ugly the interface provided by
sched_switch() is (talking about the last two parameters :-/).

> But the ordering between prv->lock and the scheduler lock only needs
> to
> be between the prv lock and scheduler lock *of a specific instance*
> of
> the credit2 scheduler -- i.e., between prv->lock and prv->rqd[].lock.
> 
True...

> And, critically, if we're calling sched_switch, then we already know
> that the current pcpu lock is *not* one of the prv->rqd[].lock's
> because
> we check that at the top of schedule_cpu_switch().
> 
... True again...

> So I think there should be no problem with:
> 1. Grabbing the pcpu schedule lock in schedule_cpu_switch()
> 2. Grabbing prv->lock in csched2_switch_sched()
> 3. Setting the per_cpu schedule lock as the very last thing in
> csched2_switch_sched()
> 4. Releasing the (old) pcpu schedule lock in schedule_cpu_switch().
> 
> What do you think?
> 
I think it should work. We'll be doing the scheduler lock manipulation
protected by the old and wrong (as it's the one of another scheduler)
per-cpu/runq lock, and the correct global private lock. It would look
like the ordering between the two locks is the wrong one, in Credit2,
but it's not, because the per-runq lock involved is the other
scheduler's one.

Tricky, but everything is in here! :-/

I'm not sure whether it's better or worse wrt what I have. I'm trying
to code it up, to see how it looks (and if it allows me to get rid of
the ugliness in sched_switch()).

> That would allow us to read ppriv_old and vpriv_old with the
> schedule_lock held.
> 
Also good. :-)

> Unfortunately I can't off the top of my head think of a good
> assertion
> to put in at #2 to assert that the per-pcpu lock is *not* one of
> runqueue locks in prv, because we don't yet know which runqueue this
> cpu
> will be assigned to. But we could check when we actually do the lock
> assignment to make sure that it's not already equal. That way we'll
> either deadlock or ASSERT (which is not as good as always ASSERTing,
> but
> is better than either deadlocking or working fine).
> 
Yep, I don't think we can do much better than that.

> As an aside -- it seems to me that as soon as we change the scheduler
> lock, there's a risk that something else may come along and try to
> grab
> it / access the data. Does that mean we really ought to use memory
> barriers to make sure that the lock is written only after all changes
> to
> the scheduler data have been appropriately made?
> 
Yes, if it were only for this code, I think you're right, barriers
would be necessary. I once again think this is actually safe, because
it's serialized elsewhere, but thinking more about it, I can well add
both barriers (and a comment).

> > 
> > This also means that, in Credit2 and RTDS, we can get rid
> > of the code that was doing the scheduler lock remapping
> > in csched2_free_pdata() and rt_free_pdata(), and of their
> > triggering ASSERT-s.
> Right -- so to put it a different way, *all* schedulers must now set
> the
> locking scheme they wish to use, even if they want to use the default
> per-cpu locks.
> 
Exactly.

> I think that means we have to do that for arinc653 too,
> right?
> 
Mmm... right, I'll have a look at that.

Thanks for looking at this, and sorry for the late reply.

Regards,
Dario
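Purely as an illustration of the barrier concern and of the "check at lock-assignment time" idea exchanged above (this is not the code that went into the series; in the hypervisor the barrier under discussion is smp_mb(), and the types and names below are made up for the example), the property wanted is that the lock pointer only becomes visible after every other write to the scheduler data:

```c
#include <assert.h>
#include <stdatomic.h>

typedef int spinlock_t;                      /* stand-in lock type             */

struct sched_data {
    int scheduler_state;                     /* stands for "all the other data" */
    _Atomic(spinlock_t *) schedule_lock;     /* pointer other CPUs may follow   */
};

static void reroute_lock(struct sched_data *sd, spinlock_t *new_lock, int state)
{
    /* The check discussed above: never "re-route" to the lock already in use. */
    assert(atomic_load(&sd->schedule_lock) != new_lock);

    sd->scheduler_state = state;             /* update the data first...        */

    /*
     * ...then publish the new lock pointer.  The release store orders the
     * data write before the pointer write, which is the ordering the thread
     * wants from smp_mb() in the hypervisor.
     */
    atomic_store_explicit(&sd->schedule_lock, new_lock, memory_order_release);
}
```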
On Tue, 2016-04-05 at 18:26 +0200, Dario Faggioli wrote:
> On Wed, 2016-03-23 at 19:05 +0000, George Dunlap wrote:
> > On 18/03/16 19:05, Dario Faggioli wrote:
> > > This also means that, in Credit2 and RTDS, we can get rid
> > > of the code that was doing the scheduler lock remapping
> > > in csched2_free_pdata() and rt_free_pdata(), and of their
> > > triggering ASSERT-s.
> > > 
> > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > Similar to my comment before -- in my own tree I squashed patches
> > 6-9
> > into a single commit and found it much easier to review. :-)
> > 
> I understand your point.
> 
> I'll consider doing something like this for v2 (that I'm just
> finishing
> putting together), but I'm not sure I like it.
> 
BTW, I changed my mind and, in the patch series I'm about to send, I
did as you suggest and squashed these patches into one. :-)

The changelog is indeed rather long, but still fine, IMO, and there's a
lot less duplication, both in `git log' and in code (if one looks at it
'patches after patches', rather than just the final result).

Thanks and Regards,
Dario
On Tue, 2016-04-05 at 19:37 +0200, Dario Faggioli wrote:
> On Thu, 2016-03-24 at 12:14 +0000, George Dunlap wrote:
> > On 18/03/16 19:05, Dario Faggioli wrote:
> > So I think there should be no problem with:
> > 1. Grabbing the pcpu schedule lock in schedule_cpu_switch()
> > 2. Grabbing prv->lock in csched2_switch_sched()
> > 3. Setting the per_cpu schedule lock as the very last thing in
> > csched2_switch_sched()
> > 4. Releasing the (old) pcpu schedule lock in schedule_cpu_switch().
> > 
> > What do you think?
> > 
> I think it should work. We'll be doing the scheduler lock
> manipulation
> protected by the old and wrong (as it's the one of another scheduler)
> per-cpu/runq lock, and the correct global private lock. It would look
> like the ordering between the two locks is the wrong one, in Credit2,
> but it's not because of the fact that the per-runq lock is the other
> scheduler's one.
> 
> Tricky, but everything is in here! :-/
> 
I've done it as you suggest above. The new .switch_sched hook is still
there, and still does look the same. But I do indeed like the final
look of the code better, and it appears to be working ok.

Have a look. ;-)

> > As an aside -- it seems to me that as soon as we change the
> > scheduler
> > lock, there's a risk that something else may come along and try to
> > grab
> > it / access the data. Does that mean we really ought to use memory
> > barriers to make sure that the lock is written only after all
> > changes
> > to
> > the scheduler data have been appropriately made?
> > 
> Yes, if it were only for this code, I think you're right, barriers
> would be necessary. I once again think this is actually safe, because
> it's serialized elsewhere, but thinking more about it, I can well add
> both barriers (and a comment).
> 
And I've added smp_mb()-s too.

> > > This also means that, in Credit2 and RTDS, we can get rid
> > > of the code that was doing the scheduler lock remapping
> > > in csched2_free_pdata() and rt_free_pdata(), and of their
> > > triggering ASSERT-s.
> > Right -- so to put it a different way, *all* schedulers must now
> > set
> > the
> > locking scheme they wish to use, even if they want to use the
> > default
> > per-cpu locks.
> > 
> Exactly.
> 
> > 
> > I think that means we have to do that for arinc653 too,
> > right?
> > 
> Mmm... right, I'll have a look at that.
> 
And, finally, I did have a look at this too, and I actually don't think
ARINC needs any of this.

In fact, ARINC brings the idea of "doing its own locking" much further
than the other schedulers we have. They have their lock and they use it
in such a way that they don't even care what {v,p}cpu_schedule_lock()
and friends point to.

As an example, check a653sched_do_schedule(). It's called from
schedule(), right after taking the runqueue lock, with
pcpu_schedule_lock_irq(), and yet it does this:

    spin_lock_irqsave(&sched_priv->lock, flags);

So, I'd actually better _not_ add anything to this series that re-maps
sd->schedule_lock to point to their sched_priv->lock, or we'd deadlock!

I'm not sure the design behind all this is the best possible one, but
that's a different issue, to be dealt with in another series at another
time. :-)

In any case, I've added Robert and Josh.

Regards,
Dario
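The deadlock described in the message above can be illustrated with a tiny stand-alone mock (again, not Xen code; the names are borrowed from the message purely for readability). If the per-CPU schedule_lock were re-mapped to ARINC's own per-scheduler lock, the call path Dario quotes would acquire the same non-recursive lock twice:

```c
/* Running this will hang (or error) on the second acquisition -- which is the point. */
#include <pthread.h>

static pthread_spinlock_t a653_priv_lock;                    /* plays sched_priv->lock  */
static pthread_spinlock_t *schedule_lock = &a653_priv_lock;  /* the hypothetical re-map */

static void a653sched_do_schedule_mock(void)
{
    pthread_spin_lock(&a653_priv_lock);      /* second acquisition: self-deadlock */
    /* ... pick the next vCPU ... */
    pthread_spin_unlock(&a653_priv_lock);
}

static void schedule_mock(void)
{
    pthread_spin_lock(schedule_lock);        /* first acquisition (per-CPU lock)  */
    a653sched_do_schedule_mock();            /* never returns                     */
    pthread_spin_unlock(schedule_lock);
}

int main(void)
{
    pthread_spin_init(&a653_priv_lock, PTHREAD_PROCESS_PRIVATE);
    schedule_mock();
    return 0;
}
```

Which is why, as the message concludes, the series leaves ARINC's schedule_lock mapping alone and lets it keep taking its private lock internally.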
```diff
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 929ba9c..903a704 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -577,6 +577,15 @@ csched_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
 {
     unsigned long flags;
     struct csched_private *prv = CSCHED_PRIV(ops);
+    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
+
+    /*
+     * This is called either during during boot, resume or hotplug, in
+     * case Credit1 is the scheduler chosen at boot. In such cases, the
+     * scheduler lock for cpu is already pointing to the default per-cpu
+     * spinlock, as Credit1 needs it, so there is no remapping to be done.
+     */
+    ASSERT(sd->schedule_lock == &sd->_lock && !spin_is_locked(&sd->_lock));
 
     spin_lock_irqsave(&prv->lock, flags);
     init_pdata(prv, pdata, cpu);
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 25d8e85..64fb028 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -1974,7 +1974,6 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
 {
     unsigned rqi;
     struct csched2_runqueue_data *rqd;
-    spinlock_t *old_lock;
 
     ASSERT(spin_is_locked(&prv->lock));
     ASSERT(!cpumask_test_cpu(cpu, &prv->initialized));
@@ -2005,21 +2004,11 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
         activate_runqueue(prv, rqi);
     }
 
-    /* IRQs already disabled */
-    old_lock = pcpu_schedule_lock(cpu);
-
-    /* Move spinlock to new runq lock. */
-    per_cpu(schedule_data, cpu).schedule_lock = &rqd->lock;
-
     /* Set the runqueue map */
     prv->runq_map[cpu] = rqi;
 
     cpumask_set_cpu(cpu, &rqd->idle);
     cpumask_set_cpu(cpu, &rqd->active);
-
-    /* _Not_ pcpu_schedule_unlock(): per_cpu().schedule_lock changed! */
-    spin_unlock(old_lock);
-
     cpumask_set_cpu(cpu, &prv->initialized);
 
     return rqi;
@@ -2029,10 +2018,19 @@ static void
 csched2_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
 {
     struct csched2_private *prv = CSCHED2_PRIV(ops);
+    spinlock_t *old_lock;
     unsigned long flags;
+    unsigned rqi;
 
     spin_lock_irqsave(&prv->lock, flags);
-    init_pdata(prv, cpu);
+    old_lock = pcpu_schedule_lock(cpu);
+
+    rqi = init_pdata(prv, cpu);
+    /* Move the scheduler lock to the new runq lock. */
+    per_cpu(schedule_data, cpu).schedule_lock = &prv->rqd[rqi].lock;
+
+    /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
+    spin_unlock(old_lock);
     spin_unlock_irqrestore(&prv->lock, flags);
 }
 
@@ -2079,7 +2077,6 @@ csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
     unsigned long flags;
     struct csched2_private *prv = CSCHED2_PRIV(ops);
     struct csched2_runqueue_data *rqd;
-    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
     int rqi;
 
     spin_lock_irqsave(&prv->lock, flags);
@@ -2107,11 +2104,6 @@ csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
         deactivate_runqueue(prv, rqi);
     }
 
-    /* Move spinlock to the original lock. */
-    ASSERT(sd->schedule_lock == &rqd->lock);
-    ASSERT(!spin_is_locked(&sd->_lock));
-    sd->schedule_lock = &sd->_lock;
-
     spin_unlock(&rqd->lock);
 
     cpumask_clear_cpu(cpu, &prv->initialized);
diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
index 92be248..0564b1d 100644
--- a/xen/common/sched_rt.c
+++ b/xen/common/sched_rt.c
@@ -718,19 +718,6 @@ rt_alloc_pdata(const struct scheduler *ops, int cpu)
 static void
 rt_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 {
-    struct rt_private *prv = rt_priv(ops);
-    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
-    unsigned long flags;
-
-    spin_lock_irqsave(&prv->lock, flags);
-
-    /* Move spinlock back to the default lock */
-    ASSERT(sd->schedule_lock == &prv->lock);
-    ASSERT(!spin_is_locked(&sd->_lock));
-    sd->schedule_lock = &sd->_lock;
-
-    spin_unlock_irqrestore(&prv->lock, flags);
-
     free_cpumask_var(_cpumask_scratch[cpu]);
 }
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 1adc0e2..29582a6 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -1617,7 +1617,6 @@ void __init scheduler_init(void)
 int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
 {
     struct vcpu *idle;
-    spinlock_t *lock;
     void *ppriv, *ppriv_old, *vpriv, *vpriv_old;
     struct scheduler *old_ops = per_cpu(scheduler, cpu);
     struct scheduler *new_ops = (c == NULL) ? &ops : c->sched;
@@ -1640,11 +1639,21 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
     if ( old_ops == new_ops )
         goto out;
 
+    /*
+     * To setup the cpu for the new scheduler we need:
+     *  - a valid instance of per-CPU scheduler specific data, as it is
+     *    allocated by SCHED_OP(alloc_pdata). Note that we do not want to
+     *    initialize it yet (i.e., we are not calling SCHED_OP(init_pdata)).
+     *    That will be done by the target scheduler, in SCHED_OP(switch_sched),
+     *    in proper ordering and with locking.
+     *  - a valid instance of per-vCPU scheduler specific data, for the idle
+     *    vCPU of cpu. That is what the target scheduler will use for the
+     *    sched_priv field of the per-vCPU info of the idle domain.
+     */
     idle = idle_vcpu[cpu];
     ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);
     if ( IS_ERR(ppriv) )
         return PTR_ERR(ppriv);
-    SCHED_OP(new_ops, init_pdata, ppriv, cpu);
     vpriv = SCHED_OP(new_ops, alloc_vdata, idle, idle->domain->sched_priv);
     if ( vpriv == NULL )
     {
@@ -1652,17 +1661,20 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
         return -ENOMEM;
     }
 
-    lock = pcpu_schedule_lock_irq(cpu);
-
     SCHED_OP(old_ops, tick_suspend, cpu);
+
+    /*
+     * The actual switch, including (if necessary) the rerouting of the
+     * scheduler lock to whatever new_ops prefers, needs to happen in one
+     * critical section, protected by old_ops' lock, or races are possible.
+     * Since each scheduler has its own contraints and locking scheme, do
+     * that inside specific scheduler code, rather than here.
+     */
     vpriv_old = idle->sched_priv;
-    idle->sched_priv = vpriv;
-    per_cpu(scheduler, cpu) = new_ops;
     ppriv_old = per_cpu(schedule_data, cpu).sched_priv;
-    per_cpu(schedule_data, cpu).sched_priv = ppriv;
-    SCHED_OP(new_ops, tick_resume, cpu);
+    SCHED_OP(new_ops, switch_sched, cpu, ppriv, vpriv);
 
-    pcpu_schedule_unlock_irq(lock, cpu);
+    SCHED_OP(new_ops, tick_resume, cpu);
 
     SCHED_OP(old_ops, free_vdata, vpriv_old);
     SCHED_OP(old_ops, free_pdata, ppriv_old, cpu);
```
by using the sched_switch hook that we have introduced in
the various schedulers.

The key is to let the actual switch of scheduler and the
remapping of the scheduler lock for the CPU (if necessary)
happen together (in the same critical section) protected
(at least) by the old scheduler lock for the CPU.

This also means that, in Credit2 and RTDS, we can get rid
of the code that was doing the scheduler lock remapping
in csched2_free_pdata() and rt_free_pdata(), and of their
triggering ASSERT-s.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Meng Xu <mengxu@cis.upenn.edu>
Cc: Tianyang Chen <tiche@seas.upenn.edu>
---
 xen/common/sched_credit.c  |  9 +++++++++
 xen/common/sched_credit2.c | 28 ++++++++++------------------
 xen/common/sched_rt.c      | 13 -------------
 xen/common/schedule.c      | 30 +++++++++++++++++++++---------
 4 files changed, 40 insertions(+), 40 deletions(-)