Message ID | 20231019112314.22665-1-jgross@suse.com
---|---
State | Superseded
Series | xen/sched: fix sched_move_domain()
Hi Juergen,

> On Oct 19, 2023, at 19:23, Juergen Gross <jgross@suse.com> wrote:
>
> When moving a domain out of a cpupool running with the credit2
> scheduler and having multiple run-queues, the following ASSERT() can
> be observed:
>
> (XEN) Xen call trace:
> (XEN) [<ffff82d04023a700>] R credit2.c#csched2_unit_remove+0xe3/0xe7
> (XEN) [<ffff82d040246adb>] S sched_move_domain+0x2f3/0x5b1
> (XEN) [<ffff82d040234cf7>] S cpupool.c#cpupool_move_domain_locked+0x1d/0x3b
> (XEN) [<ffff82d040236025>] S cpupool_move_domain+0x24/0x35
> (XEN) [<ffff82d040206513>] S domain_kill+0xa5/0x116
> (XEN) [<ffff82d040232b12>] S do_domctl+0xe5f/0x1951
> (XEN) [<ffff82d0402276ba>] S timer.c#timer_lock+0x69/0x143
> (XEN) [<ffff82d0402dc71b>] S pv_hypercall+0x44e/0x4a9
> (XEN) [<ffff82d0402012b7>] S lstar_enter+0x137/0x140
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 1:
> (XEN) Assertion 'svc->rqd == c2rqd(sched_unit_master(unit))' failed at common/sched/credit2.c:1159
> (XEN) ****************************************
>
> This is happening as sched_move_domain() is setting a different cpu
> for a scheduling unit without telling the scheduler. When this unit is
> removed from the scheduler, the ASSERT() will trigger.
>
> In non-debug builds the result is usually a clobbered pointer, leading
> to another crash a short time later.
>
> Fix that by swapping the two involved actions (setting another cpu and
> removing the unit from the scheduler).
>
> Cc: Henry Wang <Henry.Wang@arm.com>

Emmm, I think ^ this Cc is better to be moved below the scissors line,
otherwise if this patch is committed, this line will be shown in the
commit message...

> Fixes: 70fadc41635b ("xen/cpupool: support moving domain between cpupools with different granularity")
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
> This fixes a regression introduced in Xen 4.15. The fix is very simple
> and it will affect only configurations with multiple cpupools. I think
> whether to include it in 4.18 should be decided by the release manager
> based on the current state of the release (I think I wouldn't have
> added it that late in the release while being the release manager).

Thanks for the reminder :)

Please correct me if I am wrong: if this is fixing a regression introduced
in 4.15, shouldn't this patch be backported to 4.15, 4.16, 4.17 and soon
4.18? So honestly I think that, at least for 4.18, adding this patch now or
later won't make much difference... I am ok either way, I guess.

Kind regards,
Henry
On 19.10.23 13:31, Henry Wang wrote:
> Hi Juergen,
>
>> On Oct 19, 2023, at 19:23, Juergen Gross <jgross@suse.com> wrote:
>>
>> When moving a domain out of a cpupool running with the credit2
>> scheduler and having multiple run-queues, the following ASSERT() can
>> be observed:
>>
>> (XEN) Xen call trace:
>> (XEN) [<ffff82d04023a700>] R credit2.c#csched2_unit_remove+0xe3/0xe7
>> (XEN) [<ffff82d040246adb>] S sched_move_domain+0x2f3/0x5b1
>> (XEN) [<ffff82d040234cf7>] S cpupool.c#cpupool_move_domain_locked+0x1d/0x3b
>> (XEN) [<ffff82d040236025>] S cpupool_move_domain+0x24/0x35
>> (XEN) [<ffff82d040206513>] S domain_kill+0xa5/0x116
>> (XEN) [<ffff82d040232b12>] S do_domctl+0xe5f/0x1951
>> (XEN) [<ffff82d0402276ba>] S timer.c#timer_lock+0x69/0x143
>> (XEN) [<ffff82d0402dc71b>] S pv_hypercall+0x44e/0x4a9
>> (XEN) [<ffff82d0402012b7>] S lstar_enter+0x137/0x140
>> (XEN)
>> (XEN)
>> (XEN) ****************************************
>> (XEN) Panic on CPU 1:
>> (XEN) Assertion 'svc->rqd == c2rqd(sched_unit_master(unit))' failed at common/sched/credit2.c:1159
>> (XEN) ****************************************
>>
>> This is happening as sched_move_domain() is setting a different cpu
>> for a scheduling unit without telling the scheduler. When this unit is
>> removed from the scheduler, the ASSERT() will trigger.
>>
>> In non-debug builds the result is usually a clobbered pointer, leading
>> to another crash a short time later.
>>
>> Fix that by swapping the two involved actions (setting another cpu and
>> removing the unit from the scheduler).
>>
>> Cc: Henry Wang <Henry.Wang@arm.com>
>
> Emmm, I think ^ this Cc is better to be moved below the scissors line,
> otherwise if this patch is committed, this line will be shown in the
> commit message...
>
>> Fixes: 70fadc41635b ("xen/cpupool: support moving domain between cpupools with different granularity")
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> ---
>> This fixes a regression introduced in Xen 4.15. The fix is very simple
>> and it will affect only configurations with multiple cpupools. I think
>> whether to include it in 4.18 should be decided by the release manager
>> based on the current state of the release (I think I wouldn't have
>> added it that late in the release while being the release manager).
>
> Thanks for the reminder :)
>
> Please correct me if I am wrong: if this is fixing a regression introduced
> in 4.15, shouldn't this patch be backported to 4.15, 4.16, 4.17 and soon
> 4.18? So honestly I think that, at least for 4.18, adding this patch now or
> later won't make much difference... I am ok either way, I guess.

You are right, the patch needs to be backported.

OTOH nobody noticed the regression until a few days ago. So delaying the fix
for a few weeks would probably not hurt too many people. Each patch changing
code carries the risk of introducing another regression, so it's your
decision whether you want to take this risk. Especially changes like this
one, touching the core scheduling code, always carry a latent risk of
opening up a small race window (this could be said for the original patch
I'm fixing here, too :-) ).


Juergen
Hi Juergen,

> On Oct 19, 2023, at 19:38, Juergen Gross <jgross@suse.com> wrote:
>
> On 19.10.23 13:31, Henry Wang wrote:
>> Hi Juergen,
>>
>>> On Oct 19, 2023, at 19:23, Juergen Gross <jgross@suse.com> wrote:
>>>
>>> When moving a domain out of a cpupool running with the credit2
>>> scheduler and having multiple run-queues, the following ASSERT() can
>>> be observed:
>>>
>>> (XEN) Xen call trace:
>>> (XEN) [<ffff82d04023a700>] R credit2.c#csched2_unit_remove+0xe3/0xe7
>>> (XEN) [<ffff82d040246adb>] S sched_move_domain+0x2f3/0x5b1
>>> (XEN) [<ffff82d040234cf7>] S cpupool.c#cpupool_move_domain_locked+0x1d/0x3b
>>> (XEN) [<ffff82d040236025>] S cpupool_move_domain+0x24/0x35
>>> (XEN) [<ffff82d040206513>] S domain_kill+0xa5/0x116
>>> (XEN) [<ffff82d040232b12>] S do_domctl+0xe5f/0x1951
>>> (XEN) [<ffff82d0402276ba>] S timer.c#timer_lock+0x69/0x143
>>> (XEN) [<ffff82d0402dc71b>] S pv_hypercall+0x44e/0x4a9
>>> (XEN) [<ffff82d0402012b7>] S lstar_enter+0x137/0x140
>>> (XEN)
>>> (XEN)
>>> (XEN) ****************************************
>>> (XEN) Panic on CPU 1:
>>> (XEN) Assertion 'svc->rqd == c2rqd(sched_unit_master(unit))' failed at common/sched/credit2.c:1159
>>> (XEN) ****************************************
>>>
>>> This is happening as sched_move_domain() is setting a different cpu
>>> for a scheduling unit without telling the scheduler. When this unit is
>>> removed from the scheduler, the ASSERT() will trigger.
>>>
>>> In non-debug builds the result is usually a clobbered pointer, leading
>>> to another crash a short time later.
>>>
>>> Fix that by swapping the two involved actions (setting another cpu and
>>> removing the unit from the scheduler).
>>>
>>> Cc: Henry Wang <Henry.Wang@arm.com>
>>
>> Emmm, I think ^ this Cc is better to be moved below the scissors line,
>> otherwise if this patch is committed, this line will be shown in the
>> commit message...
>>
>>> Fixes: 70fadc41635b ("xen/cpupool: support moving domain between cpupools with different granularity")
>>> Signed-off-by: Juergen Gross <jgross@suse.com>
>>> ---
>>> This fixes a regression introduced in Xen 4.15. The fix is very simple
>>> and it will affect only configurations with multiple cpupools. I think
>>> whether to include it in 4.18 should be decided by the release manager
>>> based on the current state of the release (I think I wouldn't have
>>> added it that late in the release while being the release manager).
>>
>> Thanks for the reminder :)
>>
>> Please correct me if I am wrong: if this is fixing a regression introduced
>> in 4.15, shouldn't this patch be backported to 4.15, 4.16, 4.17 and soon
>> 4.18? So honestly I think that, at least for 4.18, adding this patch now or
>> later won't make much difference... I am ok either way, I guess.
>
> You are right, the patch needs to be backported.
>
> OTOH nobody noticed the regression until a few days ago. So delaying the fix
> for a few weeks would probably not hurt too many people.

I am planning to branch next week, so I would say this patch will probably
be delayed by at most one week, I guess.

> Each patch changing code carries the risk of introducing another
> regression, so it's your decision whether you want to take this risk.
> Especially changes like this one, touching the core scheduling code,
> always carry a latent risk of opening up a small race window (this could
> be said for the original patch I'm fixing here, too :-) ).

That said, given that this patch is fixing a regression introduced years ago
and will be backported to the stable releases, I am leaning more towards
merging this patch when the tree reopens (unless others strongly object), so
that it will get the opportunity to be tested properly and we won't take too
much risk of delaying the 4.18 release. Also, this decision is consistent
with your above words (as an ex-release-manager) below the scissors line :)

Hopefully you will also be ok with that.

Kind regards,
Henry

>
> Juergen
On 19.10.23 13:49, Henry Wang wrote:
> Hi Juergen,
>
>> On Oct 19, 2023, at 19:38, Juergen Gross <jgross@suse.com> wrote:
>>
>> On 19.10.23 13:31, Henry Wang wrote:
>>> Hi Juergen,
>>>
>>>> On Oct 19, 2023, at 19:23, Juergen Gross <jgross@suse.com> wrote:
>>>>
>>>> When moving a domain out of a cpupool running with the credit2
>>>> scheduler and having multiple run-queues, the following ASSERT() can
>>>> be observed:
>>>>
>>>> (XEN) Xen call trace:
>>>> (XEN) [<ffff82d04023a700>] R credit2.c#csched2_unit_remove+0xe3/0xe7
>>>> (XEN) [<ffff82d040246adb>] S sched_move_domain+0x2f3/0x5b1
>>>> (XEN) [<ffff82d040234cf7>] S cpupool.c#cpupool_move_domain_locked+0x1d/0x3b
>>>> (XEN) [<ffff82d040236025>] S cpupool_move_domain+0x24/0x35
>>>> (XEN) [<ffff82d040206513>] S domain_kill+0xa5/0x116
>>>> (XEN) [<ffff82d040232b12>] S do_domctl+0xe5f/0x1951
>>>> (XEN) [<ffff82d0402276ba>] S timer.c#timer_lock+0x69/0x143
>>>> (XEN) [<ffff82d0402dc71b>] S pv_hypercall+0x44e/0x4a9
>>>> (XEN) [<ffff82d0402012b7>] S lstar_enter+0x137/0x140
>>>> (XEN)
>>>> (XEN)
>>>> (XEN) ****************************************
>>>> (XEN) Panic on CPU 1:
>>>> (XEN) Assertion 'svc->rqd == c2rqd(sched_unit_master(unit))' failed at common/sched/credit2.c:1159
>>>> (XEN) ****************************************
>>>>
>>>> This is happening as sched_move_domain() is setting a different cpu
>>>> for a scheduling unit without telling the scheduler. When this unit is
>>>> removed from the scheduler, the ASSERT() will trigger.
>>>>
>>>> In non-debug builds the result is usually a clobbered pointer, leading
>>>> to another crash a short time later.
>>>>
>>>> Fix that by swapping the two involved actions (setting another cpu and
>>>> removing the unit from the scheduler).
>>>>
>>>> Cc: Henry Wang <Henry.Wang@arm.com>
>>>
>>> Emmm, I think ^ this Cc is better to be moved below the scissors line,
>>> otherwise if this patch is committed, this line will be shown in the
>>> commit message...
>>>
>>>> Fixes: 70fadc41635b ("xen/cpupool: support moving domain between cpupools with different granularity")
>>>> Signed-off-by: Juergen Gross <jgross@suse.com>
>>>> ---
>>>> This fixes a regression introduced in Xen 4.15. The fix is very simple
>>>> and it will affect only configurations with multiple cpupools. I think
>>>> whether to include it in 4.18 should be decided by the release manager
>>>> based on the current state of the release (I think I wouldn't have
>>>> added it that late in the release while being the release manager).
>>>
>>> Thanks for the reminder :)
>>>
>>> Please correct me if I am wrong: if this is fixing a regression introduced
>>> in 4.15, shouldn't this patch be backported to 4.15, 4.16, 4.17 and soon
>>> 4.18? So honestly I think that, at least for 4.18, adding this patch now or
>>> later won't make much difference... I am ok either way, I guess.
>>
>> You are right, the patch needs to be backported.
>>
>> OTOH nobody noticed the regression until a few days ago. So delaying the fix
>> for a few weeks would probably not hurt too many people.
>
> I am planning to branch next week, so I would say this patch will probably
> be delayed by at most one week, I guess.
>
>> Each patch changing code carries the risk of introducing another
>> regression, so it's your decision whether you want to take this risk.
>> Especially changes like this one, touching the core scheduling code,
>> always carry a latent risk of opening up a small race window (this could
>> be said for the original patch I'm fixing here, too :-) ).
>
> That said, given that this patch is fixing a regression introduced years ago
> and will be backported to the stable releases, I am leaning more towards
> merging this patch when the tree reopens (unless others strongly object), so
> that it will get the opportunity to be tested properly and we won't take too
> much risk of delaying the 4.18 release. Also, this decision is consistent
> with your above words (as an ex-release-manager) below the scissors line :)
>
> Hopefully you will also be ok with that.

Of course I am.


Juergen
On 19/10/2023 12:23 pm, Juergen Gross wrote:
> When moving a domain out of a cpupool running with the credit2
> scheduler and having multiple run-queues, the following ASSERT() can
> be observed:
>
> (XEN) Xen call trace:
> (XEN) [<ffff82d04023a700>] R credit2.c#csched2_unit_remove+0xe3/0xe7
> (XEN) [<ffff82d040246adb>] S sched_move_domain+0x2f3/0x5b1
> (XEN) [<ffff82d040234cf7>] S cpupool.c#cpupool_move_domain_locked+0x1d/0x3b
> (XEN) [<ffff82d040236025>] S cpupool_move_domain+0x24/0x35
> (XEN) [<ffff82d040206513>] S domain_kill+0xa5/0x116
> (XEN) [<ffff82d040232b12>] S do_domctl+0xe5f/0x1951
> (XEN) [<ffff82d0402276ba>] S timer.c#timer_lock+0x69/0x143
> (XEN) [<ffff82d0402dc71b>] S pv_hypercall+0x44e/0x4a9
> (XEN) [<ffff82d0402012b7>] S lstar_enter+0x137/0x140
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 1:
> (XEN) Assertion 'svc->rqd == c2rqd(sched_unit_master(unit))' failed at common/sched/credit2.c:1159
> (XEN) ****************************************
>
> This is happening as sched_move_domain() is setting a different cpu
> for a scheduling unit without telling the scheduler. When this unit is
> removed from the scheduler, the ASSERT() will trigger.
>
> In non-debug builds the result is usually a clobbered pointer, leading
> to another crash a short time later.
>
> Fix that by swapping the two involved actions (setting another cpu and
> removing the unit from the scheduler).
>
> Cc: Henry Wang <Henry.Wang@arm.com>
> Fixes: 70fadc41635b ("xen/cpupool: support moving domain between cpupools with different granularity")

Link: https://github.com/Dasharo/dasharo-issues/issues/488

And a Reported/Tested-by if the discoverer wishes.

> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
> This fixes a regression introduced in Xen 4.15. The fix is very simple
> and it will affect only configurations with multiple cpupools. I think
> whether to include it in 4.18 should be decided by the release manager
> based on the current state of the release (I think I wouldn't have
> added it that late in the release while being the release manager).
> ---
>  xen/common/sched/core.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
> index 12deefa745..e9f7486197 100644
> --- a/xen/common/sched/core.c
> +++ b/xen/common/sched/core.c
> @@ -738,12 +738,13 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
>      new_p = cpumask_first(d->cpupool->cpu_valid);
>      for_each_sched_unit ( d, unit )
>      {
> -        spinlock_t *lock = unit_schedule_lock_irq(unit);
> +        spinlock_t *lock;
> +
> +        sched_remove_unit(old_ops, unit);
>
> +        lock = unit_schedule_lock_irq(unit);
>          sched_set_res(unit, get_sched_res(new_p));
>          spin_unlock_irq(lock);
> -
> -        sched_remove_unit(old_ops, unit);

I'm happy to see the T-by, and that you know what's going on here, but I
don't understand the explanation.

The change here is the ordering of sched_remove_unit() with respect to
the lock/get&set/unlock block.

sched_remove_unit() is moderately complicated, but the get&set is just an
RCU pointer assignment. But as the assertion fires in the latter action,
it must be the get&set that is causing problems.

And that's because new_p is actually a cpu number, which has the
consequence that Credit2's c2rqd() materialises the wrong
csched2_runqueue_data pointer, and then we're operating on someone else's
data without a suitable lock held.

And is it only by luck that none of the other schedulers tie something to
per-cpu data like this?
Other observations.

I think sched_move_domain() would be easier to follow with s/new_p/new_cpu/
(and similar for unit_p), as "p" is not obviously processor unless you
notice the cpumask_first() calls.

Why do we move a domain back to cpupool0 on domain destroy? There's nothing
magic about cpupool0 that I'm aware of here. It seems like unnecessary
complexity.

~Andrew
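To make the failure mode concrete, here is a deliberately simplified,
self-contained toy model of the invariant the ASSERT encodes (plain C;
rq_of_cpu(), unit_insert(), unit_remove() and the one-run-queue-per-cpu
layout are invented for illustration and are not the real Xen data
structures):

    /* Toy model of "a unit must still be found on the run-queue it is
     * queued on when it is removed from the scheduler".                */
    #include <assert.h>
    #include <stdio.h>

    #define NR_CPUS 4

    struct runqueue { int id; };

    static struct runqueue rqs[NR_CPUS];   /* one run-queue per cpu here */

    struct unit {
        unsigned int cpu;                  /* master cpu of the unit     */
        struct runqueue *rqd;              /* run-queue it is queued on  */
    };

    /* Stand-in for deriving the run-queue from the unit's cpu, the role
     * c2rqd() plays in credit2.                                          */
    static struct runqueue *rq_of_cpu(unsigned int cpu)
    {
        return &rqs[cpu];
    }

    static void unit_insert(struct unit *u)
    {
        u->rqd = rq_of_cpu(u->cpu);
    }

    static void unit_remove(struct unit *u)
    {
        /* Mirrors "svc->rqd == c2rqd(sched_unit_master(unit))". */
        assert(u->rqd == rq_of_cpu(u->cpu));
        u->rqd = NULL;
    }

    int main(void)
    {
        struct unit u = { .cpu = 1 };

        unit_insert(&u);     /* queued on cpu 1's run-queue              */

        /* Fixed ordering: remove from the old scheduler, then move.     */
        unit_remove(&u);
        u.cpu = 3;
        unit_insert(&u);

        /* Buggy ordering: move the cpu while still queued, then remove. */
        u.cpu = 0;           /* scheduler not told -> u.rqd is now stale */
        unit_remove(&u);     /* assert fires, like the Xen ASSERT        */

        printf("not reached\n");
        return 0;
    }

The second unit_remove() aborts for the same reason csched2_unit_remove()
does in the trace above: the run-queue derived from the unit's already
updated cpu no longer matches the run-queue the unit is still accounted on.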
On 20.10.23 20:03, Andrew Cooper wrote:
> On 19/10/2023 12:23 pm, Juergen Gross wrote:
>> When moving a domain out of a cpupool running with the credit2
>> scheduler and having multiple run-queues, the following ASSERT() can
>> be observed:
>>
>> (XEN) Xen call trace:
>> (XEN) [<ffff82d04023a700>] R credit2.c#csched2_unit_remove+0xe3/0xe7
>> (XEN) [<ffff82d040246adb>] S sched_move_domain+0x2f3/0x5b1
>> (XEN) [<ffff82d040234cf7>] S cpupool.c#cpupool_move_domain_locked+0x1d/0x3b
>> (XEN) [<ffff82d040236025>] S cpupool_move_domain+0x24/0x35
>> (XEN) [<ffff82d040206513>] S domain_kill+0xa5/0x116
>> (XEN) [<ffff82d040232b12>] S do_domctl+0xe5f/0x1951
>> (XEN) [<ffff82d0402276ba>] S timer.c#timer_lock+0x69/0x143
>> (XEN) [<ffff82d0402dc71b>] S pv_hypercall+0x44e/0x4a9
>> (XEN) [<ffff82d0402012b7>] S lstar_enter+0x137/0x140
>> (XEN)
>> (XEN)
>> (XEN) ****************************************
>> (XEN) Panic on CPU 1:
>> (XEN) Assertion 'svc->rqd == c2rqd(sched_unit_master(unit))' failed at common/sched/credit2.c:1159
>> (XEN) ****************************************
>>
>> This is happening as sched_move_domain() is setting a different cpu
>> for a scheduling unit without telling the scheduler. When this unit is
>> removed from the scheduler, the ASSERT() will trigger.
>>
>> In non-debug builds the result is usually a clobbered pointer, leading
>> to another crash a short time later.
>>
>> Fix that by swapping the two involved actions (setting another cpu and
>> removing the unit from the scheduler).
>>
>> Cc: Henry Wang <Henry.Wang@arm.com>
>> Fixes: 70fadc41635b ("xen/cpupool: support moving domain between cpupools with different granularity")
>
> Link: https://github.com/Dasharo/dasharo-issues/issues/488
>
> And a Reported/Tested-by if the discoverer wishes.
>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> ---
>> This fixes a regression introduced in Xen 4.15. The fix is very simple
>> and it will affect only configurations with multiple cpupools. I think
>> whether to include it in 4.18 should be decided by the release manager
>> based on the current state of the release (I think I wouldn't have
>> added it that late in the release while being the release manager).
>> ---
>>  xen/common/sched/core.c | 7 ++++---
>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
>> index 12deefa745..e9f7486197 100644
>> --- a/xen/common/sched/core.c
>> +++ b/xen/common/sched/core.c
>> @@ -738,12 +738,13 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
>>      new_p = cpumask_first(d->cpupool->cpu_valid);
>>      for_each_sched_unit ( d, unit )
>>      {
>> -        spinlock_t *lock = unit_schedule_lock_irq(unit);
>> +        spinlock_t *lock;
>> +
>> +        sched_remove_unit(old_ops, unit);
>>
>> +        lock = unit_schedule_lock_irq(unit);
>>          sched_set_res(unit, get_sched_res(new_p));
>>          spin_unlock_irq(lock);
>> -
>> -        sched_remove_unit(old_ops, unit);
>
> I'm happy to see the T-by, and that you know what's going on here, but I
> don't understand the explanation.
>
> The change here is the ordering of sched_remove_unit() with respect to
> the lock/get&set/unlock block.

Yes.

> sched_remove_unit() is moderately complicated, but the get&set is just an
> RCU pointer assignment. But as the assertion fires in the latter action,
> it must be the get&set that is causing problems.

Yes. As I wrote, it is because the cpu assignment of the unit is being
changed without telling the specific scheduler (credit2) about it.
> And that's because new_p is actually a cpu number, which has the
> consequence that Credit2's c2rqd() materialises the wrong
> csched2_runqueue_data pointer, and then we're operating on someone else's
> data without a suitable lock held.

I haven't looked into the exact details of what went wrong further down the
road, but I think it is not only the missing lock, but probably also the
fact that, depending on the use case, different run-queues will be addressed
for the same unit, causing inconsistencies.

> And is it only by luck that none of the other schedulers tie something to
> per-cpu data like this?

I didn't look into the other schedulers.

> Other observations.
>
> I think sched_move_domain() would be easier to follow with s/new_p/new_cpu/
> (and similar for unit_p), as "p" is not obviously processor unless you
> notice the cpumask_first() calls.

Maybe. I don't have that problem, OTOH I've written that code. :-D

> Why do we move a domain back to cpupool0 on domain destroy? There's nothing
> magic about cpupool0 that I'm aware of here. It seems like unnecessary
> complexity.

See commit bac6334b51d9.


Juergen
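In terms of the calls touched by the patch, the difference between the two
orderings can be sketched as follows (the code lines are taken from the hunk
below; the annotations are added interpretation for this summary, not
comments from the Xen tree):

    /* Before the fix: the unit is re-pointed at a cpu of the new pool
     * while it is still enqueued with the old scheduler.                */
    lock = unit_schedule_lock_irq(unit);
    sched_set_res(unit, get_sched_res(new_p));  /* master cpu changes    */
    spin_unlock_irq(lock);
    sched_remove_unit(old_ops, unit);           /* credit2 derives the
                                                 * run-queue from the new
                                                 * cpu while svc->rqd
                                                 * still points at the
                                                 * old one -> ASSERT     */

    /* After the fix: detach the unit from the old scheduler first, then
     * assign the cpu of the new pool.                                   */
    sched_remove_unit(old_ops, unit);           /* run-queues consistent */
    lock = unit_schedule_lock_irq(unit);
    sched_set_res(unit, get_sched_res(new_p));
    spin_unlock_irq(lock);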
diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index 12deefa745..e9f7486197 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -738,12 +738,13 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
     new_p = cpumask_first(d->cpupool->cpu_valid);
     for_each_sched_unit ( d, unit )
     {
-        spinlock_t *lock = unit_schedule_lock_irq(unit);
+        spinlock_t *lock;
+
+        sched_remove_unit(old_ops, unit);
 
+        lock = unit_schedule_lock_irq(unit);
         sched_set_res(unit, get_sched_res(new_p));
         spin_unlock_irq(lock);
-
-        sched_remove_unit(old_ops, unit);
     }
 
     old_units = d->sched_unit_list;
When moving a domain out of a cpupool running with the credit2
scheduler and having multiple run-queues, the following ASSERT() can
be observed:

(XEN) Xen call trace:
(XEN) [<ffff82d04023a700>] R credit2.c#csched2_unit_remove+0xe3/0xe7
(XEN) [<ffff82d040246adb>] S sched_move_domain+0x2f3/0x5b1
(XEN) [<ffff82d040234cf7>] S cpupool.c#cpupool_move_domain_locked+0x1d/0x3b
(XEN) [<ffff82d040236025>] S cpupool_move_domain+0x24/0x35
(XEN) [<ffff82d040206513>] S domain_kill+0xa5/0x116
(XEN) [<ffff82d040232b12>] S do_domctl+0xe5f/0x1951
(XEN) [<ffff82d0402276ba>] S timer.c#timer_lock+0x69/0x143
(XEN) [<ffff82d0402dc71b>] S pv_hypercall+0x44e/0x4a9
(XEN) [<ffff82d0402012b7>] S lstar_enter+0x137/0x140
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) Assertion 'svc->rqd == c2rqd(sched_unit_master(unit))' failed at common/sched/credit2.c:1159
(XEN) ****************************************

This is happening as sched_move_domain() is setting a different cpu
for a scheduling unit without telling the scheduler. When this unit is
removed from the scheduler, the ASSERT() will trigger.

In non-debug builds the result is usually a clobbered pointer, leading
to another crash a short time later.

Fix that by swapping the two involved actions (setting another cpu and
removing the unit from the scheduler).

Cc: Henry Wang <Henry.Wang@arm.com>
Fixes: 70fadc41635b ("xen/cpupool: support moving domain between cpupools with different granularity")
Signed-off-by: Juergen Gross <jgross@suse.com>
---
This fixes a regression introduced in Xen 4.15. The fix is very simple
and it will affect only configurations with multiple cpupools. I think
whether to include it in 4.18 should be decided by the release manager
based on the current state of the release (I think I wouldn't have
added it that late in the release while being the release manager).
---
 xen/common/sched/core.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)