Message ID | 20230501203046.168856-1-stewart.hildebrand@amd.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v1] xen/sched/null: avoid crash after failed domU creation | expand |
On 01.05.23 22:30, Stewart Hildebrand wrote: > When creating a domU, but the creation fails, there is a corner case that may > lead to a crash in the null scheduler when running a debug build of Xen. > > (XEN) **************************************** > (XEN) Panic on CPU 0: > (XEN) Assertion 'npc->unit == unit' failed at common/sched/null.c:379 > (XEN) **************************************** > > The events leading to the crash are: > > * null_unit_insert() was invoked with the unit offline. Since the unit was > offline, unit_assign() was not called, and null_unit_insert() returned. > * Later during domain creation, the unit was onlined > * Eventually, domain creation failed due to bad configuration > * null_unit_remove() was invoked with the unit still online. Since the unit was > online, it called unit_deassign() and triggered an ASSERT. > > To fix this, only call unit_deassign() when npc->unit is non-NULL in > null_unit_remove. > > Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> Reviewed-by: Juergen Gross <jgross@suse.com> Juergen
On 5/5/23 01:59, Juergen Gross wrote: > On 01.05.23 22:30, Stewart Hildebrand wrote: >> When creating a domU, but the creation fails, there is a corner case that may >> lead to a crash in the null scheduler when running a debug build of Xen. >> >> (XEN) **************************************** >> (XEN) Panic on CPU 0: >> (XEN) Assertion 'npc->unit == unit' failed at common/sched/null.c:379 >> (XEN) **************************************** >> >> The events leading to the crash are: >> >> * null_unit_insert() was invoked with the unit offline. Since the unit was >> offline, unit_assign() was not called, and null_unit_insert() returned. >> * Later during domain creation, the unit was onlined >> * Eventually, domain creation failed due to bad configuration >> * null_unit_remove() was invoked with the unit still online. Since the unit was >> online, it called unit_deassign() and triggered an ASSERT. >> >> To fix this, only call unit_deassign() when npc->unit is non-NULL in >> null_unit_remove. >> >> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> > > Reviewed-by: Juergen Gross <jgross@suse.com> Thanks for the review. Does this still need a maintainer ack?
On 18.05.2023 23:27, Stewart Hildebrand wrote: > On 5/5/23 01:59, Juergen Gross wrote: >> On 01.05.23 22:30, Stewart Hildebrand wrote: >>> When creating a domU, but the creation fails, there is a corner case that may >>> lead to a crash in the null scheduler when running a debug build of Xen. >>> >>> (XEN) **************************************** >>> (XEN) Panic on CPU 0: >>> (XEN) Assertion 'npc->unit == unit' failed at common/sched/null.c:379 >>> (XEN) **************************************** >>> >>> The events leading to the crash are: >>> >>> * null_unit_insert() was invoked with the unit offline. Since the unit was >>> offline, unit_assign() was not called, and null_unit_insert() returned. >>> * Later during domain creation, the unit was onlined >>> * Eventually, domain creation failed due to bad configuration >>> * null_unit_remove() was invoked with the unit still online. Since the unit was >>> online, it called unit_deassign() and triggered an ASSERT. >>> >>> To fix this, only call unit_deassign() when npc->unit is non-NULL in >>> null_unit_remove. >>> >>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> >> >> Reviewed-by: Juergen Gross <jgross@suse.com> > > Thanks for the review. Does this still need a maintainer ack? In principle yes. I might be willing to time out at some point, but not before at least one ping was sent (and some more time has passed afterwards). Jan
On Thu, 2023-05-18 at 17:27 -0400, Stewart Hildebrand wrote: > On 5/5/23 01:59, Juergen Gross wrote: > > > > > > Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> > > > > Reviewed-by: Juergen Gross <jgross@suse.com> > > Thanks for the review. Does this still need a maintainer ack? > Acked-by: Dario Faggioli <dfaggioli@suse.com> Regards
diff --git a/xen/common/sched/null.c b/xen/common/sched/null.c index 65a0a6c5312d..2091337fcd06 100644 --- a/xen/common/sched/null.c +++ b/xen/common/sched/null.c @@ -522,6 +522,8 @@ static void cf_check null_unit_remove( { struct null_private *prv = null_priv(ops); struct null_unit *nvc = null_unit(unit); + struct null_pcpu *npc; + unsigned int cpu; spinlock_t *lock; ASSERT(!is_idle_unit(unit)); @@ -531,8 +533,6 @@ static void cf_check null_unit_remove( /* If offline, the unit shouldn't be assigned, nor in the waitqueue */ if ( unlikely(!is_unit_online(unit)) ) { - struct null_pcpu *npc; - npc = unit->res->sched_priv; ASSERT(npc->unit != unit); ASSERT(list_empty(&nvc->waitq_elem)); @@ -549,7 +549,10 @@ static void cf_check null_unit_remove( goto out; } - unit_deassign(prv, unit); + cpu = sched_unit_master(unit); + npc = get_sched_res(cpu)->sched_priv; + if ( npc->unit ) + unit_deassign(prv, unit); out: unit_schedule_unlock_irq(lock, unit);
When creating a domU, but the creation fails, there is a corner case that may lead to a crash in the null scheduler when running a debug build of Xen. (XEN) **************************************** (XEN) Panic on CPU 0: (XEN) Assertion 'npc->unit == unit' failed at common/sched/null.c:379 (XEN) **************************************** The events leading to the crash are: * null_unit_insert() was invoked with the unit offline. Since the unit was offline, unit_assign() was not called, and null_unit_insert() returned. * Later during domain creation, the unit was onlined * Eventually, domain creation failed due to bad configuration * null_unit_remove() was invoked with the unit still online. Since the unit was online, it called unit_deassign() and triggered an ASSERT. To fix this, only call unit_deassign() when npc->unit is non-NULL in null_unit_remove. Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> --- RFC->v1 * Follow Juergen's suggested fix Link to RFC [1] [1] https://lists.xenproject.org/archives/html/xen-devel/2023-04/msg01387.html --- xen/common/sched/null.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)