diff mbox series

[v1] xen/sched/null: avoid crash after failed domU creation

Message ID 20230501203046.168856-1-stewart.hildebrand@amd.com (mailing list archive)
State New, archived
Headers show
Series [v1] xen/sched/null: avoid crash after failed domU creation | expand

Commit Message

Stewart Hildebrand May 1, 2023, 8:30 p.m. UTC
When creating a domU, but the creation fails, there is a corner case that may
lead to a crash in the null scheduler when running a debug build of Xen.

(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Assertion 'npc->unit == unit' failed at common/sched/null.c:379
(XEN) ****************************************

The events leading to the crash are:

* null_unit_insert() was invoked with the unit offline. Since the unit was
  offline, unit_assign() was not called, and null_unit_insert() returned.
* Later during domain creation, the unit was onlined
* Eventually, domain creation failed due to bad configuration
* null_unit_remove() was invoked with the unit still online. Since the unit was
  online, it called unit_deassign() and triggered an ASSERT.

To fix this, only call unit_deassign() when npc->unit is non-NULL in
null_unit_remove.

Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
RFC->v1
* Follow Juergen's suggested fix

Link to RFC [1]

[1] https://lists.xenproject.org/archives/html/xen-devel/2023-04/msg01387.html
---
 xen/common/sched/null.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Comments

Jürgen Groß May 5, 2023, 5:59 a.m. UTC | #1
On 01.05.23 22:30, Stewart Hildebrand wrote:
> When creating a domU, but the creation fails, there is a corner case that may
> lead to a crash in the null scheduler when running a debug build of Xen.
> 
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) Assertion 'npc->unit == unit' failed at common/sched/null.c:379
> (XEN) ****************************************
> 
> The events leading to the crash are:
> 
> * null_unit_insert() was invoked with the unit offline. Since the unit was
>    offline, unit_assign() was not called, and null_unit_insert() returned.
> * Later during domain creation, the unit was onlined
> * Eventually, domain creation failed due to bad configuration
> * null_unit_remove() was invoked with the unit still online. Since the unit was
>    online, it called unit_deassign() and triggered an ASSERT.
> 
> To fix this, only call unit_deassign() when npc->unit is non-NULL in
> null_unit_remove.
> 
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>

Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen
Stewart Hildebrand May 18, 2023, 9:27 p.m. UTC | #2
On 5/5/23 01:59, Juergen Gross wrote:
> On 01.05.23 22:30, Stewart Hildebrand wrote:
>> When creating a domU, but the creation fails, there is a corner case that may
>> lead to a crash in the null scheduler when running a debug build of Xen.
>>
>> (XEN) ****************************************
>> (XEN) Panic on CPU 0:
>> (XEN) Assertion 'npc->unit == unit' failed at common/sched/null.c:379
>> (XEN) ****************************************
>>
>> The events leading to the crash are:
>>
>> * null_unit_insert() was invoked with the unit offline. Since the unit was
>>    offline, unit_assign() was not called, and null_unit_insert() returned.
>> * Later during domain creation, the unit was onlined
>> * Eventually, domain creation failed due to bad configuration
>> * null_unit_remove() was invoked with the unit still online. Since the unit was
>>    online, it called unit_deassign() and triggered an ASSERT.
>>
>> To fix this, only call unit_deassign() when npc->unit is non-NULL in
>> null_unit_remove.
>>
>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> 
> Reviewed-by: Juergen Gross <jgross@suse.com>

Thanks for the review. Does this still need a maintainer ack?
Jan Beulich May 19, 2023, 8:33 a.m. UTC | #3
On 18.05.2023 23:27, Stewart Hildebrand wrote:
> On 5/5/23 01:59, Juergen Gross wrote:
>> On 01.05.23 22:30, Stewart Hildebrand wrote:
>>> When creating a domU, but the creation fails, there is a corner case that may
>>> lead to a crash in the null scheduler when running a debug build of Xen.
>>>
>>> (XEN) ****************************************
>>> (XEN) Panic on CPU 0:
>>> (XEN) Assertion 'npc->unit == unit' failed at common/sched/null.c:379
>>> (XEN) ****************************************
>>>
>>> The events leading to the crash are:
>>>
>>> * null_unit_insert() was invoked with the unit offline. Since the unit was
>>>    offline, unit_assign() was not called, and null_unit_insert() returned.
>>> * Later during domain creation, the unit was onlined
>>> * Eventually, domain creation failed due to bad configuration
>>> * null_unit_remove() was invoked with the unit still online. Since the unit was
>>>    online, it called unit_deassign() and triggered an ASSERT.
>>>
>>> To fix this, only call unit_deassign() when npc->unit is non-NULL in
>>> null_unit_remove.
>>>
>>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>
>> Reviewed-by: Juergen Gross <jgross@suse.com>
> 
> Thanks for the review. Does this still need a maintainer ack?

In principle yes. I might be willing to time out at some point, but
not before at least one ping was sent (and some more time has passed
afterwards).

Jan
Dario Faggioli May 22, 2023, 8:49 a.m. UTC | #4
On Thu, 2023-05-18 at 17:27 -0400, Stewart Hildebrand wrote:
> On 5/5/23 01:59, Juergen Gross wrote:
> > > 
> > > Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> > 
> > Reviewed-by: Juergen Gross <jgross@suse.com>
> 
> Thanks for the review. Does this still need a maintainer ack?
>
Acked-by: Dario Faggioli <dfaggioli@suse.com>

Regards
diff mbox series

Patch

diff --git a/xen/common/sched/null.c b/xen/common/sched/null.c
index 65a0a6c5312d..2091337fcd06 100644
--- a/xen/common/sched/null.c
+++ b/xen/common/sched/null.c
@@ -522,6 +522,8 @@  static void cf_check null_unit_remove(
 {
     struct null_private *prv = null_priv(ops);
     struct null_unit *nvc = null_unit(unit);
+    struct null_pcpu *npc;
+    unsigned int cpu;
     spinlock_t *lock;
 
     ASSERT(!is_idle_unit(unit));
@@ -531,8 +533,6 @@  static void cf_check null_unit_remove(
     /* If offline, the unit shouldn't be assigned, nor in the waitqueue */
     if ( unlikely(!is_unit_online(unit)) )
     {
-        struct null_pcpu *npc;
-
         npc = unit->res->sched_priv;
         ASSERT(npc->unit != unit);
         ASSERT(list_empty(&nvc->waitq_elem));
@@ -549,7 +549,10 @@  static void cf_check null_unit_remove(
         goto out;
     }
 
-    unit_deassign(prv, unit);
+    cpu = sched_unit_master(unit);
+    npc = get_sched_res(cpu)->sched_priv;
+    if ( npc->unit )
+        unit_deassign(prv, unit);
 
  out:
     unit_schedule_unlock_irq(lock, unit);