diff mbox series

[3/6] xen: add new cpu notifier action CPU_RESUME_FAILED

Message ID 20190318131155.29450-4-jgross@suse.com (mailing list archive)
State Superseded
Headers show
Series xen: simplify suspend/resume handling | expand

Commit Message

Jürgen Groß March 18, 2019, 1:11 p.m. UTC
Add a new cpu notifier action CPU_RESUME_FAILED which is called for all
cpus which failed to come up at resume. The calls will be done after
all other cpus are already up.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/common/cpu.c      |  5 +++++
 xen/include/xen/cpu.h | 20 +++++++++++---------
 2 files changed, 16 insertions(+), 9 deletions(-)

Comments

Dario Faggioli March 25, 2019, 12:21 p.m. UTC | #1
On Mon, 2019-03-18 at 14:11 +0100, Juergen Gross wrote:
> --- a/xen/include/xen/cpu.h
> +++ b/xen/include/xen/cpu.h
> @@ -32,23 +32,25 @@ void register_cpu_notifier(struct notifier_block
> *nb);
>   *  (a) A CPU is going down; or (b) CPU_UP_CANCELED
>   */
>  /* CPU_UP_PREPARE: Preparing to bring CPU online. */
> -#define CPU_UP_PREPARE   (0x0001 | NOTIFY_FORWARD)
> +#define CPU_UP_PREPARE    (0x0001 | NOTIFY_FORWARD)
>
In the comment block before these definitions, there's this:

 * Possible event sequences for a given CPU:
 *  CPU_UP_PREPARE -> CPU_UP_CANCELLED           -- failed CPU up
 *  CPU_UP_PREPARE -> CPU_STARTING -> CPU_ONLINE -- successful CPU up
 *  CPU_DOWN_PREPARE -> CPU_DOWN_FAILED          -- failed CPU down
 *  CPU_DOWN_PREPARE -> CPU_DYING -> CPU_DEAD    -- successful CPU down

Shouldn't we add a line for this new hook? Something, IIUIC, like:

 CPU_UP_PREPARE -> CPU_UP_CANCELLED -> CPU_RESUME_FAILED --CPU not resuming

With this, FWIW,

Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

One more (minor) thing...

>  /* CPU_REMOVE: CPU was removed. */
> -#define CPU_REMOVE       (0x0009 | NOTIFY_REVERSE)
> +#define CPU_REMOVE        (0x0009 | NOTIFY_REVERSE)
> +/* CPU_RESUME_FAILED: CPU failed to come up in resume, all other CPUs up. */
> +#define CPU_RESUME_FAILED (0x000a | NOTIFY_REVERSE)
>  
... technically, when we're dealing with CPU_RESUME_FAILED on one CPU,
we don't know if _all_ others really went up, so I think I'd remove
what follows the ','.

Regards,
Dario
Jürgen Groß March 25, 2019, 12:29 p.m. UTC | #2
On 25/03/2019 13:21, Dario Faggioli wrote:
> On Mon, 2019-03-18 at 14:11 +0100, Juergen Gross wrote:
>> --- a/xen/include/xen/cpu.h
>> +++ b/xen/include/xen/cpu.h
>> @@ -32,23 +32,25 @@ void register_cpu_notifier(struct notifier_block
>> *nb);
>>   *  (a) A CPU is going down; or (b) CPU_UP_CANCELED
>>   */
>>  /* CPU_UP_PREPARE: Preparing to bring CPU online. */
>> -#define CPU_UP_PREPARE   (0x0001 | NOTIFY_FORWARD)
>> +#define CPU_UP_PREPARE    (0x0001 | NOTIFY_FORWARD)
>>
> In the comment block before these definitions, there's this:
> 
>  * Possible event sequences for a given CPU:
>  *  CPU_UP_PREPARE -> CPU_UP_CANCELLED           -- failed CPU up
>  *  CPU_UP_PREPARE -> CPU_STARTING -> CPU_ONLINE -- successful CPU up
>  *  CPU_DOWN_PREPARE -> CPU_DOWN_FAILED          -- failed CPU down
>  *  CPU_DOWN_PREPARE -> CPU_DYING -> CPU_DEAD    -- successful CPU down
> 
> Shouldn't we add a line for this new hook? Something, IIUIC, like:
> 
>  CPU_UP_PREPARE -> CPU_UP_CANCELLED -> CPU_RESUME_FAILED --CPU not resuming

Fine with me.

> 
> With this, FWIW,
> 
> Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
> 
> One more (minor) thing...
> 
>>  /* CPU_REMOVE: CPU was removed. */
>> -#define CPU_REMOVE       (0x0009 | NOTIFY_REVERSE)
>> +#define CPU_REMOVE        (0x0009 | NOTIFY_REVERSE)
>> +/* CPU_RESUME_FAILED: CPU failed to come up in resume, all other CPUs up. */
>> +#define CPU_RESUME_FAILED (0x000a | NOTIFY_REVERSE)
>>  
> ... technically, when we're dealing with CPU_RESUME_FAILED on one CPU,
> we don't know if _all_ others really went up, so I think I'd remove
> what follows the ','.

The point is that for the CPU_RESUME_FAILED case we can be sure that no
cpu will come up due to resume just a little bit later. So we can test
for e.g. a cpupool suddenly having no more cpus available. This is in
contrast to CPU_UP_CANCELLED being signalled just after the one cpu
failing to come up, but before the next cpu is triggered to come up.


Juergen
George Dunlap March 27, 2019, 3:49 p.m. UTC | #3
On 3/18/19 1:11 PM, Juergen Gross wrote:
> Add a new cpu notifier action CPU_RESUME_FAILED which is called for all
> cpus which failed to come up at resume. The calls will be done after
> all other cpus are already up.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>

Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli March 27, 2019, 3:54 p.m. UTC | #4
On Mon, 2019-03-25 at 13:29 +0100, Juergen Gross wrote:
> On 25/03/2019 13:21, Dario Faggioli wrote:
>
> > Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
> > 
> > One more (minor) thing...
> > 
> > >  /* CPU_REMOVE: CPU was removed. */
> > > -#define CPU_REMOVE       (0x0009 | NOTIFY_REVERSE)
> > > +#define CPU_REMOVE        (0x0009 | NOTIFY_REVERSE)
> > > +/* CPU_RESUME_FAILED: CPU failed to come up in resume, all other
> > > CPUs up. */
> > > +#define CPU_RESUME_FAILED (0x000a | NOTIFY_REVERSE)
> > >  
> > ... technically, when we're dealing with CPU_RESUME_FAILED on one
> > CPU,
> > we don't know if _all_ others really went up, so I think I'd remove
> > what follows the ','.
> 
> The point is that for the CPU_RESUME_FAILED case we can be sure that
> no
> cpu will come up due to resume just a little bit later.
>
Ah, I see what you mean... that's the fact that this notifier is
invoked from another loop, and although we don't know which CPU did
manage to come up and which don't, we do know that all the ones that
could come up, are up already.

>  So we can test
> for e.g. a cpupool suddenly having no more cpus available. This is in
> contrast to CPU_UP_CANCELLED being signalled just after the one cpu
> failing to come up, but before the next cpu is triggered to come up.
> 
Right.

Well, it still looks to me that "all other CPUs up" is not entirely
accurate. But I can't propose a better alternative (unless we write
something very long, which is probably not worth it).

Perhaps you can explain this a little in another comment, like in
enable_nonboot_cpus(), before the for_each_cpu() loop itself.

But I don't feel too strong about that, and the RoB stands, whatever
you decide to do.

Regards,
Dario
Jan Beulich March 27, 2019, 4:29 p.m. UTC | #5
>>> On 18.03.19 at 14:11, <jgross@suse.com> wrote:
> --- a/xen/common/cpu.c
> +++ b/xen/common/cpu.c
> @@ -214,7 +214,12 @@ void enable_nonboot_cpus(void)
>              printk("Error bringing CPU%d up: %d\n", cpu, error);
>              BUG_ON(error == -EBUSY);
>          }
> +        else
> +            __cpumask_clear_cpu(cpu, &frozen_cpus);
>      }
>  
> +    for_each_cpu ( cpu, &frozen_cpus )
> +        BUG_ON(cpu_notifier_call_chain(cpu, CPU_RESUME_FAILED, NULL));
> +
>      cpumask_clear(&frozen_cpus);
>  }

Is there a particular reason you add a second loop here?

Jan
Jürgen Groß March 27, 2019, 4:32 p.m. UTC | #6
On 27/03/2019 17:29, Jan Beulich wrote:
>>>> On 18.03.19 at 14:11, <jgross@suse.com> wrote:
>> --- a/xen/common/cpu.c
>> +++ b/xen/common/cpu.c
>> @@ -214,7 +214,12 @@ void enable_nonboot_cpus(void)
>>              printk("Error bringing CPU%d up: %d\n", cpu, error);
>>              BUG_ON(error == -EBUSY);
>>          }
>> +        else
>> +            __cpumask_clear_cpu(cpu, &frozen_cpus);
>>      }
>>  
>> +    for_each_cpu ( cpu, &frozen_cpus )
>> +        BUG_ON(cpu_notifier_call_chain(cpu, CPU_RESUME_FAILED, NULL));
>> +
>>      cpumask_clear(&frozen_cpus);
>>  }
> 
> Is there a particular reason you add a second loop here?

Yes, I want to know which cpus did come up again.


Juergen
diff mbox series

Patch

diff --git a/xen/common/cpu.c b/xen/common/cpu.c
index c436c0de7f..f3cf9463b4 100644
--- a/xen/common/cpu.c
+++ b/xen/common/cpu.c
@@ -214,7 +214,12 @@  void enable_nonboot_cpus(void)
             printk("Error bringing CPU%d up: %d\n", cpu, error);
             BUG_ON(error == -EBUSY);
         }
+        else
+            __cpumask_clear_cpu(cpu, &frozen_cpus);
     }
 
+    for_each_cpu ( cpu, &frozen_cpus )
+        BUG_ON(cpu_notifier_call_chain(cpu, CPU_RESUME_FAILED, NULL));
+
     cpumask_clear(&frozen_cpus);
 }
diff --git a/xen/include/xen/cpu.h b/xen/include/xen/cpu.h
index 2fe3ec05d8..2fc0cb1bb5 100644
--- a/xen/include/xen/cpu.h
+++ b/xen/include/xen/cpu.h
@@ -32,23 +32,25 @@  void register_cpu_notifier(struct notifier_block *nb);
  *  (a) A CPU is going down; or (b) CPU_UP_CANCELED
  */
 /* CPU_UP_PREPARE: Preparing to bring CPU online. */
-#define CPU_UP_PREPARE   (0x0001 | NOTIFY_FORWARD)
+#define CPU_UP_PREPARE    (0x0001 | NOTIFY_FORWARD)
 /* CPU_UP_CANCELED: CPU is no longer being brought online. */
-#define CPU_UP_CANCELED  (0x0002 | NOTIFY_REVERSE)
+#define CPU_UP_CANCELED   (0x0002 | NOTIFY_REVERSE)
 /* CPU_STARTING: CPU nearly online. Runs on new CPU, irqs still disabled. */
-#define CPU_STARTING     (0x0003 | NOTIFY_FORWARD)
+#define CPU_STARTING      (0x0003 | NOTIFY_FORWARD)
 /* CPU_ONLINE: CPU is up. */
-#define CPU_ONLINE       (0x0004 | NOTIFY_FORWARD)
+#define CPU_ONLINE        (0x0004 | NOTIFY_FORWARD)
 /* CPU_DOWN_PREPARE: CPU is going down. */
-#define CPU_DOWN_PREPARE (0x0005 | NOTIFY_REVERSE)
+#define CPU_DOWN_PREPARE  (0x0005 | NOTIFY_REVERSE)
 /* CPU_DOWN_FAILED: CPU is no longer going down. */
-#define CPU_DOWN_FAILED  (0x0006 | NOTIFY_FORWARD)
+#define CPU_DOWN_FAILED   (0x0006 | NOTIFY_FORWARD)
 /* CPU_DYING: CPU is nearly dead (in stop_machine context). */
-#define CPU_DYING        (0x0007 | NOTIFY_REVERSE)
+#define CPU_DYING         (0x0007 | NOTIFY_REVERSE)
 /* CPU_DEAD: CPU is dead. */
-#define CPU_DEAD         (0x0008 | NOTIFY_REVERSE)
+#define CPU_DEAD          (0x0008 | NOTIFY_REVERSE)
 /* CPU_REMOVE: CPU was removed. */
-#define CPU_REMOVE       (0x0009 | NOTIFY_REVERSE)
+#define CPU_REMOVE        (0x0009 | NOTIFY_REVERSE)
+/* CPU_RESUME_FAILED: CPU failed to come up in resume, all other CPUs up. */
+#define CPU_RESUME_FAILED (0x000a | NOTIFY_REVERSE)
 
 /* Perform CPU hotplug. May return -EAGAIN. */
 int cpu_down(unsigned int cpu);