xen: arm: Don't use stop_cpu() in halt_this_cpu()

Message ID	20220623074428.226719-1-dmitry.semenets@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <xen-devel-bounces@lists.xenproject.org> Errors-To: xen-devel-bounces@lists.xenproject.org Precedence: list Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org> From: dmitry.semenets@gmail.com To: xen-devel@lists.xenproject.org Cc: Dmytro Semenets <dmytro_semenets@epam.com>, Stefano Stabellini <sstabellini@kernel.org>, Julien Grall <julien@xen.org>, Bertrand Marquis <bertrand.marquis@arm.com>, Volodymyr Babchuk <Volodymyr_Babchuk@epam.com> Subject: [PATCH] xen: arm: Don't use stop_cpu() in halt_this_cpu() Date: Thu, 23 Jun 2022 10:44:28 +0300 Message-Id: <20220623074428.226719-1-dmitry.semenets@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	xen: arm: Don't use stop_cpu() in halt_this_cpu() \| expand xen: arm: Don't use stop_cpu() in halt_this_cpu()

Dmytro Semenets June 23, 2022, 7:44 a.m. UTC

From: Dmytro Semenets <dmytro_semenets@epam.com>

When shutting down (or rebooting) the platform, Xen will call stop_cpu()
on all the CPUs but one. The last CPU will then request the system to
shutdown/restart.

On platform using PSCI, stop_cpu() will call PSCI CPU off. Per the spec
(section 5.5.2 DEN0022D.b), the call could return DENIED if the Trusted
OS is resident on the CPU that is about to be turned off.

As Xen doesn't migrate off the trusted OS (which BTW may not be
migratable), it would be possible to hit the panic().

In the ideal situation, Xen should migrate the trusted OS or make sure
the CPU off is not called. However, when shutting down (or rebooting)
the platform, it is pointless to try to turn off all the CPUs (per
section 5.10.2, it is only required to put the core in a known state).

So solve the problem by open-coding stop_cpu() in halt_this_cpu() and
not call PSCI CPU off.

Signed-off-by: Dmytro Semenets <dmytro_semenets@epam.com>
---
 xen/arch/arm/shutdown.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Stefano Stabellini June 23, 2022, 10:07 p.m. UTC | #1

On Thu, 23 Jun 2022, dmitry.semenets@gmail.com wrote:
> From: Dmytro Semenets <dmytro_semenets@epam.com>
> 
> When shutting down (or rebooting) the platform, Xen will call stop_cpu()
> on all the CPUs but one. The last CPU will then request the system to
> shutdown/restart.
> 
> On platform using PSCI, stop_cpu() will call PSCI CPU off. Per the spec
> (section 5.5.2 DEN0022D.b), the call could return DENIED if the Trusted
> OS is resident on the CPU that is about to be turned off.
> 
> As Xen doesn't migrate off the trusted OS (which BTW may not be
> migratable), it would be possible to hit the panic().
> 
> In the ideal situation, Xen should migrate the trusted OS or make sure
> the CPU off is not called. However, when shutting down (or rebooting)
> the platform, it is pointless to try to turn off all the CPUs (per
> section 5.10.2, it is only required to put the core in a known state).
> 
> So solve the problem by open-coding stop_cpu() in halt_this_cpu() and
> not call PSCI CPU off.
> 
> Signed-off-by: Dmytro Semenets <dmytro_semenets@epam.com>
> ---
>  xen/arch/arm/shutdown.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/arm/shutdown.c b/xen/arch/arm/shutdown.c
> index 3dc6819d56..a9aea19e8e 100644
> --- a/xen/arch/arm/shutdown.c
> +++ b/xen/arch/arm/shutdown.c
> @@ -8,7 +8,12 @@
>  
>  static void noreturn halt_this_cpu(void *arg)
>  {
> -    stop_cpu();
> +    local_irq_disable();
> +    /* Make sure the write happens before we sleep forever */
> +    dsb(sy);
> +    isb();
> +    while ( 1 )
> +        wfi();
>  }

stop_cpu has already a wfi loop just after the psci call:

    call_psci_cpu_off();

    while ( 1 )
        wfi();

So wouldn't it be better to remove the panic from the implementation of
call_psci_cpu_off?

The reason I am saying this is that stop_cpu is called in a number of
places beyond halt_this_cpu and as far as I can tell any of them could
trigger the panic. (I admit they are unlikely places but still.)

Also the PSCI spec explicitely mention CPU_OFF as a way to place CPUs in
a "known state" and doesn't offer any other examples. So although what
you wrote in the commit message is correct, using CPU_OFF seems to be
the less error prone way (in the sense of triggering firmware errors) to
place CPUs in a known state.

So I would simply remove the panic from call_psci_cpu_off, so that we
try CPU_OFF first, and if it doesn't work, we use the WFI loop as
backup. Also we get to fix all the other callers of stop_cpu this way.

Julien Grall June 24, 2022, 8:53 a.m. UTC | #2

Hi Stefano,

On 23/06/2022 23:07, Stefano Stabellini wrote:
> On Thu, 23 Jun 2022, dmitry.semenets@gmail.com wrote:
>> From: Dmytro Semenets <dmytro_semenets@epam.com>
> So wouldn't it be better to remove the panic from the implementation of
> call_psci_cpu_off?

I have asked to keep the panic() in call_psci_cpu_off(). If you remove 
the panic() then we will hide the fact that the CPU was not properly 
turned off and will consume more energy than expected.

The WFI loop is fine when shutting down or rebooting because we know 
this will only happen for a short period of time.

> 
> The reason I am saying this is that stop_cpu is called in a number of
> places beyond halt_this_cpu and as far as I can tell any of them could
> trigger the panic. (I admit they are unlikely places but still.)

This is one of the example where the CPU will not be stopped for a short 
period of time. We should deal with them differently (i.e. migrating the 
trusted OS or else) so we give all the chance for the CPU to be fully 
powered.

IMHO, this is a different issue and hence why I didn't ask Dmitry to 
solve it.

> 
> 
> Also the PSCI spec explicitely mention CPU_OFF as a way to place CPUs in
> a "known state" and doesn't offer any other examples. So although what
> you wrote in the commit message is correct, using CPU_OFF seems to be
> the less error prone way (in the sense of triggering firmware errors) to
> place CPUs in a known state.

The section you are referring to is starting with "One way". This imply 
there are others methods.

FWIW, the spin loop above seems to be how Linux is dealing with the 
shutdown/reboot.

> 
> So I would simply remove the panic from call_psci_cpu_off, so that we
> try CPU_OFF first, and if it doesn't work, we use the WFI loop as
> backup. Also we get to fix all the other callers of stop_cpu this way.
This reads strange. In the previous paragraph you suggested the CPU off 
is a less error prone way to place CPUs in a known state. But here, you 
are softening the stance and suggesting to fallback to the WFI loop.

So to me this indicates that WFI loop is fine. Otherwise, wouldn't the 
user risk to see firmware errors (which BTW, I don't understand what 
sort of firmware errors you are worried)? If yes, why would it be 
acceptable?

For instance, Dmitry situation, the CPU0 would always WFI loop...

Cheers,

Stefano Stabellini June 24, 2022, 9:31 p.m. UTC | #3

On Fri, 24 Jun 2022, Julien Grall wrote:
> On 23/06/2022 23:07, Stefano Stabellini wrote:
> > On Thu, 23 Jun 2022, dmitry.semenets@gmail.com wrote:
> > > From: Dmytro Semenets <dmytro_semenets@epam.com>
> > So wouldn't it be better to remove the panic from the implementation of
> > call_psci_cpu_off?
> 
> I have asked to keep the panic() in call_psci_cpu_off(). If you remove the
> panic() then we will hide the fact that the CPU was not properly turned off
> and will consume more energy than expected.
> 
> The WFI loop is fine when shutting down or rebooting because we know this will
> only happen for a short period of time.

Yeah, I don't think we should hide that CPU_OFF failed. I think we
should print a warning. However, given that we know CPU_OFF can
reasonably fail in normal conditions returning DENIED when a Trusted OS
is present, then I think we should not panic.

If there was a way to distinguish a failure because a Trusted OS is
present (the "normal" failure) from other failures, I would suggest to:
- print a warning if failed due to a Trusted OS being present
- panic in other cases

Unfortunately it looks like in all cases the return code is DENIED :-(


Given that, I would not panic and only print a warning in all cases. Or
we could ASSERT which at least goes away in !DEBUG builds.


> > The reason I am saying this is that stop_cpu is called in a number of
> > places beyond halt_this_cpu and as far as I can tell any of them could
> > trigger the panic. (I admit they are unlikely places but still.)
> 
> This is one of the example where the CPU will not be stopped for a short
> period of time. We should deal with them differently (i.e. migrating the
> trusted OS or else) so we give all the chance for the CPU to be fully powered.
> 
> IMHO, this is a different issue and hence why I didn't ask Dmitry to solve it.

I see your point now. I was seeing the two things as one.

I think it is true that the WFI loop is likely to work. Also it is true
that from a power perspective it makes no different on power down or
reboot.  From that point of view this patch is OK.

But even on shut-down/reboot, why not do that as a fallback in case
CPU_OFF didn't work? It is going to work most of the times anyway, why
change the default for the few cases that doesn't work?

Given that this patch would work, I don't want to insist on this and let
you decide.


But even if we don't want to remove the panic as part of this patch, I
think we should remove the panic in a separate patch anyway, at least
until someone investigates and thinks of a strategy how to migrate the
TrustedOS as you suggested.


 
 
> > Also the PSCI spec explicitely mention CPU_OFF as a way to place CPUs in
> > a "known state" and doesn't offer any other examples. So although what
> > you wrote in the commit message is correct, using CPU_OFF seems to be
> > the less error prone way (in the sense of triggering firmware errors) to
> > place CPUs in a known state.
> 
> The section you are referring to is starting with "One way". This imply there
> are others methods.
> 
> FWIW, the spin loop above seems to be how Linux is dealing with the
> shutdown/reboot.
> 
> > 
> > So I would simply remove the panic from call_psci_cpu_off, so that we
> > try CPU_OFF first, and if it doesn't work, we use the WFI loop as
> > backup. Also we get to fix all the other callers of stop_cpu this way.
> This reads strange. In the previous paragraph you suggested the CPU off is a
> less error prone way to place CPUs in a known state. But here, you are
> softening the stance and suggesting to fallback to the WFI loop.
> 
> So to me this indicates that WFI loop is fine. Otherwise, wouldn't the user
> risk to see firmware errors (which BTW, I don't understand what sort of
> firmware errors you are worried)? If yes, why would it be acceptable?
> 
> For instance, Dmitry situation, the CPU0 would always WFI loop...

Julien Grall June 25, 2022, 2:45 p.m. UTC | #4

Hi Stefano,

On 24/06/2022 22:31, Stefano Stabellini wrote:
> On Fri, 24 Jun 2022, Julien Grall wrote:
>> On 23/06/2022 23:07, Stefano Stabellini wrote:
>>> On Thu, 23 Jun 2022, dmitry.semenets@gmail.com wrote:
>>>> From: Dmytro Semenets <dmytro_semenets@epam.com>
>>> So wouldn't it be better to remove the panic from the implementation of
>>> call_psci_cpu_off?
>>
>> I have asked to keep the panic() in call_psci_cpu_off(). If you remove the
>> panic() then we will hide the fact that the CPU was not properly turned off
>> and will consume more energy than expected.
>>
>> The WFI loop is fine when shutting down or rebooting because we know this will
>> only happen for a short period of time.
> 
> Yeah, I don't think we should hide that CPU_OFF failed. I think we
> should print a warning. However, given that we know CPU_OFF can
> reasonably fail in normal conditions returning DENIED when a Trusted OS
> is present, then I think we should not panic.

We know how to detect this condition (see section 5.9 in DEN0022D.b). So 
I would argue we should fix it properly rather than removing the panic().

> 
> If there was a way to distinguish a failure because a Trusted OS is
> present (the "normal" failure) from other failures, I would suggest to:
> - print a warning if failed due to a Trusted OS being present
> - panic in other cases
> 
> Unfortunately it looks like in all cases the return code is DENIED :-(
I am confused. Per the spec, the only reason CPU_OFF can return DENIED 
is because the Trusted OS is resident. So what other cases are you 
talking about?

> 
> 
> Given that, I would not panic and only print a warning in all cases. Or
> we could ASSERT which at least goes away in !DEBUG builds.

ASSERT() is definitely not way to deal with external input. I could 
possibly accept a WARN(), but see above.

>>> The reason I am saying this is that stop_cpu is called in a number of
>>> places beyond halt_this_cpu and as far as I can tell any of them could
>>> trigger the panic. (I admit they are unlikely places but still.)
>>
>> This is one of the example where the CPU will not be stopped for a short
>> period of time. We should deal with them differently (i.e. migrating the
>> trusted OS or else) so we give all the chance for the CPU to be fully powered.
>>
>> IMHO, this is a different issue and hence why I didn't ask Dmitry to solve it.
> 
> I see your point now. I was seeing the two things as one.
> 
> I think it is true that the WFI loop is likely to work. Also it is true
> that from a power perspective it makes no different on power down or
> reboot.  From that point of view this patch is OK.
> 
> But even on shut-down/reboot, why not do that as a fallback in case
> CPU_OFF didn't work? It is going to work most of the times anyway, why
> change the default for the few cases that doesn't work?

Because we would not be consistent how we would turn off a CPU on a 
system supporting PSCI. I would prefer to use the same method everywhere 
so it is easier to reason about.

I am also not sure how you define "most of the time". Yes it is possible 
that the boards we aware of will not have this issue, but how about the 
one we don't know?

> 
> Given that this patch would work, I don't want to insist on this and let
> you decide.
> 
> 
> But even if we don't want to remove the panic as part of this patch, I
> think we should remove the panic in a separate patch anyway, at least
> until someone investigates and thinks of a strategy how to migrate the
> TrustedOS as you suggested.
If we accept this patch, then we remove the immediate pain. The other 
uses of stop_cpu() are in:
	1) idle_loop(), this is reachable when turning off a CPU after boot 
(not supported on Arm)
         2) start_secondary(), this is only used during boot (CPU 
hot-unplug is not supported)

Even if it would be possible to trigger the panic() in 2), I am not 
aware of an immediate issue there. So I think it would be the wrong 
approach to remove the panic() first and then investigate.

The advantage of the panic() is it will remind us that some needs to be 
fixed. With a warning (or WARN()) people will tend to ignore it.

Cheers,

Dmytro Semenets June 28, 2022, 9:54 p.m. UTC | #5

Hi Stefano and Julien,
What is the conclusion about this patch?

сб, 25 июн. 2022 г. в 17:45, Julien Grall <julien@xen.org>:
>
> Hi Stefano,
>
> On 24/06/2022 22:31, Stefano Stabellini wrote:
> > On Fri, 24 Jun 2022, Julien Grall wrote:
> >> On 23/06/2022 23:07, Stefano Stabellini wrote:
> >>> On Thu, 23 Jun 2022, dmitry.semenets@gmail.com wrote:
> >>>> From: Dmytro Semenets <dmytro_semenets@epam.com>
> >>> So wouldn't it be better to remove the panic from the implementation of
> >>> call_psci_cpu_off?
> >>
> >> I have asked to keep the panic() in call_psci_cpu_off(). If you remove the
> >> panic() then we will hide the fact that the CPU was not properly turned off
> >> and will consume more energy than expected.
> >>
> >> The WFI loop is fine when shutting down or rebooting because we know this will
> >> only happen for a short period of time.
> >
> > Yeah, I don't think we should hide that CPU_OFF failed. I think we
> > should print a warning. However, given that we know CPU_OFF can
> > reasonably fail in normal conditions returning DENIED when a Trusted OS
> > is present, then I think we should not panic.
>
> We know how to detect this condition (see section 5.9 in DEN0022D.b). So
> I would argue we should fix it properly rather than removing the panic().
>
> >
> > If there was a way to distinguish a failure because a Trusted OS is
> > present (the "normal" failure) from other failures, I would suggest to:
> > - print a warning if failed due to a Trusted OS being present
> > - panic in other cases
> >
> > Unfortunately it looks like in all cases the return code is DENIED :-(
> I am confused. Per the spec, the only reason CPU_OFF can return DENIED
> is because the Trusted OS is resident. So what other cases are you
> talking about?
>
> >
> >
> > Given that, I would not panic and only print a warning in all cases. Or
> > we could ASSERT which at least goes away in !DEBUG builds.
>
> ASSERT() is definitely not way to deal with external input. I could
> possibly accept a WARN(), but see above.
>
> >>> The reason I am saying this is that stop_cpu is called in a number of
> >>> places beyond halt_this_cpu and as far as I can tell any of them could
> >>> trigger the panic. (I admit they are unlikely places but still.)
> >>
> >> This is one of the example where the CPU will not be stopped for a short
> >> period of time. We should deal with them differently (i.e. migrating the
> >> trusted OS or else) so we give all the chance for the CPU to be fully powered.
> >>
> >> IMHO, this is a different issue and hence why I didn't ask Dmitry to solve it.
> >
> > I see your point now. I was seeing the two things as one.
> >
> > I think it is true that the WFI loop is likely to work. Also it is true
> > that from a power perspective it makes no different on power down or
> > reboot.  From that point of view this patch is OK.
> >
> > But even on shut-down/reboot, why not do that as a fallback in case
> > CPU_OFF didn't work? It is going to work most of the times anyway, why
> > change the default for the few cases that doesn't work?
>
> Because we would not be consistent how we would turn off a CPU on a
> system supporting PSCI. I would prefer to use the same method everywhere
> so it is easier to reason about.
>
> I am also not sure how you define "most of the time". Yes it is possible
> that the boards we aware of will not have this issue, but how about the
> one we don't know?
>
> >
> > Given that this patch would work, I don't want to insist on this and let
> > you decide.
> >
> >
> > But even if we don't want to remove the panic as part of this patch, I
> > think we should remove the panic in a separate patch anyway, at least
> > until someone investigates and thinks of a strategy how to migrate the
> > TrustedOS as you suggested.
> If we accept this patch, then we remove the immediate pain. The other
> uses of stop_cpu() are in:
>         1) idle_loop(), this is reachable when turning off a CPU after boot
> (not supported on Arm)
>          2) start_secondary(), this is only used during boot (CPU
> hot-unplug is not supported)
>
> Even if it would be possible to trigger the panic() in 2), I am not
> aware of an immediate issue there. So I think it would be the wrong
> approach to remove the panic() first and then investigate.
>
> The advantage of the panic() is it will remind us that some needs to be
> fixed. With a warning (or WARN()) people will tend to ignore it.
>
> Cheers,
>
> --
> Julien Grall

Stefano Stabellini June 28, 2022, 10:56 p.m. UTC | #6

On Sat, 25 Jun 2022, Julien Grall wrote:
> On 24/06/2022 22:31, Stefano Stabellini wrote:
> > On Fri, 24 Jun 2022, Julien Grall wrote:
> > > On 23/06/2022 23:07, Stefano Stabellini wrote:
> > > > On Thu, 23 Jun 2022, dmitry.semenets@gmail.com wrote:
> > > > > From: Dmytro Semenets <dmytro_semenets@epam.com>
> > > > So wouldn't it be better to remove the panic from the implementation of
> > > > call_psci_cpu_off?
> > > 
> > > I have asked to keep the panic() in call_psci_cpu_off(). If you remove the
> > > panic() then we will hide the fact that the CPU was not properly turned
> > > off
> > > and will consume more energy than expected.
> > > 
> > > The WFI loop is fine when shutting down or rebooting because we know this
> > > will
> > > only happen for a short period of time.
> > 
> > Yeah, I don't think we should hide that CPU_OFF failed. I think we
> > should print a warning. However, given that we know CPU_OFF can
> > reasonably fail in normal conditions returning DENIED when a Trusted OS
> > is present, then I think we should not panic.
> 
> We know how to detect this condition (see section 5.9 in DEN0022D.b). So I
> would argue we should fix it properly rather than removing the panic().
> 
> > 
> > If there was a way to distinguish a failure because a Trusted OS is
> > present (the "normal" failure) from other failures, I would suggest to:
> > - print a warning if failed due to a Trusted OS being present
> > - panic in other cases
> > 
> > Unfortunately it looks like in all cases the return code is DENIED :-(
> I am confused. Per the spec, the only reason CPU_OFF can return DENIED is
> because the Trusted OS is resident. So what other cases are you talking about?
> 
> > 
> > 
> > Given that, I would not panic and only print a warning in all cases. Or
> > we could ASSERT which at least goes away in !DEBUG builds.
> 
> ASSERT() is definitely not way to deal with external input. I could possibly
> accept a WARN(), but see above.
> 
> > > > The reason I am saying this is that stop_cpu is called in a number of
> > > > places beyond halt_this_cpu and as far as I can tell any of them could
> > > > trigger the panic. (I admit they are unlikely places but still.)
> > > 
> > > This is one of the example where the CPU will not be stopped for a short
> > > period of time. We should deal with them differently (i.e. migrating the
> > > trusted OS or else) so we give all the chance for the CPU to be fully
> > > powered.
> > > 
> > > IMHO, this is a different issue and hence why I didn't ask Dmitry to solve
> > > it.
> > 
> > I see your point now. I was seeing the two things as one.
> > 
> > I think it is true that the WFI loop is likely to work. Also it is true
> > that from a power perspective it makes no different on power down or
> > reboot.  From that point of view this patch is OK.
> > 
> > But even on shut-down/reboot, why not do that as a fallback in case
> > CPU_OFF didn't work? It is going to work most of the times anyway, why
> > change the default for the few cases that doesn't work?
> 
> Because we would not be consistent how we would turn off a CPU on a system
> supporting PSCI. I would prefer to use the same method everywhere so it is
> easier to reason about.
> 
> I am also not sure how you define "most of the time". Yes it is possible that
> the boards we aware of will not have this issue, but how about the one we
> don't know?
> 
> > 
> > Given that this patch would work, I don't want to insist on this and let
> > you decide.
> > 
> > 
> > But even if we don't want to remove the panic as part of this patch, I
> > think we should remove the panic in a separate patch anyway, at least
> > until someone investigates and thinks of a strategy how to migrate the
> > TrustedOS as you suggested.
> If we accept this patch, then we remove the immediate pain. The other uses of
> stop_cpu() are in:
> 	1) idle_loop(), this is reachable when turning off a CPU after boot
> (not supported on Arm)
>         2) start_secondary(), this is only used during boot (CPU hot-unplug is
> not supported)
> 
> Even if it would be possible to trigger the panic() in 2), I am not aware of
> an immediate issue there. So I think it would be the wrong approach to remove
> the panic() first and then investigate.
> 
> The advantage of the panic() is it will remind us that some needs to be fixed.
> With a warning (or WARN()) people will tend to ignore it.

I know that this specific code path (cpu off) is probably not super
relevant for what I am about to say, but as we move closer to safety
certifiability we need to get away from using "panic" and BUG_ON as a
reminder that more work is needed to have a fully correct implementation
of something.

I also see your point and agree that ASSERT is not acceptable for
external input but from my point of view panic is the same (slightly
worse because it doesn't go away in production builds).

The return value of CPU_OFF is "external input" but this patch would
make that problem go away for halt_this_cpu, and the other two call
sites are only relevant during boot.

So, although this is not my preference, I don't want to block this
patch. (I also think it is a lot better to move faster as a project
even with not-ideal implementations.)

Julien if you are going to ack the patch feel free to go ahead.

Stefano Stabellini June 28, 2022, 10:57 p.m. UTC | #7

I'll let Julien ack (and commit) the patch.


On Wed, 29 Jun 2022, Dmytro Semenets wrote:
> Hi Stefano and Julien,
> What is the conclusion about this patch?
> 
> сб, 25 июн. 2022 г. в 17:45, Julien Grall <julien@xen.org>:
> >
> > Hi Stefano,
> >
> > On 24/06/2022 22:31, Stefano Stabellini wrote:
> > > On Fri, 24 Jun 2022, Julien Grall wrote:
> > >> On 23/06/2022 23:07, Stefano Stabellini wrote:
> > >>> On Thu, 23 Jun 2022, dmitry.semenets@gmail.com wrote:
> > >>>> From: Dmytro Semenets <dmytro_semenets@epam.com>
> > >>> So wouldn't it be better to remove the panic from the implementation of
> > >>> call_psci_cpu_off?
> > >>
> > >> I have asked to keep the panic() in call_psci_cpu_off(). If you remove the
> > >> panic() then we will hide the fact that the CPU was not properly turned off
> > >> and will consume more energy than expected.
> > >>
> > >> The WFI loop is fine when shutting down or rebooting because we know this will
> > >> only happen for a short period of time.
> > >
> > > Yeah, I don't think we should hide that CPU_OFF failed. I think we
> > > should print a warning. However, given that we know CPU_OFF can
> > > reasonably fail in normal conditions returning DENIED when a Trusted OS
> > > is present, then I think we should not panic.
> >
> > We know how to detect this condition (see section 5.9 in DEN0022D.b). So
> > I would argue we should fix it properly rather than removing the panic().
> >
> > >
> > > If there was a way to distinguish a failure because a Trusted OS is
> > > present (the "normal" failure) from other failures, I would suggest to:
> > > - print a warning if failed due to a Trusted OS being present
> > > - panic in other cases
> > >
> > > Unfortunately it looks like in all cases the return code is DENIED :-(
> > I am confused. Per the spec, the only reason CPU_OFF can return DENIED
> > is because the Trusted OS is resident. So what other cases are you
> > talking about?
> >
> > >
> > >
> > > Given that, I would not panic and only print a warning in all cases. Or
> > > we could ASSERT which at least goes away in !DEBUG builds.
> >
> > ASSERT() is definitely not way to deal with external input. I could
> > possibly accept a WARN(), but see above.
> >
> > >>> The reason I am saying this is that stop_cpu is called in a number of
> > >>> places beyond halt_this_cpu and as far as I can tell any of them could
> > >>> trigger the panic. (I admit they are unlikely places but still.)
> > >>
> > >> This is one of the example where the CPU will not be stopped for a short
> > >> period of time. We should deal with them differently (i.e. migrating the
> > >> trusted OS or else) so we give all the chance for the CPU to be fully powered.
> > >>
> > >> IMHO, this is a different issue and hence why I didn't ask Dmitry to solve it.
> > >
> > > I see your point now. I was seeing the two things as one.
> > >
> > > I think it is true that the WFI loop is likely to work. Also it is true
> > > that from a power perspective it makes no different on power down or
> > > reboot.  From that point of view this patch is OK.
> > >
> > > But even on shut-down/reboot, why not do that as a fallback in case
> > > CPU_OFF didn't work? It is going to work most of the times anyway, why
> > > change the default for the few cases that doesn't work?
> >
> > Because we would not be consistent how we would turn off a CPU on a
> > system supporting PSCI. I would prefer to use the same method everywhere
> > so it is easier to reason about.
> >
> > I am also not sure how you define "most of the time". Yes it is possible
> > that the boards we aware of will not have this issue, but how about the
> > one we don't know?
> >
> > >
> > > Given that this patch would work, I don't want to insist on this and let
> > > you decide.
> > >
> > >
> > > But even if we don't want to remove the panic as part of this patch, I
> > > think we should remove the panic in a separate patch anyway, at least
> > > until someone investigates and thinks of a strategy how to migrate the
> > > TrustedOS as you suggested.
> > If we accept this patch, then we remove the immediate pain. The other
> > uses of stop_cpu() are in:
> >         1) idle_loop(), this is reachable when turning off a CPU after boot
> > (not supported on Arm)
> >          2) start_secondary(), this is only used during boot (CPU
> > hot-unplug is not supported)
> >
> > Even if it would be possible to trigger the panic() in 2), I am not
> > aware of an immediate issue there. So I think it would be the wrong
> > approach to remove the panic() first and then investigate.
> >
> > The advantage of the panic() is it will remind us that some needs to be
> > fixed. With a warning (or WARN()) people will tend to ignore it.
> >
> > Cheers,
> >
> > --
> > Julien Grall
>

Julien Grall June 29, 2022, 8:23 a.m. UTC | #8

Hi Stefano,

On 28/06/2022 23:56, Stefano Stabellini wrote:
>> The advantage of the panic() is it will remind us that some needs to be fixed.
>> With a warning (or WARN()) people will tend to ignore it.
> 
> I know that this specific code path (cpu off) is probably not super
> relevant for what I am about to say, but as we move closer to safety
> certifiability we need to get away from using "panic" and BUG_ON as a
> reminder that more work is needed to have a fully correct implementation
> of something.

I don't think we have many places at runtime using BUG_ON()/panic(). 
They are often used because we think Xen would not be able to recover if 
the condition is hit.

I am happy to remove them, but this should not be at the expense to 
introduce other potential weird bugs.

> 
> I also see your point and agree that ASSERT is not acceptable for
> external input but from my point of view panic is the same (slightly
> worse because it doesn't go away in production builds).

I think it depends on your target. Would you be happy if Xen continue to 
run with potentially a fatal flaw?

> 
> Julien if you are going to ack the patch feel free to go ahead.

I will do and commit it.

Cheers,

Stefano Stabellini June 29, 2022, 5:19 p.m. UTC | #9

On Wed, 29 Jun 2022, Julien Grall wrote:
> On 28/06/2022 23:56, Stefano Stabellini wrote:
> > > The advantage of the panic() is it will remind us that some needs to be
> > > fixed.
> > > With a warning (or WARN()) people will tend to ignore it.
> > 
> > I know that this specific code path (cpu off) is probably not super
> > relevant for what I am about to say, but as we move closer to safety
> > certifiability we need to get away from using "panic" and BUG_ON as a
> > reminder that more work is needed to have a fully correct implementation
> > of something.
> 
> I don't think we have many places at runtime using BUG_ON()/panic(). They are
> often used because we think Xen would not be able to recover if the condition
> is hit.
> 
> I am happy to remove them, but this should not be at the expense to introduce
> other potential weird bugs.
> 
> > 
> > I also see your point and agree that ASSERT is not acceptable for
> > external input but from my point of view panic is the same (slightly
> > worse because it doesn't go away in production builds).
> 
> I think it depends on your target. Would you be happy if Xen continue to run
> with potentially a fatal flaw?

Actually, this is an excellent question. I don't know what is the
expected behavior from a safety perspective in case of serious errors.
How the error should be reported and whether continuing or not is
recommended. I'll try to find out more information.

Bertrand Marquis June 30, 2022, 7:16 a.m. UTC | #10

Hi,

> On 29 Jun 2022, at 18:19, Stefano Stabellini <sstabellini@kernel.org> wrote:
> 
> On Wed, 29 Jun 2022, Julien Grall wrote:
>> On 28/06/2022 23:56, Stefano Stabellini wrote:
>>>> The advantage of the panic() is it will remind us that some needs to be
>>>> fixed.
>>>> With a warning (or WARN()) people will tend to ignore it.
>>> 
>>> I know that this specific code path (cpu off) is probably not super
>>> relevant for what I am about to say, but as we move closer to safety
>>> certifiability we need to get away from using "panic" and BUG_ON as a
>>> reminder that more work is needed to have a fully correct implementation
>>> of something.
>> 
>> I don't think we have many places at runtime using BUG_ON()/panic(). They are
>> often used because we think Xen would not be able to recover if the condition
>> is hit.
>> 
>> I am happy to remove them, but this should not be at the expense to introduce
>> other potential weird bugs.
>> 
>>> 
>>> I also see your point and agree that ASSERT is not acceptable for
>>> external input but from my point of view panic is the same (slightly
>>> worse because it doesn't go away in production builds).
>> 
>> I think it depends on your target. Would you be happy if Xen continue to run
>> with potentially a fatal flaw?
> 
> Actually, this is an excellent question. I don't know what is the
> expected behavior from a safety perspective in case of serious errors.
> How the error should be reported and whether continuing or not is
> recommended. I'll try to find out more information.

I think there are 2 answers to this:
- as much as possible: those case must be avoided and it must be demonstrated that they are impossible and hence removed or turn the system in a failsafe mode so that actions can be handle (usually reboot after saving some data)
- in some cases this can be robustness code (more for security)

I think in our case that if we know that we are ending in a case where the system is unstable we should:
- stop the guest responsible for this (if a guest is the origin) or return an error to the guest and cancel the operation if suitable
- panic if this is internal or dom0

A warning informing that something not supported was done and ending in an unexpected behaviour is for sure not acceptable.

Cheers
Bertrand

Julien Grall June 30, 2022, 6:41 p.m. UTC | #11

Hi Dmitry,

On 23/06/2022 08:44, dmitry.semenets@gmail.com wrote:
> From: Dmytro Semenets <dmytro_semenets@epam.com>
> 
> When shutting down (or rebooting) the platform, Xen will call stop_cpu()
> on all the CPUs but one. The last CPU will then request the system to
> shutdown/restart.
> 
> On platform using PSCI, stop_cpu() will call PSCI CPU off. Per the spec
> (section 5.5.2 DEN0022D.b), the call could return DENIED if the Trusted
> OS is resident on the CPU that is about to be turned off.
> 
> As Xen doesn't migrate off the trusted OS (which BTW may not be
> migratable), it would be possible to hit the panic().
> 
> In the ideal situation, Xen should migrate the trusted OS or make sure
> the CPU off is not called. However, when shutting down (or rebooting)
> the platform, it is pointless to try to turn off all the CPUs (per
> section 5.10.2, it is only required to put the core in a known state).
> 
> So solve the problem by open-coding stop_cpu() in halt_this_cpu() and
> not call PSCI CPU off.
> 
> Signed-off-by: Dmytro Semenets <dmytro_semenets@epam.com>

Acked-by: Julien Grall <jgrall@amazon.com>

And committed.

Cheers,

Stefano Stabellini June 30, 2022, 9:14 p.m. UTC | #12

On Thu, 30 Jun 2022, Bertrand Marquis wrote:
> > On 29 Jun 2022, at 18:19, Stefano Stabellini <sstabellini@kernel.org> wrote:
> > On Wed, 29 Jun 2022, Julien Grall wrote:
> >> On 28/06/2022 23:56, Stefano Stabellini wrote:
> >>>> The advantage of the panic() is it will remind us that some needs to be
> >>>> fixed.
> >>>> With a warning (or WARN()) people will tend to ignore it.
> >>> 
> >>> I know that this specific code path (cpu off) is probably not super
> >>> relevant for what I am about to say, but as we move closer to safety
> >>> certifiability we need to get away from using "panic" and BUG_ON as a
> >>> reminder that more work is needed to have a fully correct implementation
> >>> of something.
> >> 
> >> I don't think we have many places at runtime using BUG_ON()/panic(). They are
> >> often used because we think Xen would not be able to recover if the condition
> >> is hit.
> >> 
> >> I am happy to remove them, but this should not be at the expense to introduce
> >> other potential weird bugs.
> >> 
> >>> 
> >>> I also see your point and agree that ASSERT is not acceptable for
> >>> external input but from my point of view panic is the same (slightly
> >>> worse because it doesn't go away in production builds).
> >> 
> >> I think it depends on your target. Would you be happy if Xen continue to run
> >> with potentially a fatal flaw?
> > 
> > Actually, this is an excellent question. I don't know what is the
> > expected behavior from a safety perspective in case of serious errors.
> > How the error should be reported and whether continuing or not is
> > recommended. I'll try to find out more information.
> 
> I think there are 2 answers to this:
> - as much as possible: those case must be avoided and it must be demonstrated that they are impossible and hence removed or turn the system in a failsafe mode so that actions can be handle (usually reboot after saving some data)
> - in some cases this can be robustness code (more for security)
> 
> I think in our case that if we know that we are ending in a case where the system is unstable we should:
> - stop the guest responsible for this (if a guest is the origin) or return an error to the guest and cancel the operation if suitable
> - panic if this is internal or dom0
> 
> A warning informing that something not supported was done and ending in an unexpected behaviour is for sure not acceptable.


Let's say that we demonstrate that a problematic case is impossible, can
we still have a panic in the code? For instance:

ret = firmware_call();
if (ret)
    panic();

We know ret is always zero unless firmware is buggy or not
spec-compliant. Can the panic() still be present?


And/or do we need to replace all instances of "panic" with going into
"failsafe mode", which saves state and reboots so it is not so
dissimilar from panic actually?


In case of guest-initiated unexpected errors we already try to crash the
guest responsible and not crash the entire system because it is also a
matter of security (possible DOS). That is clear.

So it is other kind of unexpected errors, mostly due to hardware or
firmware unexpected behavior or Xen finding itself in state of a state
machine that should be impossible. Those are the ones we don't have a
clear way to proceed.

xen: arm: Don't use stop_cpu() in halt_this_cpu()

Commit Message

Comments

Patch