mbox series

[0/6] KVM: arm64: Implement PSCI SYSTEM_SUSPEND support

Message ID 20210819223640.3564975-1-oupton@google.com (mailing list archive)
Headers show
Series KVM: arm64: Implement PSCI SYSTEM_SUSPEND support | expand

Message

Oliver Upton Aug. 19, 2021, 10:36 p.m. UTC
Certain VMMs/operators may wish to give their guests the ability to
initiate a system suspend that could result in the VM being saved to
persistent storage to be resumed at a later time. The PSCI v1.0
specification describes an SMC, SYSTEM_SUSPEND, that allows a kernel to
request a system suspend. This call is optional for v1.0, and KVM
elected to not support the call in its v1.0 implementation.

This series adds support for the SYSTEM_SUSPEND PSCI call to KVM/arm64.
Since this is a system-scoped event, KVM cannot quiesce the VM on its
own. We add a new system exit type in this series to clue in userspace
that a suspend was requested. Per the KVM_EXIT_SYSTEM_EVENT ABI, a VMM
that doesn't care about this event can simply resume the guest without
issue (we set up the calling vCPU to come out of reset correctly on next
KVM_RUN).

Patch 1 is unrelated, and is a fix for "KVM: arm64: Enforce reserved
bits for PSCI target affinities" on the kvmarm/next branch. Nothing
particularly hairy, just an unused param.

Patch 2 simplifies the function to check if KVM allows a particular PSCI
function. We can generally disallow any PSCI function that sets the
SMC64 bit in the PSCI function ID.

Patch 3 wraps up the PSCI reset logic used for CPU_ON, which will be
needed later to queue up a reset on the vCPU that requested the system
suspend.

Patch 4 brings in the new UAPI and PSCI call, guarded behind a VM
capability for backwards compatibility.

Patch 5 is indirectly related to this series, and avoids compiler
reordering on PSCI calls in the selftest introduced by "selftests: KVM:
Introduce psci_cpu_on_test".

Finally, patch 6 extends the PSCI selftest to verify the
SYSTEM_SUSPEND PSCI call behaves as intended.

These patches apply cleanly to kvmarm/next at the following commit:

f2267b87ecd5 ("Merge branch kvm-arm64/misc-5.15 into kvmarm-master/next")

The series is intentionally based on kvmarm/next for the sake of fixing
patches only present there in [1/6] and [5/6]. Tested on QEMU (ick)
since my Mt. Jade box is out to lunch at the moment and for some unknown
reason the toolchain on my work computer doesn't play nice with the FVP.

Oliver Upton (6):
  KVM: arm64: Drop unused vcpu param to kvm_psci_valid_affinity()
  KVM: arm64: Clean up SMC64 PSCI filtering for AArch32 guests
  KVM: arm64: Encapsulate reset request logic in a helper function
  KVM: arm64: Add support for SYSTEM_SUSPEND PSCI call
  selftests: KVM: Promote PSCI hypercalls to asm volatile
  selftests: KVM: Test SYSTEM_SUSPEND PSCI call

 arch/arm64/include/asm/kvm_host.h             |   3 +
 arch/arm64/kvm/arm.c                          |   5 +
 arch/arm64/kvm/psci.c                         | 134 +++++++++++++-----
 include/uapi/linux/kvm.h                      |   2 +
 .../selftests/kvm/aarch64/psci_cpu_on_test.c  | 126 +++++++++++-----
 5 files changed, 202 insertions(+), 68 deletions(-)

Comments

Oliver Upton Aug. 22, 2021, 7:56 p.m. UTC | #1
Marc,

On Thu, Aug 19, 2021 at 3:36 PM Oliver Upton <oupton@google.com> wrote:
>
> Certain VMMs/operators may wish to give their guests the ability to
> initiate a system suspend that could result in the VM being saved to
> persistent storage to be resumed at a later time. The PSCI v1.0
> specification describes an SMC, SYSTEM_SUSPEND, that allows a kernel to
> request a system suspend. This call is optional for v1.0, and KVM
> elected to not support the call in its v1.0 implementation.
>
> This series adds support for the SYSTEM_SUSPEND PSCI call to KVM/arm64.
> Since this is a system-scoped event, KVM cannot quiesce the VM on its
> own. We add a new system exit type in this series to clue in userspace
> that a suspend was requested. Per the KVM_EXIT_SYSTEM_EVENT ABI, a VMM
> that doesn't care about this event can simply resume the guest without
> issue (we set up the calling vCPU to come out of reset correctly on next
> KVM_RUN).
>
> Patch 1 is unrelated, and is a fix for "KVM: arm64: Enforce reserved
> bits for PSCI target affinities" on the kvmarm/next branch. Nothing
> particularly hairy, just an unused param.

Title line may not have been clear on this series, Patch 1 is a fix
for the PSCI CPU_ON series that's in kvmarm/next to suppress a
compiler warning.

> Patch 2 simplifies the function to check if KVM allows a particular PSCI
> function. We can generally disallow any PSCI function that sets the
> SMC64 bit in the PSCI function ID.
>
> Patch 3 wraps up the PSCI reset logic used for CPU_ON, which will be
> needed later to queue up a reset on the vCPU that requested the system
> suspend.
>
> Patch 4 brings in the new UAPI and PSCI call, guarded behind a VM
> capability for backwards compatibility.
>
> Patch 5 is indirectly related to this series, and avoids compiler
> reordering on PSCI calls in the selftest introduced by "selftests: KVM:
> Introduce psci_cpu_on_test".

This too is a fix for the PSCI CPU_ON series. Just wanted to raise it
to your attention beyond the new feature I'm working on here.

--
Thanks,
Oliver

> Finally, patch 6 extends the PSCI selftest to verify the
> SYSTEM_SUSPEND PSCI call behaves as intended.
>
> These patches apply cleanly to kvmarm/next at the following commit:
>
> f2267b87ecd5 ("Merge branch kvm-arm64/misc-5.15 into kvmarm-master/next")
>
> The series is intentionally based on kvmarm/next for the sake of fixing
> patches only present there in [1/6] and [5/6]. Tested on QEMU (ick)
> since my Mt. Jade box is out to lunch at the moment and for some unknown
> reason the toolchain on my work computer doesn't play nice with the FVP.
>
> Oliver Upton (6):
>   KVM: arm64: Drop unused vcpu param to kvm_psci_valid_affinity()
>   KVM: arm64: Clean up SMC64 PSCI filtering for AArch32 guests
>   KVM: arm64: Encapsulate reset request logic in a helper function
>   KVM: arm64: Add support for SYSTEM_SUSPEND PSCI call
>   selftests: KVM: Promote PSCI hypercalls to asm volatile
>   selftests: KVM: Test SYSTEM_SUSPEND PSCI call
>
>  arch/arm64/include/asm/kvm_host.h             |   3 +
>  arch/arm64/kvm/arm.c                          |   5 +
>  arch/arm64/kvm/psci.c                         | 134 +++++++++++++-----
>  include/uapi/linux/kvm.h                      |   2 +
>  .../selftests/kvm/aarch64/psci_cpu_on_test.c  | 126 +++++++++++-----
>  5 files changed, 202 insertions(+), 68 deletions(-)
>
> --
> 2.33.0.rc2.250.ged5fa647cd-goog
>
Marc Zyngier Aug. 26, 2021, 10:51 a.m. UTC | #2
On Sun, 22 Aug 2021 20:56:13 +0100,
Oliver Upton <oupton@google.com> wrote:
> 
> Marc,
> 
> On Thu, Aug 19, 2021 at 3:36 PM Oliver Upton <oupton@google.com> wrote:
> >
> > Certain VMMs/operators may wish to give their guests the ability to
> > initiate a system suspend that could result in the VM being saved to
> > persistent storage to be resumed at a later time. The PSCI v1.0
> > specification describes an SMC, SYSTEM_SUSPEND, that allows a kernel to
> > request a system suspend. This call is optional for v1.0, and KVM
> > elected to not support the call in its v1.0 implementation.
> >
> > This series adds support for the SYSTEM_SUSPEND PSCI call to KVM/arm64.
> > Since this is a system-scoped event, KVM cannot quiesce the VM on its
> > own. We add a new system exit type in this series to clue in userspace
> > that a suspend was requested. Per the KVM_EXIT_SYSTEM_EVENT ABI, a VMM
> > that doesn't care about this event can simply resume the guest without
> > issue (we set up the calling vCPU to come out of reset correctly on next
> > KVM_RUN).
> >
> > Patch 1 is unrelated, and is a fix for "KVM: arm64: Enforce reserved
> > bits for PSCI target affinities" on the kvmarm/next branch. Nothing
> > particularly hairy, just an unused param.
> 
> Title line may not have been clear on this series, Patch 1 is a fix
> for the PSCI CPU_ON series that's in kvmarm/next to suppress a
> compiler warning.

I'm not getting this warning. What are you compiling with? In general,
the compiler should shout about unused function parameters.

> 
> > Patch 2 simplifies the function to check if KVM allows a particular PSCI
> > function. We can generally disallow any PSCI function that sets the
> > SMC64 bit in the PSCI function ID.
> >
> > Patch 3 wraps up the PSCI reset logic used for CPU_ON, which will be
> > needed later to queue up a reset on the vCPU that requested the system
> > suspend.
> >
> > Patch 4 brings in the new UAPI and PSCI call, guarded behind a VM
> > capability for backwards compatibility.
> >
> > Patch 5 is indirectly related to this series, and avoids compiler
> > reordering on PSCI calls in the selftest introduced by "selftests: KVM:
> > Introduce psci_cpu_on_test".
> 
> This too is a fix for the PSCI CPU_ON series. Just wanted to raise it
> to your attention beyond the new feature I'm working on here.

I'm not sure this actually need fixing. The dependencies on the input
and output will effectively prevent such reordering. That will
definitely be a good cleanup though, but maybe not worth taking out of
this series.

Thanks,

	M. (trying to find excuses to close the tree quickly).
Oliver Upton Aug. 26, 2021, 6:37 p.m. UTC | #3
On Thu, Aug 26, 2021 at 11:51:00AM +0100, Marc Zyngier wrote:
> > > Patch 1 is unrelated, and is a fix for "KVM: arm64: Enforce reserved
> > > bits for PSCI target affinities" on the kvmarm/next branch. Nothing
> > > particularly hairy, just an unused param.
> > 
> > Title line may not have been clear on this series, Patch 1 is a fix
> > for the PSCI CPU_ON series that's in kvmarm/next to suppress a
> > compiler warning.
> 
> I'm not getting this warning. What are you compiling with? In general,
> the compiler should shout about unused function parameters.

Gah, this is just with local tooling. I'm unable to repro using
GCC/Clang. I see that '-Wno-unused-parameter' is set alongside
'-Wunused' when W=1.

> > > Patch 5 is indirectly related to this series, and avoids compiler
> > > reordering on PSCI calls in the selftest introduced by "selftests: KVM:
> > > Introduce psci_cpu_on_test".
> > 
> > This too is a fix for the PSCI CPU_ON series. Just wanted to raise it
> > to your attention beyond the new feature I'm working on here.
> 
> I'm not sure this actually need fixing. The dependencies on the input
> and output will effectively prevent such reordering. That will
> definitely be a good cleanup though, but maybe not worth taking out of
> this series.

Yep, you're right. There's an obvious dependency in the test that
maintains program order. I realize that it is only the second test
(patch 6 in this series) where things get hairy.

Apologies for the noise.

--
Thanks,
Oliver
Marc Zyngier Sept. 6, 2021, 9:12 a.m. UTC | #4
Hi Oliver,

On Thu, 19 Aug 2021 23:36:34 +0100,
Oliver Upton <oupton@google.com> wrote:
> 
> Certain VMMs/operators may wish to give their guests the ability to
> initiate a system suspend that could result in the VM being saved to
> persistent storage to be resumed at a later time. The PSCI v1.0
> specification describes an SMC, SYSTEM_SUSPEND, that allows a kernel to
> request a system suspend. This call is optional for v1.0, and KVM
> elected to not support the call in its v1.0 implementation.
> 
> This series adds support for the SYSTEM_SUSPEND PSCI call to KVM/arm64.
> Since this is a system-scoped event, KVM cannot quiesce the VM on its
> own. We add a new system exit type in this series to clue in userspace
> that a suspend was requested. Per the KVM_EXIT_SYSTEM_EVENT ABI, a VMM
> that doesn't care about this event can simply resume the guest without
> issue (we set up the calling vCPU to come out of reset correctly on next
> KVM_RUN).

More idle thoughts on this:

Although the definition of SYSTEM_SUSPEND is very simple from a PSCI
perspective, I don't think it is that simple at the system level,
because PSCI is only concerned with the CPU.

For example, what is a wake-up event? My first approach would be to
consider interrupts to be such events. However, this approach suffers
from at least two issues:

- How do you define which interrupts are actual wake-up events?
  Nothing in the GIC architecture defines what a wake-up is (let alone
  a wake-up event).

- Assuming you have a way to express the above, how do you handle
  wake-ups from interrupts that have their source in the kernel (such
  as timers, irqfd sources)? How do you cope with directly injected
  interrupts?

It looks to me that your implementation can only work with userspace
provided events, which is pretty limited.

Other items worth considering: ongoing DMA, state of the caches at
suspend time, device state in general All of this really needs to be
defined before we can move forward with this feature.

Thanks,

	M.
Oliver Upton Sept. 7, 2021, 4:30 p.m. UTC | #5
On Mon, Sep 6, 2021 at 4:12 AM Marc Zyngier <maz@kernel.org> wrote:
>
> Hi Oliver,
>
> On Thu, 19 Aug 2021 23:36:34 +0100,
> Oliver Upton <oupton@google.com> wrote:
> >
> > Certain VMMs/operators may wish to give their guests the ability to
> > initiate a system suspend that could result in the VM being saved to
> > persistent storage to be resumed at a later time. The PSCI v1.0
> > specification describes an SMC, SYSTEM_SUSPEND, that allows a kernel to
> > request a system suspend. This call is optional for v1.0, and KVM
> > elected to not support the call in its v1.0 implementation.
> >
> > This series adds support for the SYSTEM_SUSPEND PSCI call to KVM/arm64.
> > Since this is a system-scoped event, KVM cannot quiesce the VM on its
> > own. We add a new system exit type in this series to clue in userspace
> > that a suspend was requested. Per the KVM_EXIT_SYSTEM_EVENT ABI, a VMM
> > that doesn't care about this event can simply resume the guest without
> > issue (we set up the calling vCPU to come out of reset correctly on next
> > KVM_RUN).
>
> More idle thoughts on this:
>
> Although the definition of SYSTEM_SUSPEND is very simple from a PSCI
> perspective, I don't think it is that simple at the system level,
> because PSCI is only concerned with the CPU.
>
> For example, what is a wake-up event? My first approach would be to
> consider interrupts to be such events. However, this approach suffers
> from at least two issues:
>
> - How do you define which interrupts are actual wake-up events?
>   Nothing in the GIC architecture defines what a wake-up is (let alone
>   a wake-up event).

Good point.

One possible implementation of suspend could just be a `WFI` in a
higher EL. In this case, KVM could emulate WFI wake up events
according to D1.16.2 in DDI 0487G.a. But I agree, it isn't entirely
clear what constitutes a wakeup from powered down state.

> - Assuming you have a way to express the above, how do you handle
>   wake-ups from interrupts that have their source in the kernel (such
>   as timers, irqfd sources)?

I think this could be handled, so long as we allow userspace to
indicate it has woken a vCPU. Depending on this, in the next KVM_RUN
we'd say:

- Some IMP DEF event occurred; I'm waking this CPU now
- I've either chosen to ignore the guest or will defer to KVM's
suspend implementation

> How do you cope with directly injected interrupts?

No expert on this, I'll need to do a bit more reading to give a good
answer here.

> It looks to me that your implementation can only work with userspace
> provided events, which is pretty limited.

Right. I implemented this from the mindset that userspace may do
something heavyweight when a guest suspends, like save it to a
persistent store to resume later on. No matter what we do in KVM, I
think it's probably best to give userspace the right of first refusal
to handle the suspension.

> Other items worth considering: ongoing DMA, state of the caches at
> suspend time, device state in general All of this really needs to be
> defined before we can move forward with this feature.

I believe it is largely up to the caller to get devices in a quiesced
state appropriate for a system suspend, but PSCI is delightfully vague
on this topic. On the contrary, it is up to KVM's implementation to
guarantee caches are clean when servicing the guest request.

I'll crank on this a bit more and see if I have more thoughts.

--
Thanks,
Oliver
Marc Zyngier Sept. 7, 2021, 5:43 p.m. UTC | #6
On Tue, 07 Sep 2021 17:30:33 +0100,
Oliver Upton <oupton@google.com> wrote:
> 
> On Mon, Sep 6, 2021 at 4:12 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > Hi Oliver,
> >
> > On Thu, 19 Aug 2021 23:36:34 +0100,
> > Oliver Upton <oupton@google.com> wrote:
> > >
> > > Certain VMMs/operators may wish to give their guests the ability to
> > > initiate a system suspend that could result in the VM being saved to
> > > persistent storage to be resumed at a later time. The PSCI v1.0
> > > specification describes an SMC, SYSTEM_SUSPEND, that allows a kernel to
> > > request a system suspend. This call is optional for v1.0, and KVM
> > > elected to not support the call in its v1.0 implementation.
> > >
> > > This series adds support for the SYSTEM_SUSPEND PSCI call to KVM/arm64.
> > > Since this is a system-scoped event, KVM cannot quiesce the VM on its
> > > own. We add a new system exit type in this series to clue in userspace
> > > that a suspend was requested. Per the KVM_EXIT_SYSTEM_EVENT ABI, a VMM
> > > that doesn't care about this event can simply resume the guest without
> > > issue (we set up the calling vCPU to come out of reset correctly on next
> > > KVM_RUN).
> >
> > More idle thoughts on this:
> >
> > Although the definition of SYSTEM_SUSPEND is very simple from a PSCI
> > perspective, I don't think it is that simple at the system level,
> > because PSCI is only concerned with the CPU.
> >
> > For example, what is a wake-up event? My first approach would be to
> > consider interrupts to be such events. However, this approach suffers
> > from at least two issues:
> >
> > - How do you define which interrupts are actual wake-up events?
> >   Nothing in the GIC architecture defines what a wake-up is (let alone
> >   a wake-up event).
> 
> Good point.
> 
> One possible implementation of suspend could just be a `WFI` in a
> higher EL. In this case, KVM could emulate WFI wake up events
> according to D1.16.2 in DDI 0487G.a. But I agree, it isn't entirely
> clear what constitutes a wakeup from powered down state.

It isn't, and it is actually IMPDEF (there isn't much in the ARM ARM
in terms of what constitutes a low power state). And even if you
wanted to emulate a WFI in userspace, the problem of interrupts that
have their source in the kernel remains. How to you tell userspace
that such an event has occurred if the vcpu thread isn't in the
kernel?

> 
> > - Assuming you have a way to express the above, how do you handle
> >   wake-ups from interrupts that have their source in the kernel (such
> >   as timers, irqfd sources)?
> 
> I think this could be handled, so long as we allow userspace to
> indicate it has woken a vCPU. Depending on this, in the next KVM_RUN
> we'd say:
> 
> - Some IMP DEF event occurred; I'm waking this CPU now

I'm seeing the problem from the other side. The vcpu has exited to
userspace on a SUSPEND event. How is the kernel supposed to tell
userspace that there is a pending interrupt? To do that, you'd have to
keep the vcpu in the kernel on SUSPEND. Which is *exactly* WFI.

> - I've either chosen to ignore the guest or will defer to KVM's
> suspend implementation
> 
> > How do you cope with directly injected interrupts?
> 
> No expert on this, I'll need to do a bit more reading to give a good
> answer here.
> 
> > It looks to me that your implementation can only work with userspace
> > provided events, which is pretty limited.
> 
> Right. I implemented this from the mindset that userspace may do
> something heavyweight when a guest suspends, like save it to a
> persistent store to resume later on. No matter what we do in KVM, I
> think it's probably best to give userspace the right of first refusal
> to handle the suspension.

Maybe. But if you want to handle wake-up from interrupts to actually
work, you must return to the kernel for the wake-up to occurs.

The problem is that you piggyback on an existing feature (suspend) to
implement something else (opportunistic save/restore?). Oddly enough
the stars don't exactly align! ;-)

I have the feeling that a solution to this problem would be to exit to
userspace with something indicating an *intent* to suspend. At this
stage, userspace can do two things:

- resume the guest: the guest may have been moved to some other
  machine, cold storage, whatever... The important thing is that the
  guest is directly runnable without any extra event

- confirm the suspension by returning to the kernel, which will
  execute a blocking WFI on behalf of the guest

With this, you end-up with something that is works from an interrupt
perspective (even for directly injected interrupts), and you can save
your guest on suspend.

>
> > Other items worth considering: ongoing DMA, state of the caches at
> > suspend time, device state in general All of this really needs to be
> > defined before we can move forward with this feature.
> 
> I believe it is largely up to the caller to get devices in a quiesced
> state appropriate for a system suspend, but PSCI is delightfully vague
> on this topic.

Indeed, it only deals with the CPU. Oh look, another opportunity to
write a new spec! :)

> On the contrary, it is up to KVM's implementation to
> guarantee caches are clean when servicing the guest request.

This last point is pretty unclear to me. If the guest doesn't clean to
the PoC (or even to one of the PoPs) when it calls into suspend,
that's a clear indication that it doesn't care about its data. Why
should KVM be more conservative here? It shouldn't be in the business
of working around guest bugs.

Thanks,

	M.
Oliver Upton Sept. 7, 2021, 6:14 p.m. UTC | #7
On Tue, Sep 7, 2021 at 12:43 PM Marc Zyngier <maz@kernel.org> wrote:
> > > Although the definition of SYSTEM_SUSPEND is very simple from a PSCI
> > > perspective, I don't think it is that simple at the system level,
> > > because PSCI is only concerned with the CPU.
> > >
> > > For example, what is a wake-up event? My first approach would be to
> > > consider interrupts to be such events. However, this approach suffers
> > > from at least two issues:
> > >
> > > - How do you define which interrupts are actual wake-up events?
> > >   Nothing in the GIC architecture defines what a wake-up is (let alone
> > >   a wake-up event).
> >
> > Good point.
> >
> > One possible implementation of suspend could just be a `WFI` in a
> > higher EL. In this case, KVM could emulate WFI wake up events
> > according to D1.16.2 in DDI 0487G.a. But I agree, it isn't entirely
> > clear what constitutes a wakeup from powered down state.
>
> It isn't, and it is actually IMPDEF (there isn't much in the ARM ARM
> in terms of what constitutes a low power state). And even if you
> wanted to emulate a WFI in userspace, the problem of interrupts that
> have their source in the kernel remains. How to you tell userspace
> that such an event has occurred if the vcpu thread isn't in the
> kernel?

Well, are there any objections to saying for the KVM implementation we
observe the WFI wake-up events per the cited section of the ARM ARM?

> > > It looks to me that your implementation can only work with userspace
> > > provided events, which is pretty limited.
> >
> > Right. I implemented this from the mindset that userspace may do
> > something heavyweight when a guest suspends, like save it to a
> > persistent store to resume later on. No matter what we do in KVM, I
> > think it's probably best to give userspace the right of first refusal
> > to handle the suspension.
>
> Maybe. But if you want to handle wake-up from interrupts to actually
> work, you must return to the kernel for the wake-up to occurs.
>
> The problem is that you piggyback on an existing feature (suspend) to
> implement something else (opportunistic save/restore?). Oddly enough
> the stars don't exactly align! ;-)
>
> I have the feeling that a solution to this problem would be to exit to
> userspace with something indicating an *intent* to suspend. At this
> stage, userspace can do two things:
>
> - resume the guest: the guest may have been moved to some other
>   machine, cold storage, whatever... The important thing is that the
>   guest is directly runnable without any extra event
>
> - confirm the suspension by returning to the kernel, which will
>   execute a blocking WFI on behalf of the guest
>
> With this, you end-up with something that is works from an interrupt
> perspective (even for directly injected interrupts), and you can save
> your guest on suspend.

This is exactly what I was trying to get at with my last mail,
although not quite as eloquently stated. So I completely agree.

Just to check understanding for v2:

We agree that an exit to userspace is fine so it has the opportunity
to do something crazy when the guest attempts a suspend. If a VMM does
nothing and immediately re-enters the kernel, we emulate the suspend
there by waiting for some event to fire, which for our purposes we
will say is an interrupt originating from userspace or the kernel
(WFI). In all, the SUSPEND exit type does not indicate that emulation
terminates with the VMM. It only indicates we are about to block in
the kernel.

If there is some IMPDEF event specific to the VMM, it should signal
the vCPU thread to kick it out of the kernel, make it runnable, and
re-enter. No need to do anything special from the kernel perspective
for this. This is only for the case where we decide to block in the
kernel.

>
> >
> > > Other items worth considering: ongoing DMA, state of the caches at
> > > suspend time, device state in general All of this really needs to be
> > > defined before we can move forward with this feature.
> >
> > I believe it is largely up to the caller to get devices in a quiesced
> > state appropriate for a system suspend, but PSCI is delightfully vague
> > on this topic.
>
> Indeed, it only deals with the CPU. Oh look, another opportunity to
> write a new spec! :)
>
> > On the contrary, it is up to KVM's implementation to
> > guarantee caches are clean when servicing the guest request.
>
> This last point is pretty unclear to me. If the guest doesn't clean to
> the PoC (or even to one of the PoPs) when it calls into suspend,
> that's a clear indication that it doesn't care about its data. Why
> should KVM be more conservative here? It shouldn't be in the business
> of working around guest bugs.

PSCI is vague on this, sadly. DEN0022D.b, 5.4.8 "Implementation
responsibilities: Cache and coherency management states" that for
CPU_SUSPEND, the PSCI implementation must perform a cache clean
operation before entering the powerdown state. I don't see any reason
why SYSTEM_SUSPEND should be excluded from this requirement.

--
Thanks,
Oliver
Marc Zyngier Sept. 21, 2021, 9:45 a.m. UTC | #8
Hi Oliver,

On Tue, 07 Sep 2021 19:14:00 +0100,
Oliver Upton <oupton@google.com> wrote:
> 
> On Tue, Sep 7, 2021 at 12:43 PM Marc Zyngier <maz@kernel.org> wrote:
> > > > Although the definition of SYSTEM_SUSPEND is very simple from a PSCI
> > > > perspective, I don't think it is that simple at the system level,
> > > > because PSCI is only concerned with the CPU.
> > > >
> > > > For example, what is a wake-up event? My first approach would be to
> > > > consider interrupts to be such events. However, this approach suffers
> > > > from at least two issues:
> > > >
> > > > - How do you define which interrupts are actual wake-up events?
> > > >   Nothing in the GIC architecture defines what a wake-up is (let alone
> > > >   a wake-up event).
> > >
> > > Good point.
> > >
> > > One possible implementation of suspend could just be a `WFI` in a
> > > higher EL. In this case, KVM could emulate WFI wake up events
> > > according to D1.16.2 in DDI 0487G.a. But I agree, it isn't entirely
> > > clear what constitutes a wakeup from powered down state.
> >
> > It isn't, and it is actually IMPDEF (there isn't much in the ARM ARM
> > in terms of what constitutes a low power state). And even if you
> > wanted to emulate a WFI in userspace, the problem of interrupts that
> > have their source in the kernel remains. How to you tell userspace
> > that such an event has occurred if the vcpu thread isn't in the
> > kernel?
> 
> Well, are there any objections to saying for the KVM implementation we
> observe the WFI wake-up events per the cited section of the ARM ARM?

These are fine. However, what of the GIC, for example? Can any GIC
interrupt wake-up the guest? I'm happy to say "yes" to this, but I
suspect others will have a different idea, and the thought of
introducing an IMPDEF wake-up interrupt controller doesn't fill me
with joy.

> > > > It looks to me that your implementation can only work with userspace
> > > > provided events, which is pretty limited.
> > >
> > > Right. I implemented this from the mindset that userspace may do
> > > something heavyweight when a guest suspends, like save it to a
> > > persistent store to resume later on. No matter what we do in KVM, I
> > > think it's probably best to give userspace the right of first refusal
> > > to handle the suspension.
> >
> > Maybe. But if you want to handle wake-up from interrupts to actually
> > work, you must return to the kernel for the wake-up to occurs.
> >
> > The problem is that you piggyback on an existing feature (suspend) to
> > implement something else (opportunistic save/restore?). Oddly enough
> > the stars don't exactly align! ;-)
> >
> > I have the feeling that a solution to this problem would be to exit to
> > userspace with something indicating an *intent* to suspend. At this
> > stage, userspace can do two things:
> >
> > - resume the guest: the guest may have been moved to some other
> >   machine, cold storage, whatever... The important thing is that the
> >   guest is directly runnable without any extra event
> >
> > - confirm the suspension by returning to the kernel, which will
> >   execute a blocking WFI on behalf of the guest
> >
> > With this, you end-up with something that is works from an interrupt
> > perspective (even for directly injected interrupts), and you can save
> > your guest on suspend.
> 
> This is exactly what I was trying to get at with my last mail,
> although not quite as eloquently stated. So I completely agree.

Ah! Good! :D

> Just to check understanding for v2:
> 
> We agree that an exit to userspace is fine so it has the opportunity
> to do something crazy when the guest attempts a suspend. If a VMM does
> nothing and immediately re-enters the kernel, we emulate the suspend
> there by waiting for some event to fire, which for our purposes we
> will say is an interrupt originating from userspace or the kernel
> (WFI). In all, the SUSPEND exit type does not indicate that emulation
> terminates with the VMM. It only indicates we are about to block in
> the kernel.
> 
> If there is some IMPDEF event specific to the VMM, it should signal
> the vCPU thread to kick it out of the kernel, make it runnable, and
> re-enter. No need to do anything special from the kernel perspective
> for this. This is only for the case where we decide to block in the
> kernel.

This looks sensible. One question though: I think there is an implicit
requirement that the guest should be "migratable" in that state. How
does the above handles it? If the "suspend state" is solely held in
the kernel, we need to be able to snapshot it, and I don't like the
sound of that...

We could instead keep the "suspend state" in the VMM:

On PSCI_SUSPEND, the guest exits to userspace. If the VMM wants to
honour the supend request, it reenters the guest with RUN+SUSPEND,
which results in a WFI. On each wake-up, the guest exits to userspace,
and it is the VMM responsibility to either perform the wake-up (RUN)
or stay in suspend (RUN+SUSPEND).

This ensures that the guest never transitions out of suspend without
the VMM knowing, and the VMM can always force a resume by kicking the
thread back to userspace.

Thoughts?

> > > > Other items worth considering: ongoing DMA, state of the caches at
> > > > suspend time, device state in general All of this really needs to be
> > > > defined before we can move forward with this feature.
> > >
> > > I believe it is largely up to the caller to get devices in a quiesced
> > > state appropriate for a system suspend, but PSCI is delightfully vague
> > > on this topic.
> >
> > Indeed, it only deals with the CPU. Oh look, another opportunity to
> > write a new spec! :)
> >
> > > On the contrary, it is up to KVM's implementation to
> > > guarantee caches are clean when servicing the guest request.
> >
> > This last point is pretty unclear to me. If the guest doesn't clean to
> > the PoC (or even to one of the PoPs) when it calls into suspend,
> > that's a clear indication that it doesn't care about its data. Why
> > should KVM be more conservative here? It shouldn't be in the business
> > of working around guest bugs.
> 
> PSCI is vague on this, sadly. DEN0022D.b, 5.4.8 "Implementation
> responsibilities: Cache and coherency management states" that for
> CPU_SUSPEND, the PSCI implementation must perform a cache clean
> operation before entering the powerdown state. I don't see any reason
> why SYSTEM_SUSPEND should be excluded from this requirement.

I'm not sure that's the case. CPU_SUSPEND may not use the resume
entry-point if the suspend results is a shallower state than expected
(i.e. the call just returns instead of behaving like a CPU boot).

However, a successful SYSTEM_SUSPEND always results in the deepest
possible state. The guest should know that. There is also the fact
that performing a full clean to the PoC is going to be pretty
expensive, and I'd like to avoid that.

I'll try and reach out to some of the ARM folks for clarification on
the matter.

Thanks,

	M.
Oliver Upton Sept. 21, 2021, 6:22 p.m. UTC | #9
Hey Marc,

On Tue, Sep 21, 2021 at 10:45:22AM +0100, Marc Zyngier wrote:
> > > > > - How do you define which interrupts are actual wake-up events?
> > > > >   Nothing in the GIC architecture defines what a wake-up is (let alone
> > > > >   a wake-up event).
> > > >
> > > > Good point.
> > > >
> > > > One possible implementation of suspend could just be a `WFI` in a
> > > > higher EL. In this case, KVM could emulate WFI wake up events
> > > > according to D1.16.2 in DDI 0487G.a. But I agree, it isn't entirely
> > > > clear what constitutes a wakeup from powered down state.
> > >
> > > It isn't, and it is actually IMPDEF (there isn't much in the ARM ARM
> > > in terms of what constitutes a low power state). And even if you
> > > wanted to emulate a WFI in userspace, the problem of interrupts that
> > > have their source in the kernel remains. How to you tell userspace
> > > that such an event has occurred if the vcpu thread isn't in the
> > > kernel?
> > 
> > Well, are there any objections to saying for the KVM implementation we
> > observe the WFI wake-up events per the cited section of the ARM ARM?
> 
> These are fine. However, what of the GIC, for example? Can any GIC
> interrupt wake-up the guest? I'm happy to say "yes" to this, but I
> suspect others will have a different idea, and the thought of
> introducing an IMPDEF wake-up interrupt controller doesn't fill me
> with joy.
>

I'm planning to propose exactly this in the next series; any GIC
interrupt will wake the guest. I'd argue that if someone wants to do
anything else, their window of opportunity is the exit to userspace.

[...]

> > Just to check understanding for v2:
> > 
> > We agree that an exit to userspace is fine so it has the opportunity
> > to do something crazy when the guest attempts a suspend. If a VMM does
> > nothing and immediately re-enters the kernel, we emulate the suspend
> > there by waiting for some event to fire, which for our purposes we
> > will say is an interrupt originating from userspace or the kernel
> > (WFI). In all, the SUSPEND exit type does not indicate that emulation
> > terminates with the VMM. It only indicates we are about to block in
> > the kernel.
> > 
> > If there is some IMPDEF event specific to the VMM, it should signal
> > the vCPU thread to kick it out of the kernel, make it runnable, and
> > re-enter. No need to do anything special from the kernel perspective
> > for this. This is only for the case where we decide to block in the
> > kernel.
> 
> This looks sensible. One question though: I think there is an implicit
> requirement that the guest should be "migratable" in that state. How
> does the above handles it? If the "suspend state" is solely held in
> the kernel, we need to be able to snapshot it, and I don't like the
> sound of that...
> 
> We could instead keep the "suspend state" in the VMM:
> 
> On PSCI_SUSPEND, the guest exits to userspace. If the VMM wants to
> honour the supend request, it reenters the guest with RUN+SUSPEND,
> which results in a WFI. On each wake-up, the guest exits to userspace,
> and it is the VMM responsibility to either perform the wake-up (RUN)
> or stay in suspend (RUN+SUSPEND).
> 
> This ensures that the guest never transitions out of suspend without
> the VMM knowing, and the VMM can always force a resume by kicking the
> thread back to userspace.
> 
> Thoughts?

Agreed. I was mulling on exactly how to clue in the VMM about the
suspend state. What if we just encode it in KVM_{GET,SET}_MP_STATE? We'd
avoid the need for new UAPI that way. We could introduce a new state,
KVM_MP_STATE_SUSPENDED, which would clue KVM to do the suspend as we've
discussed. We would exit to userspace with KVM_MP_STATE_RUNNABLE,
meaning the VMM would need to set the MP state explicitly for the
in-kernel suspend to work.

[...]

> > > > On the contrary, it is up to KVM's implementation to
> > > > guarantee caches are clean when servicing the guest request.
> > >
> > > This last point is pretty unclear to me. If the guest doesn't clean to
> > > the PoC (or even to one of the PoPs) when it calls into suspend,
> > > that's a clear indication that it doesn't care about its data. Why
> > > should KVM be more conservative here? It shouldn't be in the business
> > > of working around guest bugs.
> > 
> > PSCI is vague on this, sadly. DEN0022D.b, 5.4.8 "Implementation
> > responsibilities: Cache and coherency management states" that for
> > CPU_SUSPEND, the PSCI implementation must perform a cache clean
> > operation before entering the powerdown state. I don't see any reason
> > why SYSTEM_SUSPEND should be excluded from this requirement.
> 
> I'm not sure that's the case. CPU_SUSPEND may not use the resume
> entry-point if the suspend results is a shallower state than expected
> (i.e. the call just returns instead of behaving like a CPU boot).
> 
> However, a successful SYSTEM_SUSPEND always results in the deepest
> possible state. The guest should know that. There is also the fact
> that performing a full clean to the PoC is going to be pretty
> expensive, and I'd like to avoid that.
> 
> I'll try and reach out to some of the ARM folks for clarification on
> the matter.

That'd be very helpful!

--
Thanks,
Oliver