mbox series

[RFC,v2,0/4] arm64: Add PSCI v1.3 SYSTEM_OFF2 support for hibernation

Message ID 20240318164646.1010092-1-dwmw2@infradead.org (mailing list archive)
Headers show
Series arm64: Add PSCI v1.3 SYSTEM_OFF2 support for hibernation | expand

Message

David Woodhouse March 18, 2024, 4:14 p.m. UTC
The PSCI v1.3 spec (https://developer.arm.com/documentation/den0022, 
currently in Alpha state, hence 'RFC') adds support for a SYSTEM_OFF2 
function enabling a HIBERNATE_OFF state which is analogous to ACPI S4. 
This will allow hosting environments to determine that a guest is 
hibernated rather than just powered off, and ensure that they preserve 
the virtual environment appropriately to allow the guest to resume 
safely (or bump the hardware_signature in the FACS to trigger a clean 
reboot instead).

This adds support for it to KVM, exactly the same way as the existing 
support for SYSTEM_RESET2 as added in commits d43583b890e7 ("KVM: arm64: 
Expose PSCI SYSTEM_RESET2 call to the guest") and 34739fd95fab ("KVM: 
arm64: Indicate SYSTEM_RESET2 in kvm_run::system_event flags field").

Back then, KVM was unconditionally bumped to expose PSCI v1.1. This 
means that a kernel upgrade causes guest visible behaviour changes 
without any explicit opt-in from the VMM, which is... unconventional. In 
some cases, a PSCI update isn't just about new optional calls; PSCI v1.2 
for example adds a new permitted error return from the existing CPU_ON 
function.

There *is* a way for a VMM to opt *out* of newer PSCI versions... by 
setting a per-vCPU "special" register that actually ends up setting the 
PSCI version KVM-wide. Quite why this isn't just a simple KVM_CAP, I 
have no idea. There *is* a KVM_CAP_ARM_PSCI_0_2 but that's just for 0.1 
vs. 0.2+, not the specific v0.2+ version that's exposed.

Since the SYSTEM_OFF2 call is optional and discoverable through the 
PSCI_FEATURES call, I'm electing not to touch the PSCI versioning 
awfulness at all. Like the existing SYSTEM_RESET2, there's a KVM_CAP to 
enable it explicitly (as it's an optional call even in v1.3), and like 
the existing SYSTEM_RESET2 it doesn't depend on the advertised PSCI 
version.

For the guest side, add a new SYS_OFF_MODE_POWER_OFF handler with higher 
priority than the EFI one, but which *only* triggers when there's a 
hibernation in progress. There are other ways to do this (see the commit
message for more details) but this seemed like the simplest.

Version 2 of the patch series splits out the psci.h definitions into a 
separate commit (a dependency for both the guest and KVM side), and adds 
definitions for the other new functions added in v1.3. It also moves the 
pKVM psci-relay support to a separate commit; although in arch/arm64/kvm 
that's actually about the *guest* side of SYSTEM_OFF2 (i.e. using it
from the host kernel, relayed through nVHE).

David Woodhouse (4):
      firmware/psci: Add definitions for PSCI v1.3 specification (ALPHA)
      KVM: arm64: Add PSCI SYSTEM_OFF2 function for hibernation
      KVM: arm64: nvhe: Pass through PSCI v1.3 SYSTEM_OFF2 call
      arm64: Use SYSTEM_OFF2 PSCI call to power off for hibernate

 Documentation/virt/kvm/api.rst       | 11 +++++++++++
 arch/arm64/include/asm/kvm_host.h    |  2 ++
 arch/arm64/include/uapi/asm/kvm.h    |  6 ++++++
 arch/arm64/kvm/arm.c                 |  5 +++++
 arch/arm64/kvm/hyp/nvhe/psci-relay.c |  2 ++
 arch/arm64/kvm/psci.c                | 37 ++++++++++++++++++++++++++++++++++++
 drivers/firmware/psci/psci.c         | 35 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h             |  1 +
 include/uapi/linux/psci.h            | 20 +++++++++++++++++++
 kernel/power/hibernate.c             |  5 ++++-
 10 files changed, 123 insertions(+), 1 deletion(-)

Comments

Marc Zyngier March 18, 2024, 4:57 p.m. UTC | #1
On Mon, 18 Mar 2024 16:14:22 +0000,
David Woodhouse <dwmw2@infradead.org> wrote:
> 
> The PSCI v1.3 spec (https://developer.arm.com/documentation/den0022, 
> currently in Alpha state, hence 'RFC') adds support for a SYSTEM_OFF2 
> function enabling a HIBERNATE_OFF state which is analogous to ACPI S4. 
> This will allow hosting environments to determine that a guest is 
> hibernated rather than just powered off, and ensure that they preserve 
> the virtual environment appropriately to allow the guest to resume 
> safely (or bump the hardware_signature in the FACS to trigger a clean 
> reboot instead).
> 
> This adds support for it to KVM, exactly the same way as the existing 
> support for SYSTEM_RESET2 as added in commits d43583b890e7 ("KVM: arm64: 
> Expose PSCI SYSTEM_RESET2 call to the guest") and 34739fd95fab ("KVM: 
> arm64: Indicate SYSTEM_RESET2 in kvm_run::system_event flags field").
> 
> Back then, KVM was unconditionally bumped to expose PSCI v1.1. This 
> means that a kernel upgrade causes guest visible behaviour changes 
> without any explicit opt-in from the VMM, which is... unconventional. In 
> some cases, a PSCI update isn't just about new optional calls; PSCI v1.2 
> for example adds a new permitted error return from the existing CPU_ON 
> function.
> 
> There *is* a way for a VMM to opt *out* of newer PSCI versions... by 
> setting a per-vCPU "special" register that actually ends up setting the 
> PSCI version KVM-wide. Quite why this isn't just a simple KVM_CAP, I 
> have no idea.

Because the expectations are that the VMM can blindly save/restore the
guest's state, including the PSCI version, and restore that blindly.
KVM CAPs are just a really bad design pattern for this sort of things.

	M.
David Woodhouse March 18, 2024, 5:26 p.m. UTC | #2
On Mon, 2024-03-18 at 16:57 +0000, Marc Zyngier wrote:
> 
> > 
> > There *is* a way for a VMM to opt *out* of newer PSCI versions... by 
> > setting a per-vCPU "special" register that actually ends up setting the 
> > PSCI version KVM-wide. Quite why this isn't just a simple KVM_CAP, I 
> > have no idea.
> 
> Because the expectations are that the VMM can blindly save/restore the
> guest's state, including the PSCI version, and restore that blindly.
> KVM CAPs are just a really bad design pattern for this sort of things.

Hm, am I missing something here? Does the *guest* get to set the PSCI
version somehow, and opt into the latest version that it understands
regardless of what the firmware/host can support?

Because if not, surely it's just part of the basic shape of the
machine, like "how many vCPUs does it have". You don't need to be able
to query it back again. 

I don't think we ever aspired to be able to hand an arbitrary KVM fd to
a userspace VMM and have the VMM be able to drive that VM without
having any a priori context, did we?
Marc Zyngier March 18, 2024, 5:41 p.m. UTC | #3
On Mon, 18 Mar 2024 17:26:07 +0000,
David Woodhouse <dwmw2@infradead.org> wrote:
> 
> [1  <text/plain; UTF-8 (quoted-printable)>]
> On Mon, 2024-03-18 at 16:57 +0000, Marc Zyngier wrote:
> > 
> > > 
> > > There *is* a way for a VMM to opt *out* of newer PSCI versions... by 
> > > setting a per-vCPU "special" register that actually ends up setting the 
> > > PSCI version KVM-wide. Quite why this isn't just a simple KVM_CAP, I 
> > > have no idea.
> > 
> > Because the expectations are that the VMM can blindly save/restore the
> > guest's state, including the PSCI version, and restore that blindly.
> > KVM CAPs are just a really bad design pattern for this sort of things.
> 
> Hm, am I missing something here? Does the *guest* get to set the PSCI
> version somehow, and opt into the latest version that it understands
> regardless of what the firmware/host can support?

No. The *VMM* sets the PSCI version by writing to a pseudo register.
It means that when the guest migrates, the VMM saves and restores that
version, and the guest doesn't see any change.

The host firmware has nothing to do with it, obviously. This is all
about KVM's own implementation of the "firmware", as seen by the guest.

> Because if not, surely it's just part of the basic shape of the
> machine, like "how many vCPUs does it have". You don't need to be able
> to query it back again.

Nobody needs to do this.

> I don't think we ever aspired to be able to hand an arbitrary KVM fd to
> a userspace VMM and have the VMM be able to drive that VM without
> having any a priori context, did we?

Arbitrary? No. This is actually very specific and pretty well
documented.

Also, to answer your question about why we treat 0.1 differently from
0.2+: 0.1 didn't specify the PSCI SMC/HCR encoding, meaning that KVM
implemented something that was never fully specified. The VMM has to
provide firmware tables that describe that. With 0.2+, there is a
standard encoding for all functions, and the VMM doesn't have to
provide the encoding to the guest.

	M.
David Woodhouse March 18, 2024, 6:15 p.m. UTC | #4
On Mon, 2024-03-18 at 17:41 +0000, Marc Zyngier wrote:
> On Mon, 18 Mar 2024 17:26:07 +0000,
> David Woodhouse <dwmw2@infradead.org> wrote:
> > 
> > [1  <text/plain; UTF-8 (quoted-printable)>]
> > On Mon, 2024-03-18 at 16:57 +0000, Marc Zyngier wrote:
> > > 
> > > > 
> > > > There *is* a way for a VMM to opt *out* of newer PSCI versions... by 
> > > > setting a per-vCPU "special" register that actually ends up setting the 
> > > > PSCI version KVM-wide. Quite why this isn't just a simple KVM_CAP, I 
> > > > have no idea.
> > > 
> > > Because the expectations are that the VMM can blindly save/restore the
> > > guest's state, including the PSCI version, and restore that blindly.
> > > KVM CAPs are just a really bad design pattern for this sort of things.
> > 
> > Hm, am I missing something here? Does the *guest* get to set the PSCI
> > version somehow, and opt into the latest version that it understands
> > regardless of what the firmware/host can support?
> 
> No. The *VMM* sets the PSCI version by writing to a pseudo register.
> It means that when the guest migrates, the VMM saves and restores that
> version, and the guest doesn't see any change.

And when you boot a guest image which has been working for years under
a new kernel+KVM, your guest suddenly experiences a new PSCI version.
As I said that's not just new optional functions; it's potentially even
returning new error codes to the functions that said guest was already
using.

And when you *hibernate* a guest and then launch it again under a newer
kernel+KVM, it experiences the same incompatibility.

Unless the VMM realises this problem and opts *out* of the newer KVM
behaviour, of course. This is very much unlike how we *normally* expose
new KVM capabilities.

> > I don't think we ever aspired to be able to hand an arbitrary KVM fd to
> > a userspace VMM and have the VMM be able to drive that VM without
> > having any a priori context, did we?
> 
> Arbitrary? No. This is actually very specific and pretty well
> documented.
> 
> Also, to answer your question about why we treat 0.1 differently from
> 0.2+: 0.1 didn't specify the PSCI SMC/HCR encoding, meaning that KVM
> implemented something that was never fully specified. The VMM has to
> provide firmware tables that describe that. With 0.2+, there is a
> standard encoding for all functions, and the VMM doesn't have to
> provide the encoding to the guest.

Gotcha. So for that case we were *forced* to do things correctly and
allow userspace to opt-in to the capability. While for 0.2 onwards we
got away with this awfulness of silently upgrading the version without
VMM consent.

I was hoping to just follow the existing model of SYSTEM_RESET2 and not
have to touch this awfulness with a barge-pole, but sure, whatever you
want.
Marc Zyngier March 18, 2024, 6:31 p.m. UTC | #5
On Mon, 18 Mar 2024 18:15:36 +0000,
David Woodhouse <dwmw2@infradead.org> wrote:
> 
> [1  <text/plain; UTF-8 (quoted-printable)>]
> On Mon, 2024-03-18 at 17:41 +0000, Marc Zyngier wrote:
> > On Mon, 18 Mar 2024 17:26:07 +0000,
> > David Woodhouse <dwmw2@infradead.org> wrote:
> > > 
> > > [1  <text/plain; UTF-8 (quoted-printable)>]
> > > On Mon, 2024-03-18 at 16:57 +0000, Marc Zyngier wrote:
> > > > 
> > > > > 
> > > > > There *is* a way for a VMM to opt *out* of newer PSCI versions... by 
> > > > > setting a per-vCPU "special" register that actually ends up setting the 
> > > > > PSCI version KVM-wide. Quite why this isn't just a simple KVM_CAP, I 
> > > > > have no idea.
> > > > 
> > > > Because the expectations are that the VMM can blindly save/restore the
> > > > guest's state, including the PSCI version, and restore that blindly.
> > > > KVM CAPs are just a really bad design pattern for this sort of things.
> > > 
> > > Hm, am I missing something here? Does the *guest* get to set the PSCI
> > > version somehow, and opt into the latest version that it understands
> > > regardless of what the firmware/host can support?
> > 
> > No. The *VMM* sets the PSCI version by writing to a pseudo register.
> > It means that when the guest migrates, the VMM saves and restores that
> > version, and the guest doesn't see any change.
> 
> And when you boot a guest image which has been working for years under
> a new kernel+KVM, your guest suddenly experiences a new PSCI version.
> As I said that's not just new optional functions; it's potentially even
> returning new error codes to the functions that said guest was already
> using.

If you want to stick to a given PSCI version, you write the version
you want.

> 
> And when you *hibernate* a guest and then launch it again under a newer
> kernel+KVM, it experiences the same incompatibility.
> 
> Unless the VMM realises this problem and opts *out* of the newer KVM
> behaviour, of course. This is very much unlike how we *normally* expose
> new KVM capabilities.

This was discussed at length 5 or 6 years ago (opt-in vs opt-out).

The feedback from QEMU (which is the only public VMM that does
anything remotely useful with this) was that opt-out was a better
model, specially as PSCI is the conduit for advertising the Spectre
mitigations and users (such as certain cloud vendors) were pretty keen
on guests seeing the mitigations advertised *by default*.

And if you can spot any form of "normality" in the KVM interface, I'll
buy you whatever beer you want. It is all inconsistent crap, so I
think we're in pretty good company here.

> 
> > > I don't think we ever aspired to be able to hand an arbitrary KVM fd to
> > > a userspace VMM and have the VMM be able to drive that VM without
> > > having any a priori context, did we?
> > 
> > Arbitrary? No. This is actually very specific and pretty well
> > documented.
> > 
> > Also, to answer your question about why we treat 0.1 differently from
> > 0.2+: 0.1 didn't specify the PSCI SMC/HCR encoding, meaning that KVM
> > implemented something that was never fully specified. The VMM has to
> > provide firmware tables that describe that. With 0.2+, there is a
> > standard encoding for all functions, and the VMM doesn't have to
> > provide the encoding to the guest.
> 
> Gotcha. So for that case we were *forced* to do things correctly and
> allow userspace to opt-in to the capability. While for 0.2 onwards we
> got away with this awfulness of silently upgrading the version without
> VMM consent.
> 
> I was hoping to just follow the existing model of SYSTEM_RESET2 and not
> have to touch this awfulness with a barge-pole, but sure, whatever you
> want.

Unless I'm reading the whole thing wrong (which isn't impossible given
that I'm jet-lagged to my eyeballs), SYSTEM_RESET2 doesn't have any
form of configuration. If PSCI 1.1 is selected, SYSTEM_RESET2 is
available. So that'd be the model to follow.

	M.
David Woodhouse March 18, 2024, 6:36 p.m. UTC | #6
On Mon, 2024-03-18 at 18:31 +0000, Marc Zyngier wrote:
> On Mon, 18 Mar 2024 18:15:36 +0000,
> David Woodhouse <dwmw2@infradead.org> wrote:
> > 
> > [1  <text/plain; UTF-8 (quoted-printable)>]
> > On Mon, 2024-03-18 at 17:41 +0000, Marc Zyngier wrote:
> > > On Mon, 18 Mar 2024 17:26:07 +0000,
> > > David Woodhouse <dwmw2@infradead.org> wrote:
> > > > 
> > > > [1  <text/plain; UTF-8 (quoted-printable)>]
> > > > On Mon, 2024-03-18 at 16:57 +0000, Marc Zyngier wrote:
> > > > > 
> > > > > > 
> > > > > > There *is* a way for a VMM to opt *out* of newer PSCI versions... by 
> > > > > > setting a per-vCPU "special" register that actually ends up setting the 
> > > > > > PSCI version KVM-wide. Quite why this isn't just a simple KVM_CAP, I 
> > > > > > have no idea.
> > > > > 
> > > > > Because the expectations are that the VMM can blindly save/restore the
> > > > > guest's state, including the PSCI version, and restore that blindly.
> > > > > KVM CAPs are just a really bad design pattern for this sort of things.
> > > > 
> > > > Hm, am I missing something here? Does the *guest* get to set the PSCI
> > > > version somehow, and opt into the latest version that it understands
> > > > regardless of what the firmware/host can support?
> > > 
> > > No. The *VMM* sets the PSCI version by writing to a pseudo register.
> > > It means that when the guest migrates, the VMM saves and restores that
> > > version, and the guest doesn't see any change.
> > 
> > And when you boot a guest image which has been working for years under
> > a new kernel+KVM, your guest suddenly experiences a new PSCI version.
> > As I said that's not just new optional functions; it's potentially even
> > returning new error codes to the functions that said guest was already
> > using.
> 
> If you want to stick to a given PSCI version, you write the version
> you want.
> 
> > 
> > And when you *hibernate* a guest and then launch it again under a newer
> > kernel+KVM, it experiences the same incompatibility.
> > 
> > Unless the VMM realises this problem and opts *out* of the newer KVM
> > behaviour, of course. This is very much unlike how we *normally* expose
> > new KVM capabilities.
> 
> This was discussed at length 5 or 6 years ago (opt-in vs opt-out).
> 
> The feedback from QEMU (which is the only public VMM that does
> anything remotely useful with this) was that opt-out was a better
> model, specially as PSCI is the conduit for advertising the Spectre
> mitigations and users (such as certain cloud vendors) were pretty keen
> on guests seeing the mitigations advertised *by default*.

OK.

> And if you can spot any form of "normality" in the KVM interface, I'll
> buy you whatever beer you want. It is all inconsistent crap, so I
> think we're in pretty good company here.

I'll give you that one :)

> > 
> > > > I don't think we ever aspired to be able to hand an arbitrary KVM fd to
> > > > a userspace VMM and have the VMM be able to drive that VM without
> > > > having any a priori context, did we?
> > > 
> > > Arbitrary? No. This is actually very specific and pretty well
> > > documented.
> > > 
> > > Also, to answer your question about why we treat 0.1 differently from
> > > 0.2+: 0.1 didn't specify the PSCI SMC/HCR encoding, meaning that KVM
> > > implemented something that was never fully specified. The VMM has to
> > > provide firmware tables that describe that. With 0.2+, there is a
> > > standard encoding for all functions, and the VMM doesn't have to
> > > provide the encoding to the guest.
> > 
> > Gotcha. So for that case we were *forced* to do things correctly and
> > allow userspace to opt-in to the capability. While for 0.2 onwards we
> > got away with this awfulness of silently upgrading the version without
> > VMM consent.
> > 
> > I was hoping to just follow the existing model of SYSTEM_RESET2 and not
> > have to touch this awfulness with a barge-pole, but sure, whatever you
> > want.
> 
> Unless I'm reading the whole thing wrong (which isn't impossible given
> that I'm jet-lagged to my eyeballs), SYSTEM_RESET2 doesn't have any
> form of configuration. If PSCI 1.1 is selected, SYSTEM_RESET2 is
> available. So that'd be the model to follow.

Sorry, that was supposed to be SYSTEM_SUSPEND not SYSTEM_RESET2. But
OK.