[RFC,29/32] KVM: arm64: Pass hypercalls to userspace

Message ID 20230203135043.409192-30-james.morse@arm.com (mailing list archive)
State RFC, archived
Series ACPI/arm64: add support for virtual cpuhotplug

Commit Message

James Morse Feb. 3, 2023, 1:50 p.m. UTC
From: Jean-Philippe Brucker <jean-philippe@linaro.org>

When capability KVM_CAP_ARM_HVC_TO_USER is available, userspace can
request to handle all hypercalls that aren't handled by KVM. With the
help of another capability, this will allow userspace to handle PSCI
calls.

Suggested-by: James Morse <james.morse@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: James Morse <james.morse@arm.com>

---

Notes on this implementation:

* A similar mechanism was proposed for SDEI some time ago [1]. This RFC
  generalizes the idea to all hypercalls, since that was suggested on
  the list [2, 3].

* We're reusing kvm_run.hypercall. I copied x0-x5 into
  kvm_run.hypercall.args[] to help userspace but I'm tempted to remove
  this, because:
  - Most user handlers will need to write results back into the
    registers (x0-x3 for SMCCC), so if we keep this shortcut we should
    go all the way and read them back on return to kernel.
  - QEMU doesn't care about this shortcut, it pulls all vcpu regs before
    handling the call.
  - SMCCC uses x0-x16 for parameters.
  x0 does contain the SMCCC function ID and may be useful for fast
  dispatch, we could keep that plus the immediate number.

* Add a flag in the kvm_run.hypercall telling whether this is HVC or
  SMC?  Can be added later in those bottom longmode and pad fields.

* On top of this we could share with userspace which HVC ranges are
  available and which ones are handled by KVM. That can actually be added
  independently, through a vCPU/VM device attribute which doesn't consume
  a new ioctl:
  - userspace issues HAS_ATTR ioctl on the vcpu fd to query whether this
    feature is available.
  - userspace queries the number N of HVC ranges using one GET_ATTR.
  - userspace passes an array of N ranges using another GET_ATTR. The
    array is filled and returned by KVM.

* Enabling this using a vCPU arch feature rather than the whole-VM
  capability would be fine, but it would be difficult to do the same for
  the following psci-in-user capability. So let's enable everything at
  the VM scope.

* No idea whether this works out of the box for AArch32 guests.
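
* For illustration, a minimal sketch of how a VMM might consume this
  exit with the uAPI proposed here. handle_smccc() and its return
  convention are hypothetical; the rest is existing KVM uAPI plus this
  patch. Since KVM doesn't (yet) read registers back on return, the
  result is written with SET_ONE_REG:

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Hypothetical dispatcher: returns the SMCCC result for x0. */
	extern uint64_t handle_smccc(uint64_t imm, const uint64_t *args);

	static void enable_hvc_forwarding(int vm_fd)
	{
		/* Once at VM setup: forward hypercalls KVM doesn't handle. */
		struct kvm_enable_cap cap = { .cap = KVM_CAP_ARM_HVC_TO_USER };

		ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
	}

	/* In the vcpu thread; run is the mmap()ed kvm_run of vcpu_fd. */
	static void handle_exit(int vcpu_fd, struct kvm_run *run)
	{
		if (run->exit_reason != KVM_EXIT_HYPERCALL)
			return;

		/* args[0] is x0, the SMCCC function ID; nr is the #imm. */
		uint64_t x0 = handle_smccc(run->hypercall.nr,
					   run->hypercall.args);

		/* Write the SMCCC result back into the guest's x0. */
		struct kvm_one_reg reg = {
			.id = KVM_REG_ARM64 | KVM_REG_SIZE_U64 |
			      KVM_REG_ARM_CORE |
			      KVM_REG_ARM_CORE_REG(regs.regs[0]),
			.addr = (uint64_t)&x0,
		};
		ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);
	}

* A rough sketch of the vCPU device-attribute flow above, for
  discussion only - none of the group/attr names exist yet, and the
  range layout is made up:

	struct hvc_range { uint64_t start, end; };	/* hypothetical */

	static int get_hvc_ranges(int vcpu_fd, struct hvc_range *ranges,
				  uint64_t *nr)
	{
		struct kvm_device_attr attr = {
			.group = KVM_ARM_VCPU_HVC_RANGES,	/* hypothetical */
			.attr  = KVM_ARM_VCPU_HVC_NR_RANGES,	/* hypothetical */
			.addr  = (uint64_t)nr,
		};

		if (ioctl(vcpu_fd, KVM_HAS_DEVICE_ATTR, &attr))
			return -1;
		/* How many ranges? */
		if (ioctl(vcpu_fd, KVM_GET_DEVICE_ATTR, &attr))
			return -1;
		/* Fetch the array itself, filled by KVM. */
		attr.attr = KVM_ARM_VCPU_HVC_RANGE_LIST;	/* hypothetical */
		attr.addr = (uint64_t)ranges;
		return ioctl(vcpu_fd, KVM_GET_DEVICE_ATTR, &attr);
	}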

[1] https://lore.kernel.org/linux-arm-kernel/20170808164616.25949-12-james.morse@arm.com/
[2] https://lore.kernel.org/linux-arm-kernel/bf7e83f1-c58e-8d65-edd0-d08f27b8b766@arm.com/
[3] https://lore.kernel.org/linux-arm-kernel/f56cf420-affc-35f0-2355-801a924b8a35@arm.com/
---
 Documentation/virt/kvm/api.rst    | 17 +++++++++++++++--
 arch/arm64/include/asm/kvm_host.h |  1 +
 arch/arm64/kvm/arm.c              |  5 +++++
 arch/arm64/kvm/hypercalls.c       | 28 +++++++++++++++++++++++++++-
 include/kvm/arm_psci.h            |  4 ++++
 include/uapi/linux/kvm.h          |  1 +
 6 files changed, 53 insertions(+), 3 deletions(-)

Comments

Oliver Upton Feb. 3, 2023, 9:08 p.m. UTC | #1
Hi James,

On Fri, Feb 03, 2023 at 01:50:40PM +0000, James Morse wrote:
> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> 
> When capability KVM_CAP_ARM_HVC_TO_USER is available, userspace can
> request to handle all hypercalls that aren't handled by KVM.

I would very much prefer we not go down this route. This capability
effectively constructs an ABI out of what KVM presently does not
implement. What would happen if KVM decides to implement a new set
of hypercalls later down the road that were previously forwarded to
userspace?

Instead of a catch-all I think we should take the approach of having
userspace explicitly request which hypercalls should be forwarded to
userspace. I proposed something similar [1], but never got around to
respinning it (oops).

Let me dust those patches off and align with Marc's suggestions.

[1]: https://lore.kernel.org/kvmarm/20221110015327.3389351-1-oliver.upton@linux.dev/

--
Thanks,
Oliver
Marc Zyngier Feb. 5, 2023, 10:12 a.m. UTC | #2
On Fri, 03 Feb 2023 13:50:40 +0000,
James Morse <james.morse@arm.com> wrote:
> 
> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> 
> When capability KVM_CAP_ARM_HVC_TO_USER is available, userspace can
> request to handle all hypercalls that aren't handled by KVM. With the
> help of another capability, this will allow userspace to handle PSCI
> calls.
> 
> Suggested-by: James Morse <james.morse@arm.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: James Morse <james.morse@arm.com>
> 
> ---
> 

On top of Oliver's ask not to make this a blanket "steal everything",
but instead to have an actual request for ranges of forwarded
hypercalls:

> Notes on this implementation:
> 
> * A similar mechanism was proposed for SDEI some time ago [1]. This RFC
>   generalizes the idea to all hypercalls, since that was suggested on
>   the list [2, 3].
> 
> * We're reusing kvm_run.hypercall. I copied x0-x5 into
>   kvm_run.hypercall.args[] to help userspace but I'm tempted to remove
>   this, because:
>   - Most user handlers will need to write results back into the
>     registers (x0-x3 for SMCCC), so if we keep this shortcut we should
>     go all the way and read them back on return to kernel.
>   - QEMU doesn't care about this shortcut, it pulls all vcpu regs before
>     handling the call.
>   - SMCCC uses x0-x16 for parameters.
>   x0 does contain the SMCCC function ID and may be useful for fast
>   dispatch, we could keep that plus the immediate number.
> 
> * Add a flag in the kvm_run.hypercall telling whether this is HVC or
>   SMC?  Can be added later in those bottom longmode and pad fields.

We definitely need this. A nested hypervisor can (and does) use SMCs
as the conduit. The question is whether they represent two distinct
namespaces or not. I *think* we can unify them, but someone should
check and maybe get clarification from the owners of the SMCCC spec.

>
> * On top of this we could share with userspace which HVC ranges are
>   available and which ones are handled by KVM. That can actually be added
>   independently, through a vCPU/VM device attribute which doesn't consume
>   a new ioctl:
>   - userspace issues HAS_ATTR ioctl on the vcpu fd to query whether this
>     feature is available.
>   - userspace queries the number N of HVC ranges using one GET_ATTR.
>   - userspace passes an array of N ranges using another GET_ATTR. The
>     array is filled and returned by KVM.

As mentioned above, I think this interface should go both ways.
Userspace should request the forwarding of a certain range of
hypercalls via a similar SET_ATTR interface.

Another question is how we migrate VMs that have these forwarding
requirements. Do we expect the VMM to replay the forwarding as part of
the setting up on the other side? Or do we save/restore this via a
firmware pseudo-register?

> 
> * Enabling this using a vCPU arch feature rather than the whole-VM
>   capability would be fine, but it would be difficult to do the same for
>   the following psci-in-user capability. So let's enable everything at
>   the VM scope.

Absolutely.

Thanks,

	M.
Suzuki K Poulose Feb. 6, 2023, 10:10 a.m. UTC | #3
Hi,

A few cents from the Realm support point of view.

On 05/02/2023 10:12, Marc Zyngier wrote:
> On Fri, 03 Feb 2023 13:50:40 +0000,
> James Morse <james.morse@arm.com> wrote:
>>
>> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
>>
>> When capability KVM_CAP_ARM_HVC_TO_USER is available, userspace can
>> request to handle all hypercalls that aren't handled by KVM. With the
>> help of another capability, this will allow userspace to handle PSCI
>> calls.
>>
>> Suggested-by: James Morse <james.morse@arm.com>
>> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
>> Signed-off-by: James Morse <james.morse@arm.com>
>>
>> ---
>>
> 
> On top of Oliver's ask not to make this a blanket "steal everything",
> but instead to have an actual request for ranges of forwarded
> hypercalls:
> 
>> Notes on this implementation:
>>
>> * A similar mechanism was proposed for SDEI some time ago [1]. This RFC
>>    generalizes the idea to all hypercalls, since that was suggested on
>>    the list [2, 3].
>>
>> * We're reusing kvm_run.hypercall. I copied x0-x5 into
>>    kvm_run.hypercall.args[] to help userspace but I'm tempted to remove
>>    this, because:
>>    - Most user handlers will need to write results back into the
>>      registers (x0-x3 for SMCCC), so if we keep this shortcut we should
>>      go all the way and read them back on return to kernel.
>>    - QEMU doesn't care about this shortcut, it pulls all vcpu regs before
>>      handling the call.

This may not always be possible, e.g., for Realms. GET_ONE_REG is
not supported. So explicitly passing down the args is
preferable.

Thanks
Suzuki
Marc Zyngier Feb. 6, 2023, 12:31 p.m. UTC | #4
On Mon, 06 Feb 2023 10:10:41 +0000,
Suzuki K Poulose <suzuki.poulose@arm.com> wrote:
> 
> Hi,
> 
> A few cents from the Realm support point of view.
> 
> On 05/02/2023 10:12, Marc Zyngier wrote:
> > On Fri, 03 Feb 2023 13:50:40 +0000,
> > James Morse <james.morse@arm.com> wrote:
> >> 
> >> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> >> 
> >> When capability KVM_CAP_ARM_HVC_TO_USER is available, userspace can
> >> request to handle all hypercalls that aren't handled by KVM. With the
> >> help of another capability, this will allow userspace to handle PSCI
> >> calls.
> >> 
> >> Suggested-by: James Morse <james.morse@arm.com>
> >> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> >> Signed-off-by: James Morse <james.morse@arm.com>
> >> 
> >> ---
> >> 
> > 
> > On top of Oliver's ask not to make this a blanket "steal everything",
> > but instead to have an actual request for ranges of forwarded
> > hypercalls:
> > 
> >> Notes on this implementation:
> >> 
> >> * A similar mechanism was proposed for SDEI some time ago [1]. This RFC
> >>    generalizes the idea to all hypercalls, since that was suggested on
> >>    the list [2, 3].
> >> 
> >> * We're reusing kvm_run.hypercall. I copied x0-x5 into
> >>    kvm_run.hypercall.args[] to help userspace but I'm tempted to remove
> >>    this, because:
> >>    - Most user handlers will need to write results back into the
> >>      registers (x0-x3 for SMCCC), so if we keep this shortcut we should
> >>      go all the way and read them back on return to kernel.
> >>    - QEMU doesn't care about this shortcut, it pulls all vcpu regs before
> >>      handling the call.
> 
> This may not be always possible, e.g., for Realms. GET_ONE_REG is
> not supported. So using an explicit passing down of the args is
> preferrable.

What is the blocker for CCA to use GET_ONE_REG? The value obviously
exists and is made available to the host. pKVM is perfectly able to
use GET_ONE_REG and gets a bunch of zeroes for things that the
hypervisor has decided to hide from the host.

Of course, it requires that the hypervisor (the RMM in your case)
knows about the semantics of the hypercall, but that's obviously
already a requirement (or you wouldn't be able to use PSCI at all).

Thanks,

	M.
Oliver Upton Feb. 6, 2023, 5:19 p.m. UTC | #5
On Sun, Feb 05, 2023 at 10:12:49AM +0000, Marc Zyngier wrote:
> On Fri, 03 Feb 2023 13:50:40 +0000,
> James Morse <james.morse@arm.com> wrote:
> > 
> > From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > 
> > When capability KVM_CAP_ARM_HVC_TO_USER is available, userspace can
> > request to handle all hypercalls that aren't handled by KVM. With the
> > help of another capability, this will allow userspace to handle PSCI
> > calls.
> > 
> > Suggested-by: James Morse <james.morse@arm.com>
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: James Morse <james.morse@arm.com>
> > 
> > ---
> > 
> 
> On top of Oliver's ask not to make this a blanket "steal everything",
> but instead to have an actual request for ranges of forwarded
> hypercalls:
> 
> > Notes on this implementation:
> > 
> > * A similar mechanism was proposed for SDEI some time ago [1]. This RFC
> >   generalizes the idea to all hypercalls, since that was suggested on
> >   the list [2, 3].
> > 
> > * We're reusing kvm_run.hypercall. I copied x0-x5 into
> >   kvm_run.hypercall.args[] to help userspace but I'm tempted to remove
> >   this, because:
> >   - Most user handlers will need to write results back into the
> >     registers (x0-x3 for SMCCC), so if we keep this shortcut we should
> >     go all the way and read them back on return to kernel.
> >   - QEMU doesn't care about this shortcut, it pulls all vcpu regs before
> >     handling the call.
> >   - SMCCC uses x0-x16 for parameters.
> >   x0 does contain the SMCCC function ID and may be useful for fast
> >   dispatch, we could keep that plus the immediate number.
> > 
> > * Add a flag in the kvm_run.hypercall telling whether this is HVC or
> >   SMC?  Can be added later in those bottom longmode and pad fields.
> 
> We definitely need this. A nested hypervisor can (and does) use SMCs
> as the conduit. The question is whether they represent two distinct
> namespaces or not. I *think* we can unify them, but someone should
> check and maybe get clarification from the owners of the SMCCC spec.
> 
> >
> > * On top of this we could share with userspace which HVC ranges are
> >   available and which ones are handled by KVM. That can actually be added
> >   independently, through a vCPU/VM device attribute which doesn't consume
> >   a new ioctl:
> >   - userspace issues HAS_ATTR ioctl on the vcpu fd to query whether this
> >     feature is available.
> >   - userspace queries the number N of HVC ranges using one GET_ATTR.
> >   - userspace passes an array of N ranges using another GET_ATTR. The
> >     array is filled and returned by KVM.
> 
> As mentioned above, I think this interface should go both ways.
> Userspace should request the forwarding of a certain range of
> hypercalls via a similar SET_ATTR interface.
> 
> Another question is how we migrate VMs that have these forwarding
> requirements. Do we expect the VMM to replay the forwarding as part of
> the setting up on the other side? Or do we save/restore this via a
> firmware pseudo-register?

Personally I'd prefer we left that job to userspace.

We could also implement GET_ATTR, in case userspace has forgotten what
it wrote to the hypercall filter. The firmware pseudo-registers are
handy for moving KVM state back and forth 'for free', but I don't think
we need to bend over backwards to migrate state userspace is directly
responsible for.
Suzuki K Poulose Feb. 7, 2023, 9:41 a.m. UTC | #6
Hi Marc,

On 06/02/2023 12:31, Marc Zyngier wrote:
> On Mon, 06 Feb 2023 10:10:41 +0000,
> Suzuki K Poulose <suzuki.poulose@arm.com> wrote:
>>
>> Hi,
>>
>> A few cents from the Realm support point of view.
>>
>> On 05/02/2023 10:12, Marc Zyngier wrote:
>>> On Fri, 03 Feb 2023 13:50:40 +0000,
>>> James Morse <james.morse@arm.com> wrote:
>>>>
>>>> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
>>>>
>>>> When capability KVM_CAP_ARM_HVC_TO_USER is available, userspace can
>>>> request to handle all hypercalls that aren't handled by KVM. With the
>>>> help of another capability, this will allow userspace to handle PSCI
>>>> calls.
>>>>
>>>> Suggested-by: James Morse <james.morse@arm.com>
>>>> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
>>>> Signed-off-by: James Morse <james.morse@arm.com>
>>>>
>>>> ---
>>>>
>>>
>>> On top of Oliver's ask not to make this a blanket "steal everything",
>>> but instead to have an actual request for ranges of forwarded
>>> hypercalls:
>>>
>>>> Notes on this implementation:
>>>>
>>>> * A similar mechanism was proposed for SDEI some time ago [1]. This RFC
>>>>     generalizes the idea to all hypercalls, since that was suggested on
>>>>     the list [2, 3].
>>>>
>>>> * We're reusing kvm_run.hypercall. I copied x0-x5 into
>>>>     kvm_run.hypercall.args[] to help userspace but I'm tempted to remove
>>>>     this, because:
>>>>     - Most user handlers will need to write results back into the
>>>>       registers (x0-x3 for SMCCC), so if we keep this shortcut we should
>>>>       go all the way and read them back on return to kernel.
>>>>     - QEMU doesn't care about this shortcut, it pulls all vcpu regs before
>>>>       handling the call.
>>
>> This may not always be possible, e.g., for Realms. GET_ONE_REG is
>> not supported. So explicitly passing down the args is
>> preferable.
> 
> What is the blocker for CCA to use GET_ONE_REG? The value obviously
> exists and is made available to the host. pKVM is perfectly able to
> use GET_ONE_REG and gets a bunch of zeroes for things that the
> hypervisor has decided to hide from the host.
> 

It is not impossible. On a "HOST CALL" (an explicit call to the Host from
the Realm), the GPRs are made available to the host, can be stashed into
the vcpu reg state, and the request can be serviced. However, it is a bit
odd to make this exception - "GET_ONE_REG is valid now" - while in
almost all other cases it is invalid (with the exception of MMIO).

Of course we could always return what is stashed in the vcpu state,
which may be invalid/0. But given the construct of "the host doesn't
have access to the register state", it may be better to make the
request always fail, to indicate that the Host is probably doing
something wrong, rather than silently passing on incorrect information.


> Of course, it requires that the hypervisor (the RMM in your case)
> knows about the semantics of the hypercall, but that's obviously

RMM doesn't care about the semantics of hypercall, other than
considering it just like an SMCCC compliant call. The hypercall
arguments/results are passed down/up by the Realm in a separate structure.

> already a requirement (or you wouldn't be able to use PSCI at all).

Realm PSCI calls are always serviced by the RMM. RMM may request
the Hyp for specific information in certain cases, but that doesn't
need to go down to the VMM.

Thanks
Suzuki


> 
> Thanks,
> 
> 	M.
>
Marc Zyngier Feb. 7, 2023, 11:23 a.m. UTC | #7
On Tue, 07 Feb 2023 09:41:54 +0000,
Suzuki K Poulose <suzuki.poulose@arm.com> wrote:
> 
> Hi Marc,
> 
> On 06/02/2023 12:31, Marc Zyngier wrote:
> > On Mon, 06 Feb 2023 10:10:41 +0000,
> > Suzuki K Poulose <suzuki.poulose@arm.com> wrote:
> >> 
> >> This may not always be possible, e.g., for Realms. GET_ONE_REG is
> >> not supported. So explicitly passing down the args is
> >> preferable.
> > 
> > What is the blocker for CCA to use GET_ONE_REG? The value obviously
> > exists and is made available to the host. pKVM is perfectly able to
> > use GET_ONE_REG and gets a bunch of zeroes for things that the
> > hypervisor has decided to hide from the host.
> > 
> 
> It is not impossible. On a "HOST CALL" (an explicit call to the Host
> from the Realm), the GPRs are made available to the host, can be
> stashed into the vcpu reg state, and the request can be
> serviced. However, it is a bit odd to make this exception -
> "GET_ONE_REG is valid now" - while in almost all other cases it is
> invalid (with the exception of MMIO).

But that's an RMM decision. If the RMM decides to forward the
hypercall to the host (irrespective of the potential forwarding to
userspace), it makes the GPRs available.

If the hypercall is forwarded to userspace, then the host is
responsible to check with the RMM that it will be willing to provide
the required information (passed as GPRs or not).

> Of course we could always return what is stashed in the vcpu state,
> which may be invalid/0. But given the construct of "the host doesn't
> have access to the register state", it may be better to make the
> request always fail, to indicate that the Host is probably doing
> something wrong, rather than silently passing on incorrect information.

I disagree. Either you fail at the delegation point, or you don't. On
getting a hypercall exit to userspace, you are guaranteed that the
GPRs are valid.

> > Of course, it requires that the hypervisor (the RMM in your case)
> > knows about the semantics of the hypercall, but that's obviously
> 
> RMM doesn't care about the semantics of hypercall, other than
> considering it just like an SMCCC compliant call. The hypercall
> arguments/results are passed down/up by the Realm in a separate
> structure.

That's because the RMM doesn't use registers to pass the data. But at
the end of the day, this is the same thing. The host gets the data
from the RMM, stashes it in the GPRs, and exit to userspace.

The important thing here is that GET_ONE_REG is valid in the context
where it matters. If the VMM tries to use it outside of the context of
a hypercall, it gets junk. It's not a bug, it's a feature.

Thanks,

	M.
Suzuki K Poulose Feb. 7, 2023, 12:46 p.m. UTC | #8
On 07/02/2023 11:23, Marc Zyngier wrote:
> On Tue, 07 Feb 2023 09:41:54 +0000,
> Suzuki K Poulose <suzuki.poulose@arm.com> wrote:
>>
>> Hi Marc,
>>
>> On 06/02/2023 12:31, Marc Zyngier wrote:
>>> On Mon, 06 Feb 2023 10:10:41 +0000,
>>> Suzuki K Poulose <suzuki.poulose@arm.com> wrote:
>>>>
>>>> This may not always be possible, e.g., for Realms. GET_ONE_REG is
>>>> not supported. So explicitly passing down the args is
>>>> preferable.
>>>
>>> What is the blocker for CCA to use GET_ONE_REG? The value obviously
>>> exists and is made available to the host. pKVM is perfectly able to
>>> use GET_ONE_REG and gets a bunch of zeroes for things that the
>>> hypervisor has decided to hide from the host.
>>>
>>
>> It is not impossible. On a "HOST CALL" (an explicit call to the Host
>> from the Realm), the GPRs are made available to the host, can be
>> stashed into the vcpu reg state, and the request can be
>> serviced. However, it is a bit odd to make this exception -
>> "GET_ONE_REG is valid now" - while in almost all other cases it is
>> invalid (with the exception of MMIO).
> 
> But that's an RMM decision. If the RMM decides to forward the
> hypercall to the host (irrespective of the potential forwarding to
> userspace), it makes the GPRs available.
> 
> If the hypercall is forwarded to userspace, then the host is
> responsible to check with the RMM that it will be willing to provide
> the required information (passed as GPRs or not).

Just to be clear, on a hypercall, all the arguments are provided to
the host. And it is always possible for the host to sync the vcpu
GPR state with those arguments and make them available via the 
GET_ONE_REG call.

> 
>> Of course we could always return what is stashed in the vcpu state,
>> which may be invalid/0. But given the construct of "the host doesn't
>> have access to the register state", it may be better to make the
>> request always fail, to indicate that the Host is probably doing
>> something wrong, rather than silently passing on incorrect information.
> 
> I disagree. Either you fail at the delegation point, or you don't. On
> getting a hypercall exit to userspace, you are guaranteed that the
> GPRs are valid.

This is possible; as I mentioned below, the question is bug vs feature.

> 
>>> Of course, it requires that the hypervisor (the RMM in your case)
>>> knows about the semantics of the hypercall, but that's obviously
>>
>> RMM doesn't care about the semantics of hypercall, other than
>> considering it just like an SMCCC compliant call. The hypercall
>> arguments/results are passed down/up by the Realm in a separate
>> structure.
> 
> That's because the RMM doesn't use registers to pass the data. But at
> the end of the day, this is the same thing. The host gets the data
> from the RMM, stashes it in the GPRs, and exit to userspace.

True.

> 
> The important thing here is that GET_ONE_REG is valid in the context
> where it matters. If the VMM tries to use it outside of the context of
> a hypercall, it gets junk. It's not a bug, it's a feature.

This is what I was concerned about. As long as "for any exit other
than a hypercall (at least for now), you get junk values when using
GET_ONE_REG for confidential guests" is an acceptable feature, that
should be alright.

Thanks
Suzuki

> 
> Thanks,
> 
> 	M.
>
James Morse Feb. 7, 2023, 5:50 p.m. UTC | #9
Hi Marc,

On 05/02/2023 10:12, Marc Zyngier wrote:
> On Fri, 03 Feb 2023 13:50:40 +0000,
> James Morse <james.morse@arm.com> wrote:
>>
>> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
>>
>> When capability KVM_CAP_ARM_HVC_TO_USER is available, userspace can
>> request to handle all hypercalls that aren't handled by KVM. With the
>> help of another capability, this will allow userspace to handle PSCI
>> calls.

> On top of Oliver's ask not to make this a blanket "steal everything",
> but instead to have an actual request for ranges of forwarded
> hypercalls:
> 
>> Notes on this implementation:
>>
>> * A similar mechanism was proposed for SDEI some time ago [1]. This RFC
>>   generalizes the idea to all hypercalls, since that was suggested on
>>   the list [2, 3].
>>
>> * We're reusing kvm_run.hypercall. I copied x0-x5 into
>>   kvm_run.hypercall.args[] to help userspace but I'm tempted to remove
>>   this, because:
>>   - Most user handlers will need to write results back into the
>>     registers (x0-x3 for SMCCC), so if we keep this shortcut we should
>>     go all the way and read them back on return to kernel.
>>   - QEMU doesn't care about this shortcut, it pulls all vcpu regs before
>>     handling the call.
>>   - SMCCC uses x0-x16 for parameters.
>>   x0 does contain the SMCCC function ID and may be useful for fast
>>   dispatch, we could keep that plus the immediate number.
>>
>> * Add a flag in the kvm_run.hypercall telling whether this is HVC or
>>   SMC?  Can be added later in those bottom longmode and pad fields.

> We definitely need this. A nested hypervisor can (and does) use SMCs
> as the conduit.

Christoffer's comments last time round on this were that EL2 guests get SMC with this,
and EL1 guests get HVC. The VMM could never get both...


> The question is whether they represent two distinct
> namespaces or not. I *think* we can unify them, but someone should
> check and maybe get clarification from the owners of the SMCCC spec.

i.e. the VMM requests 0xC400_0000:0xC400_001F regardless of SMC/HVC?

I don't yet see how a VMM could get HVC out of a virtual-EL2 guest....


>> * On top of this we could share with userspace which HVC ranges are
>>   available and which ones are handled by KVM. That can actually be added
>>   independently, through a vCPU/VM device attribute which doesn't consume
>>   a new ioctl:
>>   - userspace issues HAS_ATTR ioctl on the vcpu fd to query whether this
>>     feature is available.
>>   - userspace queries the number N of HVC ranges using one GET_ATTR.
>>   - userspace passes an array of N ranges using another GET_ATTR. The
>>     array is filled and returned by KVM.

> As mentioned above, I think this interface should go both ways.
> Userspace should request the forwarding of a certain range of
> hypercalls via a similar SET_ATTR interface.

Yup, I'll sync up with Oliver about that.


> Another question is how we migrate VMs that have these forwarding
> requirements. Do we expect the VMM to replay the forwarding as part of
> the setting up on the other side? Or do we save/restore this via a
> firmware pseudo-register?

Pfff. The VMM's problem. Enabling these things means it has its own internal state to
migrate (is this vCPU on or off?); I doubt it needs reminding that the state exists.

That said, Salil is looking at making this work with migration in Qemu.


Thanks,

James
James Morse Feb. 7, 2023, 5:50 p.m. UTC | #10
Hi Oliver,

On 03/02/2023 21:08, Oliver Upton wrote:
> On Fri, Feb 03, 2023 at 01:50:40PM +0000, James Morse wrote:
>> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
>>
>> When capability KVM_CAP_ARM_HVC_TO_USER is available, userspace can
>> request to handle all hypercalls that aren't handled by KVM.

> I would very much prefer we not go down this route. This capability
> effectively constructs an ABI out of what KVM presently does not
> implement. What would happen if KVM decides to implement a new set
> of hypercalls later down the road that were previously forwarded to
> userspace?

The user-space support would never get called. If we have a wild-west allocation of IDs in
this area we have bigger problems. I'd hope in this example it would be a VMM or an
in-kernel implementation of the same feature.

When I floated something like this before for supporting SDEI in guests, Christoffer
didn't like tying KVM to SMCCC - hence the all or nothing.

Since then we've had things like Spectre, which I don't think the VMM should
ever be allowed to handle, which makes the whole thing much murkier.


> Instead of a catch-all I think we should take the approach of having
> userspace explicitly request which hypercalls should be forwarded to
> userspace. I proposed something similar [1], but never got around to
> respinning it (oops).

> Let me dust those patches off and align with Marc's suggestions.
> 
> [1]: https://lore.kernel.org/kvmarm/20221110015327.3389351-1-oliver.upton@linux.dev/

I've no problem with doing it like this. This approach was based on Christoffer's previous
feedback, but the world has changed since then.

Let me know if you want me to re-spin that series - I need to get this into some
shape next week for Salil to look at the Qemu changes, as I can't test the whole thing
until that is done.



Thanks,

James
Marc Zyngier Feb. 8, 2023, 8:40 a.m. UTC | #11
On Tue, 07 Feb 2023 17:50:58 +0000,
James Morse <james.morse@arm.com> wrote:
> 
> Hi Marc,
> 
> On 05/02/2023 10:12, Marc Zyngier wrote:
> > On Fri, 03 Feb 2023 13:50:40 +0000,
> > James Morse <james.morse@arm.com> wrote:
> >>
> >> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> >>
> >> When capability KVM_CAP_ARM_HVC_TO_USER is available, userspace can
> >> request to handle all hypercalls that aren't handled by KVM. With the
> >> help of another capability, this will allow userspace to handle PSCI
> >> calls.
> 
> > On top of Oliver's ask not to make this a blanket "steal everything",
> > but instead to have an actual request for ranges of forwarded
> > hypercalls:
> > 
> >> Notes on this implementation:
> >>
> >> * A similar mechanism was proposed for SDEI some time ago [1]. This RFC
> >>   generalizes the idea to all hypercalls, since that was suggested on
> >>   the list [2, 3].
> >>
> >> * We're reusing kvm_run.hypercall. I copied x0-x5 into
> >>   kvm_run.hypercall.args[] to help userspace but I'm tempted to remove
> >>   this, because:
> >>   - Most user handlers will need to write results back into the
> >>     registers (x0-x3 for SMCCC), so if we keep this shortcut we should
> >>     go all the way and read them back on return to kernel.
> >>   - QEMU doesn't care about this shortcut, it pulls all vcpu regs before
> >>     handling the call.
> >>   - SMCCC uses x0-x16 for parameters.
> >>   x0 does contain the SMCCC function ID and may be useful for fast
> >>   dispatch, we could keep that plus the immediate number.
> >>
> >> * Add a flag in the kvm_run.hypercall telling whether this is HVC or
> >>   SMC?  Can be added later in those bottom longmode and pad fields.
> 
> > We definitely need this. A nested hypervisor can (and does) use SMCs
> > as the conduit.
> 
> Christoffer's comments last time round on this were that EL2 guests
> get SMC with this, and EL1 guests get HVC. The VMM could never get
> both...

I agree with the first half of the statement (EL2 guest using SMC),
but limiting EL1 guests to HVC is annoying. On systems that have a
secure side, it would make sense to be able to route the guest's SMC
calls to userspace and allow it to emulate/proxy/deny such calls.

This would solve the 10-year-old question of "how do we allow a guest
to call into secure services"...

> 
> 
> > The question is whether they represent two distinct
> > namespaces or not. I *think* we can unify them, but someone should
> > check and maybe get clarification from the owners of the SMCCC spec.
> 
> i.e. the VMM requests 0xC400_0000:0xC400_001F regardless of SMC/HVC?
> 
> I don't yet see how a VMM could get HVC out of a virtual-EL2 guest....

My statement was badly formulated, and I conflated the need for SMC in
EL2 guests with the (separate) need to handle SMC for EL1 guests.

>
> 
> >> * On top of this we could share with userspace which HVC ranges are
> >>   available and which ones are handled by KVM. That can actually be added
> >>   independently, through a vCPU/VM device attribute which doesn't consume
> >>   a new ioctl:
> >>   - userspace issues HAS_ATTR ioctl on the vcpu fd to query whether this
> >>     feature is available.
> >>   - userspace queries the number N of HVC ranges using one GET_ATTR.
> >>   - userspace passes an array of N ranges using another GET_ATTR. The
> >>     array is filled and returned by KVM.
> 
> > As mentioned above, I think this interface should go both ways.
> > Userspace should request the forwarding of a certain range of
> > hypercalls via a similar SET_ATTR interface.
> 
> Yup, I'll sync up with Oliver about that.
> 
> 
> > Another question is how we migrate VMs that have these forwarding
> > requirements. Do we expect the VMM to replay the forwarding as part of
> > the setting up on the other side? Or do we save/restore this via a
> > firmware pseudo-register?
> 
> Pfff. The VMM's problem. Enabling these things means it has its own
> internal state to migrate (is this vCPU on or off?); I doubt it
> needs reminding that the state exists.

I'm perfectly OK with the VMM being in the driving seat here and that
it'd have to replay its own state. But it needs some level of
documentation.

> That said, Salil is looking at making this work with migration in Qemu.

Yup, that'd be needed.

Thanks,

	M.
Marc Zyngier Feb. 8, 2023, 9:02 a.m. UTC | #12
On Tue, 07 Feb 2023 17:50:59 +0000,
James Morse <james.morse@arm.com> wrote:
> 
> Hi Oliver,
> 
> On 03/02/2023 21:08, Oliver Upton wrote:
> > On Fri, Feb 03, 2023 at 01:50:40PM +0000, James Morse wrote:
> >> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> >>
> >> When capability KVM_CAP_ARM_HVC_TO_USER is available, userspace can
> >> request to handle all hypercalls that aren't handled by KVM.
> 
> > I would very much prefer we not go down this route. This capability
> > effectively constructs an ABI out of what KVM presently does not
> > implement. What would happen if KVM decides to implement a new set
> > of hypercalls later down the road that were previously forwarded to
> > userspace?
> 
> The user-space support would never get called. If we have a
> wild-west allocation of IDs in this area we have bigger
> problems. I'd hope in this example it would be a VMM or an in-kernel
> implementation of the same feature.
> 
> When I floated something like this before for supporting SDEI in
> guests, Christoffer didn't like tying KVM to SMCCC - hence the
> all or nothing.
> 
> Since then we've had things like Spectre, which I don't think the
> VMM should ever be allowed to handle, which makes the whole thing
> much murkier.

That ship has sailed a long time ago. We also have grown a bunch of
in-kernel SMCCC services that are KVM specific (the silly PTP stuff,
for example, not to mention all the pKVM hypercalls...).

It is also likely that these ranges will grow over time (it has been a
long time since the last drop of Spectre-like crap, and something must
be brewing somewhere), so a level of discrimination is important.

	M.
Marc Zyngier Feb. 8, 2023, 2:25 p.m. UTC | #13
On Wed, 08 Feb 2023 08:40:09 +0000,
Marc Zyngier <maz@kernel.org> wrote:
> 
> On Tue, 07 Feb 2023 17:50:58 +0000,
> James Morse <james.morse@arm.com> wrote:
> > 
> > Hi Marc,
> > 
> > On 05/02/2023 10:12, Marc Zyngier wrote:
> > > On Fri, 03 Feb 2023 13:50:40 +0000,
> > > James Morse <james.morse@arm.com> wrote:
> > >>
> > >> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > >>
> > >> When capability KVM_CAP_ARM_HVC_TO_USER is available, userspace can
> > >> request to handle all hypercalls that aren't handled by KVM. With the
> > >> help of another capability, this will allow userspace to handle PSCI
> > >> calls.
> > 
> > > On top of Oliver's ask not to make this a blanket "steal everything",
> > > but instead to have an actual request for ranges of forwarded
> > > hypercalls:
> > > 
> > >> Notes on this implementation:
> > >>
> > >> * A similar mechanism was proposed for SDEI some time ago [1]. This RFC
> > >>   generalizes the idea to all hypercalls, since that was suggested on
> > >>   the list [2, 3].
> > >>
> > >> * We're reusing kvm_run.hypercall. I copied x0-x5 into
> > >>   kvm_run.hypercall.args[] to help userspace but I'm tempted to remove
> > >>   this, because:
> > >>   - Most user handlers will need to write results back into the
> > >>     registers (x0-x3 for SMCCC), so if we keep this shortcut we should
> > >>     go all the way and read them back on return to kernel.
> > >>   - QEMU doesn't care about this shortcut, it pulls all vcpu regs before
> > >>     handling the call.
> > >>   - SMCCC uses x0-x16 for parameters.
> > >>   x0 does contain the SMCCC function ID and may be useful for fast
> > >>   dispatch, we could keep that plus the immediate number.
> > >>
> > >> * Add a flag in the kvm_run.hypercall telling whether this is HVC or
> > >>   SMC?  Can be added later in those bottom longmode and pad fields.
> > 
> > > We definitely need this. A nested hypervisor can (and does) use SMCs
> > > as the conduit.
> > 
> > Christoffer's comments last time round on this were that EL2 guests
> > get SMC with this, and EL1 guests get HVC. The VMM could never get
> > both...
> 
> I agree with the first half of the statement (EL2 guest using SMC),
> but limiting EL1 guests to HVC is annoying. On systems that have a
> secure side, it would make sense to be able to route the guest's SMC
> calls to userspace and allow it to emulate/proxy/deny such calls.

You also want to look at the TRNG firmware spec (aka DEN0098), which
explicitly calls out for the use of SMC when EL2 and EL3 are
implemented (see 1.5 "Invocation considerations").

Is it mad? Yes. But madness seems to be the direction of travel these
days.

	M.
Oliver Upton Feb. 11, 2023, 1:44 a.m. UTC | #14
Hey James,

On Tue, Feb 07, 2023 at 05:50:58PM +0000, James Morse wrote:

[...]

> > As mentioned above, I think this interface should go both ways.
> > Userspace should request the forwarding of a certain range of
> > hypercalls via a similar SET_ATTR interface.
> 
> Yup, I'll sync up with Oliver about that.

Sorry, I was tied up with some unrelated stuff at work this week. I
polished up what I had and sent out an RFC v2 for SMCCC filtering [1].
Mentioning it here because my email service gives up if I Cc too many
recipients, so I didn't include everyone on this series.

[1]: https://lore.kernel.org/linux-arm-kernel/20230211013759.3556016-1-oliver.upton@linux.dev/

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index deb494f759ed..9a28a9cc1163 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6122,8 +6122,12 @@  to the byte array.
 			__u32 pad;
 		} hypercall;
 
-Unused.  This was once used for 'hypercall to userspace'.  To implement
-such functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO (all except s390).
+On x86 this was once used for 'hypercall to userspace'.  To implement such
+functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO (all except s390).
+
+On arm64 it is used for hypercalls, when the KVM_CAP_ARM_HVC_TO_USER capability
+is enabled. 'nr' contains the HVC or SMC immediate. 'args' contains registers
+x0 - x5. The other parameters are unused.
 
 .. note:: KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO.
 
@@ -8276,6 +8280,15 @@  structure.
 When getting the Modified Change Topology Report value, the attr->addr
 must point to a byte where the value will be stored or retrieved from.
 
+8.37 KVM_CAP_ARM_HVC_TO_USER
+----------------------------
+
+:Architecture: arm64
+
+This capability indicates that KVM can pass unhandled hypercalls to userspace,
+if the VMM enables it. Hypercalls are passed with KVM_EXIT_HYPERCALL in
+kvm_run::hypercall.
+
 9. Known KVM API problems
 =========================
 
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 35a159d131b5..40911ebfa710 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -213,6 +213,7 @@  struct kvm_arch {
 #define KVM_ARCH_FLAG_EL1_32BIT				4
 	/* PSCI SYSTEM_SUSPEND enabled for the guest */
 #define KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED		5
+#define KVM_ARCH_FLAG_HVC_TO_USER			6
 
 	unsigned long flags;
 
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 9c5573bc4614..815b7e8f88e1 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -101,6 +101,10 @@  int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		r = 0;
 		set_bit(KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED, &kvm->arch.flags);
 		break;
+	case KVM_CAP_ARM_HVC_TO_USER:
+		r = 0;
+		set_bit(KVM_ARCH_FLAG_HVC_TO_USER, &kvm->arch.flags);
+		break;
 	default:
 		r = -EINVAL;
 		break;
@@ -230,6 +234,7 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VCPU_ATTRIBUTES:
 	case KVM_CAP_PTP_KVM:
 	case KVM_CAP_ARM_SYSTEM_SUSPEND:
+	case KVM_CAP_ARM_HVC_TO_USER:
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
index c9f401fa01a9..efaf05d40dab 100644
--- a/arch/arm64/kvm/hypercalls.c
+++ b/arch/arm64/kvm/hypercalls.c
@@ -121,6 +121,28 @@  static bool kvm_hvc_call_allowed(struct kvm_vcpu *vcpu, u32 func_id)
 	}
 }
 
+static int kvm_hvc_user(struct kvm_vcpu *vcpu)
+{
+	int i;
+	struct kvm_run *run = vcpu->run;
+
+	if (!test_bit(KVM_ARCH_FLAG_HVC_TO_USER, &vcpu->kvm->arch.flags)) {
+		smccc_set_retval(vcpu, SMCCC_RET_NOT_SUPPORTED, 0, 0, 0);
+		return 1;
+	}
+
+	run->exit_reason = KVM_EXIT_HYPERCALL;
+	run->hypercall.nr = kvm_vcpu_hvc_get_imm(vcpu);
+	/* Add the first parameters for fast access. */
+	for (i = 0; i < 6; i++)
+		run->hypercall.args[i] = vcpu_get_reg(vcpu, i);
+	run->hypercall.ret = 0;
+	run->hypercall.longmode = 0;
+	run->hypercall.pad = 0;
+
+	return 0;
+}
+
 int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
 {
 	struct kvm_smccc_features *smccc_feat = &vcpu->kvm->arch.smccc_feat;
@@ -219,8 +241,12 @@  int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
 	case ARM_SMCCC_TRNG_RND32:
 	case ARM_SMCCC_TRNG_RND64:
 		return kvm_trng_call(vcpu);
-	default:
+	case KVM_PSCI_FN_BASE...KVM_PSCI_FN_LAST:
+	case PSCI_0_2_FN_BASE...PSCI_0_2_FN_LAST:
+	case PSCI_0_2_FN64_BASE...PSCI_0_2_FN64_LAST:
 		return kvm_psci_call(vcpu);
+	default:
+		return kvm_hvc_user(vcpu);
 	}
 
 out:
diff --git a/include/kvm/arm_psci.h b/include/kvm/arm_psci.h
index 6e55b9283789..7391fa51419a 100644
--- a/include/kvm/arm_psci.h
+++ b/include/kvm/arm_psci.h
@@ -17,6 +17,10 @@ 
 
 #define KVM_ARM_PSCI_LATEST	KVM_ARM_PSCI_1_1
 
+#define KVM_PSCI_FN_LAST	KVM_PSCI_FN(3)
+#define PSCI_0_2_FN_LAST	PSCI_0_2_FN(0x3f)
+#define PSCI_0_2_FN64_LAST	PSCI_0_2_FN64(0x3f)
+
 static inline int kvm_psci_version(struct kvm_vcpu *vcpu)
 {
 	/*
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 55155e262646..2ead8b9aae56 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1175,6 +1175,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
 #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
 #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
+#define KVM_CAP_ARM_HVC_TO_USER 226
 
 #ifdef KVM_CAP_IRQ_ROUTING