[0/9] arm64: Stolen time support

Message ID	20190802145017.42543-1-steven.price@arm.com (mailing list archive)
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> From: Steven Price <steven.price@arm.com> To: Subject: [PATCH 0/9] arm64: Stolen time support Date: Fri, 2 Aug 2019 15:50:08 +0100 Message-Id: <20190802145017.42543-1-steven.price@arm.com> MIME-Version: 1.0 Precedence: list Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, =?utf-8?b?UmFkaW0g?= =?utf-8?b?S3LEjW3DocWZ?= <rkrcmar@redhat.com>, Catalin Marinas <catalin.marinas@arm.com>, Suzuki K Pouloze <suzuki.poulose@arm.com>, linux-doc@vger.kernel.org, Russell King <linux@armlinux.org.uk>, Steven Price <steven.price@arm.com>, James Morse <james.morse@arm.com>, linux-arm-kernel@lists.infradead.org, Marc Zyngier <maz@kernel.org>, Paolo Bonzini <pbonzini@redhat.com>, Will Deacon <will@kernel.org>, kvmarm@lists.cs.columbia.edu, Julien Thierry <julien.thierry.kdev@gmail.com> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org
Series	arm64: Stolen time support \| expand [0/9] arm64: Stolen time support [1/9] KVM: arm64: Document PV-time interface [2/9] KVM: arm/arm64: Factor out hypercall handling from PSCI code [3/9] KVM: arm64: Implement PV_FEATURES call [4/9] KVM: arm64: Support stolen time reporting via shared structure [5/9] KVM: Allow kvm_device_ops to be const [6/9] KVM: arm64: Provide a PV_TIME device to user space [7/9] arm/arm64: Provide a wrapper for SMCCC 1.1 calls [8/9] arm/arm64: Make use of the SMCCC 1.1 wrapper [9/9] arm64: Retrieve stolen time as paravirtualized guest

Steven Price Aug. 2, 2019, 2:50 p.m. UTC

This series add support for paravirtualized time for arm64 guests and
KVM hosts following the specification in Arm's document DEN 0057A:

https://developer.arm.com/docs/den0057/a

It implements support for stolen time, allowing the guest to
identify time when it is forcibly not executing.

It doesn't implement support for Live Physical Time (LPT) as there are
some concerns about the overheads and approach in the above
specification, and I expect an updated version of the specification to
be released soon with just the stolen time parts.

I previously posted a series including LPT (as well as stolen time):
https://lore.kernel.org/kvmarm/20181212150226.38051-1-steven.price@arm.com/

Patches 2, 5, 7 and 8 are cleanup patches and could be taken separately.

Christoffer Dall (1):
  KVM: arm/arm64: Factor out hypercall handling from PSCI code

Steven Price (8):
  KVM: arm64: Document PV-time interface
  KVM: arm64: Implement PV_FEATURES call
  KVM: arm64: Support stolen time reporting via shared structure
  KVM: Allow kvm_device_ops to be const
  KVM: arm64: Provide a PV_TIME device to user space
  arm/arm64: Provide a wrapper for SMCCC 1.1 calls
  arm/arm64: Make use of the SMCCC 1.1 wrapper
  arm64: Retrieve stolen time as paravirtualized guest

 Documentation/virtual/kvm/arm/pvtime.txt | 107 +++++++++++++
 arch/arm/kvm/Makefile                    |   2 +-
 arch/arm/kvm/handle_exit.c               |   2 +-
 arch/arm/mm/proc-v7-bugs.c               |  13 +-
 arch/arm64/include/asm/kvm_host.h        |  13 +-
 arch/arm64/include/asm/kvm_mmu.h         |   2 +
 arch/arm64/include/asm/pvclock-abi.h     |  20 +++
 arch/arm64/include/uapi/asm/kvm.h        |   6 +
 arch/arm64/kernel/Makefile               |   1 +
 arch/arm64/kernel/cpu_errata.c           |  80 ++++------
 arch/arm64/kernel/kvm.c                  | 155 ++++++++++++++++++
 arch/arm64/kvm/Kconfig                   |   1 +
 arch/arm64/kvm/Makefile                  |   2 +
 arch/arm64/kvm/handle_exit.c             |   4 +-
 include/kvm/arm_hypercalls.h             |  44 ++++++
 include/kvm/arm_psci.h                   |   2 +-
 include/linux/arm-smccc.h                |  58 +++++++
 include/linux/cpuhotplug.h               |   1 +
 include/linux/kvm_host.h                 |   4 +-
 include/linux/kvm_types.h                |   2 +
 include/uapi/linux/kvm.h                 |   2 +
 virt/kvm/arm/arm.c                       |  18 +++
 virt/kvm/arm/hypercalls.c                | 138 ++++++++++++++++
 virt/kvm/arm/mmu.c                       |  44 ++++++
 virt/kvm/arm/psci.c                      |  84 +---------
 virt/kvm/arm/pvtime.c                    | 190 +++++++++++++++++++++++
 virt/kvm/kvm_main.c                      |   6 +-
 27 files changed, 848 insertions(+), 153 deletions(-)
 create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
 create mode 100644 arch/arm64/include/asm/pvclock-abi.h
 create mode 100644 arch/arm64/kernel/kvm.c
 create mode 100644 include/kvm/arm_hypercalls.h
 create mode 100644 virt/kvm/arm/hypercalls.c
 create mode 100644 virt/kvm/arm/pvtime.c

Marc Zyngier Aug. 3, 2019, 6:05 p.m. UTC | #1

On Fri,  2 Aug 2019 15:50:08 +0100
Steven Price <steven.price@arm.com> wrote:

Hi Steven,

> This series add support for paravirtualized time for arm64 guests and
> KVM hosts following the specification in Arm's document DEN 0057A:
> 
> https://developer.arm.com/docs/den0057/a
> 
> It implements support for stolen time, allowing the guest to
> identify time when it is forcibly not executing.
> 
> It doesn't implement support for Live Physical Time (LPT) as there are
> some concerns about the overheads and approach in the above
> specification, and I expect an updated version of the specification to
> be released soon with just the stolen time parts.

Thanks for posting this.

My current concern with this series is around the fact that we allocate
memory from the kernel on behalf of the guest. It is the first example
of such thing in the ARM port, and I can't really say I'm fond of it.

x86 seems to get away with it by having the memory allocated from
userspace, why I tend to like more. Yes, put_user is more
expensive than a straight store, but this isn't done too often either.

What is the rational for your current approach?

Thanks,

	M.

Steven Price Aug. 5, 2019, 1:06 p.m. UTC | #2

On 03/08/2019 19:05, Marc Zyngier wrote:
> On Fri,  2 Aug 2019 15:50:08 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
> Hi Steven,
> 
>> This series add support for paravirtualized time for arm64 guests and
>> KVM hosts following the specification in Arm's document DEN 0057A:
>>
>> https://developer.arm.com/docs/den0057/a
>>
>> It implements support for stolen time, allowing the guest to
>> identify time when it is forcibly not executing.
>>
>> It doesn't implement support for Live Physical Time (LPT) as there are
>> some concerns about the overheads and approach in the above
>> specification, and I expect an updated version of the specification to
>> be released soon with just the stolen time parts.
> 
> Thanks for posting this.
> 
> My current concern with this series is around the fact that we allocate
> memory from the kernel on behalf of the guest. It is the first example
> of such thing in the ARM port, and I can't really say I'm fond of it.
> 
> x86 seems to get away with it by having the memory allocated from
> userspace, why I tend to like more. Yes, put_user is more
> expensive than a straight store, but this isn't done too often either.
> 
> What is the rational for your current approach?

As I see it there are 3 approaches that can be taken here:

1. Hypervisor allocates memory and adds it to the virtual machine. This
means that everything to do with the 'device' is encapsulated behind the
KVM_CREATE_DEVICE / KVM_[GS]ET_DEVICE_ATTR ioctls. But since we want the
stolen time structure to be fast it cannot be a trapping region and has
to be backed by real memory - in this case allocated by the host kernel.

2. Host user space allocates memory. Similar to above, but this time
user space needs to manage the memory region as well as the usual
KVM_CREATE_DEVICE dance. I've no objection to this, but it means
kvmtool/QEMU needs to be much more aware of what is going on (e.g. how
to size the memory region).

3. Guest kernel "donates" the memory to the hypervisor for the
structure. As far as I'm aware this is what x86 does. The problems I see
this approach are:

 a) kexec becomes much more tricky - there needs to be a disabling
mechanism for the guest to stop the hypervisor scribbling on memory
before starting the new kernel.

 b) If there is more than one entity that is interested in the
information (e.g. firmware and kernel) then this requires some form of
arbitration in the guest because the hypervisor doesn't want to have to
track an arbitrary number of regions to update.

 c) Performance can suffer if the host kernel doesn't have a suitably
aligned/sized area to use. As you say - put_user() is more expensive.
The structure is updated on every return to the VM.

Of course x86 does prove the third approach can work, but I'm not sure
which is actually better. Avoid the kexec cancellation requirements was
the main driver of the current approach. Although many of the
conversations about this were also tied up with Live Physical Time which
adds its own complications.

Steve

Marc Zyngier Aug. 5, 2019, 1:26 p.m. UTC | #3

On 05/08/2019 14:06, Steven Price wrote:
> On 03/08/2019 19:05, Marc Zyngier wrote:
>> On Fri,  2 Aug 2019 15:50:08 +0100
>> Steven Price <steven.price@arm.com> wrote:
>>
>> Hi Steven,
>>
>>> This series add support for paravirtualized time for arm64 guests and
>>> KVM hosts following the specification in Arm's document DEN 0057A:
>>>
>>> https://developer.arm.com/docs/den0057/a
>>>
>>> It implements support for stolen time, allowing the guest to
>>> identify time when it is forcibly not executing.
>>>
>>> It doesn't implement support for Live Physical Time (LPT) as there are
>>> some concerns about the overheads and approach in the above
>>> specification, and I expect an updated version of the specification to
>>> be released soon with just the stolen time parts.
>>
>> Thanks for posting this.
>>
>> My current concern with this series is around the fact that we allocate
>> memory from the kernel on behalf of the guest. It is the first example
>> of such thing in the ARM port, and I can't really say I'm fond of it.
>>
>> x86 seems to get away with it by having the memory allocated from
>> userspace, why I tend to like more. Yes, put_user is more
>> expensive than a straight store, but this isn't done too often either.
>>
>> What is the rational for your current approach?
> 
> As I see it there are 3 approaches that can be taken here:
> 
> 1. Hypervisor allocates memory and adds it to the virtual machine. This
> means that everything to do with the 'device' is encapsulated behind the
> KVM_CREATE_DEVICE / KVM_[GS]ET_DEVICE_ATTR ioctls. But since we want the
> stolen time structure to be fast it cannot be a trapping region and has
> to be backed by real memory - in this case allocated by the host kernel.
> 
> 2. Host user space allocates memory. Similar to above, but this time
> user space needs to manage the memory region as well as the usual
> KVM_CREATE_DEVICE dance. I've no objection to this, but it means
> kvmtool/QEMU needs to be much more aware of what is going on (e.g. how
> to size the memory region).
> 
> 3. Guest kernel "donates" the memory to the hypervisor for the
> structure. As far as I'm aware this is what x86 does. The problems I see
> this approach are:
> 
>  a) kexec becomes much more tricky - there needs to be a disabling
> mechanism for the guest to stop the hypervisor scribbling on memory
> before starting the new kernel.
> 
>  b) If there is more than one entity that is interested in the
> information (e.g. firmware and kernel) then this requires some form of
> arbitration in the guest because the hypervisor doesn't want to have to
> track an arbitrary number of regions to update.
> 
>  c) Performance can suffer if the host kernel doesn't have a suitably
> aligned/sized area to use. As you say - put_user() is more expensive.
> The structure is updated on every return to the VM.
> 
> 
> Of course x86 does prove the third approach can work, but I'm not sure
> which is actually better. Avoid the kexec cancellation requirements was
> the main driver of the current approach. Although many of the
> conversations about this were also tied up with Live Physical Time which
> adds its own complications.

My current train of thoughts is around (2):

- We don't need a new mechanism to track pages or deal with overlapping
IPA ranges
- We can get rid of the save/restore interface

The drawback is that the amount of memory required per vcpu becomes ABI.
I don't think that's a huge deal, as the hypervisor has the same
contract with the guest.

We also take a small hit with put_user(), but this is only done as a
consequence of vcpu_load() (and not on every entry as you suggest
above). It'd be worth quantifying this overhead before making any
decision one way or another.

Thanks,

	M.

Alexander Graf Aug. 14, 2019, 1:02 p.m. UTC | #4

On 05.08.19 15:06, Steven Price wrote:
> On 03/08/2019 19:05, Marc Zyngier wrote:
>> On Fri,  2 Aug 2019 15:50:08 +0100
>> Steven Price <steven.price@arm.com> wrote:
>>
>> Hi Steven,
>>
>>> This series add support for paravirtualized time for arm64 guests and
>>> KVM hosts following the specification in Arm's document DEN 0057A:
>>>
>>> https://developer.arm.com/docs/den0057/a
>>>
>>> It implements support for stolen time, allowing the guest to
>>> identify time when it is forcibly not executing.
>>>
>>> It doesn't implement support for Live Physical Time (LPT) as there are
>>> some concerns about the overheads and approach in the above
>>> specification, and I expect an updated version of the specification to
>>> be released soon with just the stolen time parts.
>>
>> Thanks for posting this.
>>
>> My current concern with this series is around the fact that we allocate
>> memory from the kernel on behalf of the guest. It is the first example
>> of such thing in the ARM port, and I can't really say I'm fond of it.
>>
>> x86 seems to get away with it by having the memory allocated from
>> userspace, why I tend to like more. Yes, put_user is more
>> expensive than a straight store, but this isn't done too often either.
>>
>> What is the rational for your current approach?
> 
> As I see it there are 3 approaches that can be taken here:
> 
> 1. Hypervisor allocates memory and adds it to the virtual machine. This
> means that everything to do with the 'device' is encapsulated behind the
> KVM_CREATE_DEVICE / KVM_[GS]ET_DEVICE_ATTR ioctls. But since we want the
> stolen time structure to be fast it cannot be a trapping region and has
> to be backed by real memory - in this case allocated by the host kernel.
> 
> 2. Host user space allocates memory. Similar to above, but this time
> user space needs to manage the memory region as well as the usual
> KVM_CREATE_DEVICE dance. I've no objection to this, but it means
> kvmtool/QEMU needs to be much more aware of what is going on (e.g. how
> to size the memory region).

You ideally want to get the host overhead for a VM to as little as you 
can. I'm not terribly fond of the idea of reserving a full page just 
because we're too afraid of having the guest donate memory.

> 
> 3. Guest kernel "donates" the memory to the hypervisor for the
> structure. As far as I'm aware this is what x86 does. The problems I see
> this approach are:
> 
>   a) kexec becomes much more tricky - there needs to be a disabling
> mechanism for the guest to stop the hypervisor scribbling on memory
> before starting the new kernel.

I wouldn't call "quiesce a device" much more tricky. We have to do that 
for other devices as well today.

>   b) If there is more than one entity that is interested in the
> information (e.g. firmware and kernel) then this requires some form of
> arbitration in the guest because the hypervisor doesn't want to have to
> track an arbitrary number of regions to update.

Why would FW care?

>   c) Performance can suffer if the host kernel doesn't have a suitably
> aligned/sized area to use. As you say - put_user() is more expensive.

Just define the interface to always require natural alignment when 
donating a memory location?

> The structure is updated on every return to the VM.

If you really do suffer from put_user(), there are alternatives. You 
could just map the page on the registration hcall and then leave it 
pinned until the vcpu gets destroyed again.

> Of course x86 does prove the third approach can work, but I'm not sure
> which is actually better. Avoid the kexec cancellation requirements was
> the main driver of the current approach. Although many of the

I really don't understand the problem with kexec cancellation. Worst 
case, let guest FW set it up for you and propagate only the address down 
via ACPI/DT. That way you can mark the respective memory as reserved too.

But even with a Linux only mechanism, just take a look at 
arch/x86/kernel/kvmclock.c. All they do to remove the map is to hook 
into machine_crash_shutdown() and machine_shutdown().


Alex

> conversations about this were also tied up with Live Physical Time which
> adds its own complications.
> 
> Steve
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
>

Marc Zyngier Aug. 14, 2019, 2:19 p.m. UTC | #5

On Wed, 14 Aug 2019 14:02:25 +0100,
Alexander Graf <graf@amazon.com> wrote:
> 
> 
> 
> On 05.08.19 15:06, Steven Price wrote:
> > On 03/08/2019 19:05, Marc Zyngier wrote:
> >> On Fri,  2 Aug 2019 15:50:08 +0100
> >> Steven Price <steven.price@arm.com> wrote:
> >> 
> >> Hi Steven,
> >> 
> >>> This series add support for paravirtualized time for arm64 guests and
> >>> KVM hosts following the specification in Arm's document DEN 0057A:
> >>> 
> >>> https://developer.arm.com/docs/den0057/a
> >>> 
> >>> It implements support for stolen time, allowing the guest to
> >>> identify time when it is forcibly not executing.
> >>> 
> >>> It doesn't implement support for Live Physical Time (LPT) as there are
> >>> some concerns about the overheads and approach in the above
> >>> specification, and I expect an updated version of the specification to
> >>> be released soon with just the stolen time parts.
> >> 
> >> Thanks for posting this.
> >> 
> >> My current concern with this series is around the fact that we allocate
> >> memory from the kernel on behalf of the guest. It is the first example
> >> of such thing in the ARM port, and I can't really say I'm fond of it.
> >> 
> >> x86 seems to get away with it by having the memory allocated from
> >> userspace, why I tend to like more. Yes, put_user is more
> >> expensive than a straight store, but this isn't done too often either.
> >> 
> >> What is the rational for your current approach?
> > 
> > As I see it there are 3 approaches that can be taken here:
> > 
> > 1. Hypervisor allocates memory and adds it to the virtual machine. This
> > means that everything to do with the 'device' is encapsulated behind the
> > KVM_CREATE_DEVICE / KVM_[GS]ET_DEVICE_ATTR ioctls. But since we want the
> > stolen time structure to be fast it cannot be a trapping region and has
> > to be backed by real memory - in this case allocated by the host kernel.
> > 
> > 2. Host user space allocates memory. Similar to above, but this time
> > user space needs to manage the memory region as well as the usual
> > KVM_CREATE_DEVICE dance. I've no objection to this, but it means
> > kvmtool/QEMU needs to be much more aware of what is going on (e.g. how
> > to size the memory region).
> 
> You ideally want to get the host overhead for a VM to as little as you
> can. I'm not terribly fond of the idea of reserving a full page just
> because we're too afraid of having the guest donate memory.

Well, reduce the amount of memory you give to the guest by one page,
and allocate that page to the stolen time device. Problem solved!

Seriously, if you're worried about the allocation of a single page,
you should first look at how many holes we have in the vcpu structure,
for example (even better, with the 8.4 NV patches applied). Just
fixing that would give you that page back *per vcpu*.

> > 3. Guest kernel "donates" the memory to the hypervisor for the
> > structure. As far as I'm aware this is what x86 does. The problems I see
> > this approach are:
> > 
> >   a) kexec becomes much more tricky - there needs to be a disabling
> > mechanism for the guest to stop the hypervisor scribbling on memory
> > before starting the new kernel.
> 
> I wouldn't call "quiesce a device" much more tricky. We have to do
> that for other devices as well today.

And since there is no standard way of doing it, we keep inventing
weird and wonderful ways of doing so -- cue the terrible GICv3 LPI
situation, and all the various hacks to keep existing IOMMU mappings
around across firmware/kernel handovers as well as kexec.

> 
> >   b) If there is more than one entity that is interested in the
> > information (e.g. firmware and kernel) then this requires some form of
> > arbitration in the guest because the hypervisor doesn't want to have to
> > track an arbitrary number of regions to update.
> 
> Why would FW care?

Exactly. It doesn't care. Not caring means it doesn't know about the
page the guest has allocated for stolen time, and starts using it for
its own purposes. Hello, memory corruption. Same thing goes if you
reboot into a non stolen time aware kernel.

> 
> >   c) Performance can suffer if the host kernel doesn't have a suitably
> > aligned/sized area to use. As you say - put_user() is more expensive.
> 
> Just define the interface to always require natural alignment when
> donating a memory location?
> 
> > The structure is updated on every return to the VM.
> 
> If you really do suffer from put_user(), there are alternatives. You
> could just map the page on the registration hcall and then leave it
> pinned until the vcpu gets destroyed again.

put_user() should be cheap enough. It is one of the things we tend to
optimise anyway. And yes, worse case, we pin the page.

> 
> > Of course x86 does prove the third approach can work, but I'm not sure
> > which is actually better. Avoid the kexec cancellation requirements was
> > the main driver of the current approach. Although many of the
> 
> I really don't understand the problem with kexec cancellation. Worst
> case, let guest FW set it up for you and propagate only the address
> down via ACPI/DT. That way you can mark the respective memory as
> reserved too.

We already went down that road with the LPI hack. I'm not going there
again if we can avoid it. And it turn out that we can. Just allocate
the stolen time page as a separate memblock, give it to KVM for that
purpose.

Your suggestion of letting the guest firmware set something up only
works if whatever you're booting after that understands it. If it
doesn't, you're screwed.

> But even with a Linux only mechanism, just take a look at
> arch/x86/kernel/kvmclock.c. All they do to remove the map is to hook
> into machine_crash_shutdown() and machine_shutdown().

I'm not going to take something that is Linux specific. It has to work
for all guests, at all times, whether they know about the hypervisor
service or not.

	M.

> 
> 
> Alex
> 
> > conversations about this were also tied up with Live Physical Time which
> > adds its own complications.
> > 
> > Steve
> > _______________________________________________
> > kvmarm mailing list
> > kvmarm@lists.cs.columbia.edu
> > https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
> >

Alexander Graf Aug. 14, 2019, 2:52 p.m. UTC | #6

On 14.08.19 16:19, Marc Zyngier wrote:
> On Wed, 14 Aug 2019 14:02:25 +0100,
> Alexander Graf <graf@amazon.com> wrote:
>>
>>
>>
>> On 05.08.19 15:06, Steven Price wrote:
>>> On 03/08/2019 19:05, Marc Zyngier wrote:
>>>> On Fri,  2 Aug 2019 15:50:08 +0100
>>>> Steven Price <steven.price@arm.com> wrote:
>>>>
>>>> Hi Steven,
>>>>
>>>>> This series add support for paravirtualized time for arm64 guests and
>>>>> KVM hosts following the specification in Arm's document DEN 0057A:
>>>>>
>>>>> https://developer.arm.com/docs/den0057/a
>>>>>
>>>>> It implements support for stolen time, allowing the guest to
>>>>> identify time when it is forcibly not executing.
>>>>>
>>>>> It doesn't implement support for Live Physical Time (LPT) as there are
>>>>> some concerns about the overheads and approach in the above
>>>>> specification, and I expect an updated version of the specification to
>>>>> be released soon with just the stolen time parts.
>>>>
>>>> Thanks for posting this.
>>>>
>>>> My current concern with this series is around the fact that we allocate
>>>> memory from the kernel on behalf of the guest. It is the first example
>>>> of such thing in the ARM port, and I can't really say I'm fond of it.
>>>>
>>>> x86 seems to get away with it by having the memory allocated from
>>>> userspace, why I tend to like more. Yes, put_user is more
>>>> expensive than a straight store, but this isn't done too often either.
>>>>
>>>> What is the rational for your current approach?
>>>
>>> As I see it there are 3 approaches that can be taken here:
>>>
>>> 1. Hypervisor allocates memory and adds it to the virtual machine. This
>>> means that everything to do with the 'device' is encapsulated behind the
>>> KVM_CREATE_DEVICE / KVM_[GS]ET_DEVICE_ATTR ioctls. But since we want the
>>> stolen time structure to be fast it cannot be a trapping region and has
>>> to be backed by real memory - in this case allocated by the host kernel.
>>>
>>> 2. Host user space allocates memory. Similar to above, but this time
>>> user space needs to manage the memory region as well as the usual
>>> KVM_CREATE_DEVICE dance. I've no objection to this, but it means
>>> kvmtool/QEMU needs to be much more aware of what is going on (e.g. how
>>> to size the memory region).
>>
>> You ideally want to get the host overhead for a VM to as little as you
>> can. I'm not terribly fond of the idea of reserving a full page just
>> because we're too afraid of having the guest donate memory.
> 
> Well, reduce the amount of memory you give to the guest by one page,
> and allocate that page to the stolen time device. Problem solved!
> 
> Seriously, if you're worried about the allocation of a single page,
> you should first look at how many holes we have in the vcpu structure,
> for example (even better, with the 8.4 NV patches applied). Just
> fixing that would give you that page back *per vcpu*.

I'm worried about additional memory slots, about fragmenting the 
cachable guest memory regions, about avoidable HV taxes.

I think we need to distinguish here between the KVM implementation and 
the hypervisor/guest interface. Just because in KVM we can save overhead 
today doesn't mean that the HV interface should be built around the 
assumption that "memory is free".

> 
>>> 3. Guest kernel "donates" the memory to the hypervisor for the
>>> structure. As far as I'm aware this is what x86 does. The problems I see
>>> this approach are:
>>>
>>>    a) kexec becomes much more tricky - there needs to be a disabling
>>> mechanism for the guest to stop the hypervisor scribbling on memory
>>> before starting the new kernel.
>>
>> I wouldn't call "quiesce a device" much more tricky. We have to do
>> that for other devices as well today.
> 
> And since there is no standard way of doing it, we keep inventing
> weird and wonderful ways of doing so -- cue the terrible GICv3 LPI
> situation, and all the various hacks to keep existing IOMMU mappings
> around across firmware/kernel handovers as well as kexec.

Well, the good news here is that we don't have to keep it around ;).

> 
>>
>>>    b) If there is more than one entity that is interested in the
>>> information (e.g. firmware and kernel) then this requires some form of
>>> arbitration in the guest because the hypervisor doesn't want to have to
>>> track an arbitrary number of regions to update.
>>
>> Why would FW care?
> 
> Exactly. It doesn't care. Not caring means it doesn't know about the
> page the guest has allocated for stolen time, and starts using it for
> its own purposes. Hello, memory corruption. Same thing goes if you
> reboot into a non stolen time aware kernel.

If you reboot, you go via the vcpu reset path which clears the map, no? 
Same goes for FW entry. If you enter firmware that does not set up the 
map, you never see it.

> 
>>
>>>    c) Performance can suffer if the host kernel doesn't have a suitably
>>> aligned/sized area to use. As you say - put_user() is more expensive.
>>
>> Just define the interface to always require natural alignment when
>> donating a memory location?
>>
>>> The structure is updated on every return to the VM.
>>
>> If you really do suffer from put_user(), there are alternatives. You
>> could just map the page on the registration hcall and then leave it
>> pinned until the vcpu gets destroyed again.
> 
> put_user() should be cheap enough. It is one of the things we tend to
> optimise anyway. And yes, worse case, we pin the page.
> 
>>
>>> Of course x86 does prove the third approach can work, but I'm not sure
>>> which is actually better. Avoid the kexec cancellation requirements was
>>> the main driver of the current approach. Although many of the
>>
>> I really don't understand the problem with kexec cancellation. Worst
>> case, let guest FW set it up for you and propagate only the address
>> down via ACPI/DT. That way you can mark the respective memory as
>> reserved too.
> 
> We already went down that road with the LPI hack. I'm not going there
> again if we can avoid it. And it turn out that we can. Just allocate
> the stolen time page as a separate memblock, give it to KVM for that
> purpose.
> 
> Your suggestion of letting the guest firmware set something up only
> works if whatever you're booting after that understands it. If it
> doesn't, you're screwed.

Why? For UEFI, mark the region as reserved in the memory map. For DT, 
just mark it straight on reserved.

That said, I'm not advocating for doing it in the FW. I think this can 
be solved really easily with a simple guest driver to enable and a vcpu 
reset hook to disable the map.

> 
>> But even with a Linux only mechanism, just take a look at
>> arch/x86/kernel/kvmclock.c. All they do to remove the map is to hook
>> into machine_crash_shutdown() and machine_shutdown().
> 
> I'm not going to take something that is Linux specific. It has to work
> for all guests, at all times, whether they know about the hypervisor
> service or not.

If they don't know about the HV service, they don't register the writer, 
so they don't see corruption.

If they know about the HV service and they don't support kexec, they 
don't have to worry because a vcpu reset should also clear the map.

If they do support kexec, they already have a mechanism to quiesce devices.

So I don't understand how this is Linux specific? The question was Linux 
specific, so I answered with precedence to show that disabling on kexec 
is not all that hard :).


Alex

Steven Price Aug. 16, 2019, 10:23 a.m. UTC | #7

On 14/08/2019 15:52, Alexander Graf wrote:
> 
> 
> On 14.08.19 16:19, Marc Zyngier wrote:
>> On Wed, 14 Aug 2019 14:02:25 +0100,
>> Alexander Graf <graf@amazon.com> wrote:
>>>
>>>
>>>
>>> On 05.08.19 15:06, Steven Price wrote:
>>>> On 03/08/2019 19:05, Marc Zyngier wrote:
>>>>> On Fri,  2 Aug 2019 15:50:08 +0100
>>>>> Steven Price <steven.price@arm.com> wrote:
>>>>>
>>>>> Hi Steven,
>>>>>
>>>>>> This series add support for paravirtualized time for arm64 guests and
>>>>>> KVM hosts following the specification in Arm's document DEN 0057A:
>>>>>>
>>>>>> https://developer.arm.com/docs/den0057/a
>>>>>>
>>>>>> It implements support for stolen time, allowing the guest to
>>>>>> identify time when it is forcibly not executing.
>>>>>>
>>>>>> It doesn't implement support for Live Physical Time (LPT) as there
>>>>>> are
>>>>>> some concerns about the overheads and approach in the above
>>>>>> specification, and I expect an updated version of the
>>>>>> specification to
>>>>>> be released soon with just the stolen time parts.
>>>>>
>>>>> Thanks for posting this.
>>>>>
>>>>> My current concern with this series is around the fact that we
>>>>> allocate
>>>>> memory from the kernel on behalf of the guest. It is the first example
>>>>> of such thing in the ARM port, and I can't really say I'm fond of it.
>>>>>
>>>>> x86 seems to get away with it by having the memory allocated from
>>>>> userspace, why I tend to like more. Yes, put_user is more
>>>>> expensive than a straight store, but this isn't done too often either.
>>>>>
>>>>> What is the rational for your current approach?
>>>>
>>>> As I see it there are 3 approaches that can be taken here:
>>>>
>>>> 1. Hypervisor allocates memory and adds it to the virtual machine. This
>>>> means that everything to do with the 'device' is encapsulated behind
>>>> the
>>>> KVM_CREATE_DEVICE / KVM_[GS]ET_DEVICE_ATTR ioctls. But since we want
>>>> the
>>>> stolen time structure to be fast it cannot be a trapping region and has
>>>> to be backed by real memory - in this case allocated by the host
>>>> kernel.
>>>>
>>>> 2. Host user space allocates memory. Similar to above, but this time
>>>> user space needs to manage the memory region as well as the usual
>>>> KVM_CREATE_DEVICE dance. I've no objection to this, but it means
>>>> kvmtool/QEMU needs to be much more aware of what is going on (e.g. how
>>>> to size the memory region).
>>>
>>> You ideally want to get the host overhead for a VM to as little as you
>>> can. I'm not terribly fond of the idea of reserving a full page just
>>> because we're too afraid of having the guest donate memory.
>>
>> Well, reduce the amount of memory you give to the guest by one page,
>> and allocate that page to the stolen time device. Problem solved!
>>
>> Seriously, if you're worried about the allocation of a single page,
>> you should first look at how many holes we have in the vcpu structure,
>> for example (even better, with the 8.4 NV patches applied). Just
>> fixing that would give you that page back *per vcpu*.
> 
> I'm worried about additional memory slots, about fragmenting the
> cachable guest memory regions, about avoidable HV taxes.
> 
> I think we need to distinguish here between the KVM implementation and
> the hypervisor/guest interface. Just because in KVM we can save overhead
> today doesn't mean that the HV interface should be built around the
> assumption that "memory is free".

The HV interface just requires that the host provides some memory for
the structures to live in. The memory can be adjacent (or even within)
the normal memory of the guest. The only requirement is that the guest
isn't told to use this memory for normal allocations (i.e. it should
either be explicitly reserved or just not contained within the normal
memory block).

>>
>>>> 3. Guest kernel "donates" the memory to the hypervisor for the
>>>> structure. As far as I'm aware this is what x86 does. The problems I
>>>> see
>>>> this approach are:
>>>>
>>>>    a) kexec becomes much more tricky - there needs to be a disabling
>>>> mechanism for the guest to stop the hypervisor scribbling on memory
>>>> before starting the new kernel.
>>>
>>> I wouldn't call "quiesce a device" much more tricky. We have to do
>>> that for other devices as well today.
>>
>> And since there is no standard way of doing it, we keep inventing
>> weird and wonderful ways of doing so -- cue the terrible GICv3 LPI
>> situation, and all the various hacks to keep existing IOMMU mappings
>> around across firmware/kernel handovers as well as kexec.
> 
> Well, the good news here is that we don't have to keep it around ;).
> 
>>
>>>
>>>>    b) If there is more than one entity that is interested in the
>>>> information (e.g. firmware and kernel) then this requires some form of
>>>> arbitration in the guest because the hypervisor doesn't want to have to
>>>> track an arbitrary number of regions to update.
>>>
>>> Why would FW care?
>>
>> Exactly. It doesn't care. Not caring means it doesn't know about the
>> page the guest has allocated for stolen time, and starts using it for
>> its own purposes. Hello, memory corruption. Same thing goes if you
>> reboot into a non stolen time aware kernel.
> 
> If you reboot, you go via the vcpu reset path which clears the map, no?
> Same goes for FW entry. If you enter firmware that does not set up the
> map, you never see it.

Doing this per-vcpu implies you are probably going to have to allocate
an entire page per vcpu. Because it's entirely possible for a guest to
reset an individual vcpu. Or at the least there's some messy reference
counting going on here.

Having a region of memory provided by the host means the structures can
be packed and there's nothing to be done in the reset path.

>>
>>>
>>>>    c) Performance can suffer if the host kernel doesn't have a suitably
>>>> aligned/sized area to use. As you say - put_user() is more expensive.
>>>
>>> Just define the interface to always require natural alignment when
>>> donating a memory location?
>>>
>>>> The structure is updated on every return to the VM.
>>>
>>> If you really do suffer from put_user(), there are alternatives. You
>>> could just map the page on the registration hcall and then leave it
>>> pinned until the vcpu gets destroyed again.
>>
>> put_user() should be cheap enough. It is one of the things we tend to
>> optimise anyway. And yes, worse case, we pin the page.
>>
>>>
>>>> Of course x86 does prove the third approach can work, but I'm not sure
>>>> which is actually better. Avoid the kexec cancellation requirements was
>>>> the main driver of the current approach. Although many of the
>>>
>>> I really don't understand the problem with kexec cancellation. Worst
>>> case, let guest FW set it up for you and propagate only the address
>>> down via ACPI/DT. That way you can mark the respective memory as
>>> reserved too.
>>
>> We already went down that road with the LPI hack. I'm not going there
>> again if we can avoid it. And it turn out that we can. Just allocate
>> the stolen time page as a separate memblock, give it to KVM for that
>> purpose.
>>
>> Your suggestion of letting the guest firmware set something up only
>> works if whatever you're booting after that understands it. If it
>> doesn't, you're screwed.
> 
> Why? For UEFI, mark the region as reserved in the memory map. For DT,
> just mark it straight on reserved.
> 
> That said, I'm not advocating for doing it in the FW. I think this can
> be solved really easily with a simple guest driver to enable and a vcpu
> reset hook to disable the map.
> 
>>
>>> But even with a Linux only mechanism, just take a look at
>>> arch/x86/kernel/kvmclock.c. All they do to remove the map is to hook
>>> into machine_crash_shutdown() and machine_shutdown().
>>
>> I'm not going to take something that is Linux specific. It has to work
>> for all guests, at all times, whether they know about the hypervisor
>> service or not.
> 
> If they don't know about the HV service, they don't register the writer,
> so they don't see corruption.
> 
> If they know about the HV service and they don't support kexec, they
> don't have to worry because a vcpu reset should also clear the map.
> 
> If they do support kexec, they already have a mechanism to quiesce devices.
> 
> So I don't understand how this is Linux specific? The question was Linux
> specific, so I answered with precedence to show that disabling on kexec
> is not all that hard :).

My concern is more around a something like Jailhouse as a
guest-hypervisor. There Linux gives up CPUs to run another OS. This
hand-off of a CPU is much easier if there's just a structure in memory
somewhere which doesn't move.

The kexec case like you say can be handled as a device to quiesce.

I don't think either scheme is unworkable, but I do think getting the
host to provide the memory is easier for both guest and host. Marc had a
good point that getting user space to allocate the memory is probably
preferable to getting the host kernel to do so, so I'm reworking the
code to do that.

Steve

zhukeqian July 21, 2020, 3:26 a.m. UTC | #8

Hi Steven,

On 2019/8/2 22:50, Steven Price wrote:
> This series add support for paravirtualized time for arm64 guests and
> KVM hosts following the specification in Arm's document DEN 0057A:
> 
> https://developer.arm.com/docs/den0057/a
> 
> It implements support for stolen time, allowing the guest to
> identify time when it is forcibly not executing.
> 
> It doesn't implement support for Live Physical Time (LPT) as there are
> some concerns about the overheads and approach in the above
Do you plan to pick up LPT support? As there is demand of cross-frequency migration
(from older platform to newer platform).

I am not clear about the overheads and approach problem here, could you please
give some detail information? Maybe we can work together to solve these concerns. :-)

Thanks,
Keqian
> specification, and I expect an updated version of the specification to
> be released soon with just the stolen time parts.
> 
> I previously posted a series including LPT (as well as stolen time):
> https://lore.kernel.org/kvmarm/20181212150226.38051-1-steven.price@arm.com/
> 
> Patches 2, 5, 7 and 8 are cleanup patches and could be taken separately.
> 
> Christoffer Dall (1):
>   KVM: arm/arm64: Factor out hypercall handling from PSCI code
> 
> Steven Price (8):
>   KVM: arm64: Document PV-time interface
>   KVM: arm64: Implement PV_FEATURES call
>   KVM: arm64: Support stolen time reporting via shared structure
>   KVM: Allow kvm_device_ops to be const
>   KVM: arm64: Provide a PV_TIME device to user space
>   arm/arm64: Provide a wrapper for SMCCC 1.1 calls
>   arm/arm64: Make use of the SMCCC 1.1 wrapper
>   arm64: Retrieve stolen time as paravirtualized guest
> 
>  Documentation/virtual/kvm/arm/pvtime.txt | 107 +++++++++++++
>  arch/arm/kvm/Makefile                    |   2 +-
>  arch/arm/kvm/handle_exit.c               |   2 +-
>  arch/arm/mm/proc-v7-bugs.c               |  13 +-
>  arch/arm64/include/asm/kvm_host.h        |  13 +-
>  arch/arm64/include/asm/kvm_mmu.h         |   2 +
>  arch/arm64/include/asm/pvclock-abi.h     |  20 +++
>  arch/arm64/include/uapi/asm/kvm.h        |   6 +
>  arch/arm64/kernel/Makefile               |   1 +
>  arch/arm64/kernel/cpu_errata.c           |  80 ++++------
>  arch/arm64/kernel/kvm.c                  | 155 ++++++++++++++++++
>  arch/arm64/kvm/Kconfig                   |   1 +
>  arch/arm64/kvm/Makefile                  |   2 +
>  arch/arm64/kvm/handle_exit.c             |   4 +-
>  include/kvm/arm_hypercalls.h             |  44 ++++++
>  include/kvm/arm_psci.h                   |   2 +-
>  include/linux/arm-smccc.h                |  58 +++++++
>  include/linux/cpuhotplug.h               |   1 +
>  include/linux/kvm_host.h                 |   4 +-
>  include/linux/kvm_types.h                |   2 +
>  include/uapi/linux/kvm.h                 |   2 +
>  virt/kvm/arm/arm.c                       |  18 +++
>  virt/kvm/arm/hypercalls.c                | 138 ++++++++++++++++
>  virt/kvm/arm/mmu.c                       |  44 ++++++
>  virt/kvm/arm/psci.c                      |  84 +---------
>  virt/kvm/arm/pvtime.c                    | 190 +++++++++++++++++++++++
>  virt/kvm/kvm_main.c                      |   6 +-
>  27 files changed, 848 insertions(+), 153 deletions(-)
>  create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
>  create mode 100644 arch/arm64/include/asm/pvclock-abi.h
>  create mode 100644 arch/arm64/kernel/kvm.c
>  create mode 100644 include/kvm/arm_hypercalls.h
>  create mode 100644 virt/kvm/arm/hypercalls.c
>  create mode 100644 virt/kvm/arm/pvtime.c
>

Steven Price July 27, 2020, 10:48 a.m. UTC | #9

On 21/07/2020 04:26, zhukeqian wrote:
> Hi Steven,

Hi Keqian,

> On 2019/8/2 22:50, Steven Price wrote:
>> This series add support for paravirtualized time for arm64 guests and
>> KVM hosts following the specification in Arm's document DEN 0057A:
>>
>> https://developer.arm.com/docs/den0057/a
>>
>> It implements support for stolen time, allowing the guest to
>> identify time when it is forcibly not executing.
>>
>> It doesn't implement support for Live Physical Time (LPT) as there are
>> some concerns about the overheads and approach in the above
> Do you plan to pick up LPT support? As there is demand of cross-frequency migration
> (from older platform to newer platform).

I don't have any plans to pick up the LPT support at the moment - feel 
free to pick it up! ;)

> I am not clear about the overheads and approach problem here, could you please
> give some detail information? Maybe we can work together to solve these concerns. :-)

Fundamentally the issue here is that LPT only solves one small part of 
migration between different hosts. To successfully migrate between hosts 
with different CPU implementations it is also necessary to be able to 
virtualise various ID registers (e.g. MIDR_EL1, REVIDR_EL1, AIDR_EL1) 
which we have no support for currently.

The problem with just virtualising the registers is how you handle 
errata. The guest will currently use those (and other) ID registers to 
decide whether to enable specific errata workarounds. But what errata 
should be enabled for a guest which might migrate to another host?

What we ideally need is a mechanism to communicate to the guest what 
workarounds are required to successfully run on any of the hosts that 
the guest may be migrated to. You may also have the situation where the 
workarounds required for two hosts are mutually incompatible - something 
needs to understand this and do the "right thing" (most likely just 
reject this situation, i.e. prevent the migration).

There are various options here: e.g. a para-virtualised interface to 
describe the workarounds (but this is hard to do in an OS-agnostic way), 
or virtual-ID registers describing an idealised environment where no 
workarounds are required (and only hosts that have no errata affecting a 
guest would be able to provide this).

Given the above complexity and the fact that Armv8.6-A standardises the 
frequency to 1GHz this didn't seem worth continuing with. So LPT was 
dropped from the spec and patches to avoid holding up the stolen time 
support.

However, if you have a use case which doesn't require such a generic 
migration (e.g. perhaps old and new platforms are based on the same IP) 
then it might be worth looking at bring this back. But to make the 
problem solvable it either needs to be restricted to platforms which are 
substantially the same (so the errata list will be identical), or 
there's work to be done in preparation to deal with migrating a guest 
successfully between hosts with potentially different errata requirements.

Can you share more details about the hosts that you are interested in 
migrating between?

Thanks,

Steve

zhukeqian July 29, 2020, 2:57 a.m. UTC | #10

Hi Steven,

On 2020/7/27 18:48, Steven Price wrote:
> On 21/07/2020 04:26, zhukeqian wrote:
>> Hi Steven,
> 
> Hi Keqian,
> 
>> On 2019/8/2 22:50, Steven Price wrote:
>>> This series add support for paravirtualized time for arm64 guests and
>>> KVM hosts following the specification in Arm's document DEN 0057A:
>>>
>>> https://developer.arm.com/docs/den0057/a
>>>
>>> It implements support for stolen time, allowing the guest to
>>> identify time when it is forcibly not executing.
>>>
>>> It doesn't implement support for Live Physical Time (LPT) as there are
>>> some concerns about the overheads and approach in the above
>> Do you plan to pick up LPT support? As there is demand of cross-frequency migration
>> (from older platform to newer platform).
> 
> I don't have any plans to pick up the LPT support at the moment - feel free to pick it up! ;)
> 
>> I am not clear about the overheads and approach problem here, could you please
>> give some detail information? Maybe we can work together to solve these concerns. :-)
> 
> Fundamentally the issue here is that LPT only solves one small part of migration between different hosts. To successfully migrate between hosts with different CPU implementations it is also necessary to be able to virtualise various ID registers (e.g. MIDR_EL1, REVIDR_EL1, AIDR_EL1) which we have no support for currently.
> 
Yeah, currently we are trying to do both timer freq virtualization and CPU feature virtualization.

> The problem with just virtualising the registers is how you handle errata. The guest will currently use those (and other) ID registers to decide whether to enable specific errata workarounds. But what errata should be enabled for a guest which might migrate to another host?
> 
Thanks for pointing this out.

I think the most important thing is that we should introduce a concept named CPU baseline which represents a standard platform.
If we bring up a guest with a specific CPU baseline, then this guest can only run on a platform that is compatible with this CPU baseline.
So "baseline" and "compatible" are the key point to promise successful cross-platform migration.


> What we ideally need is a mechanism to communicate to the guest what workarounds are required to successfully run on any of the hosts that the guest may be migrated to. You may also have the situation where the workarounds required for two hosts are mutually incompatible - something needs to understand this and do the "right thing" (most likely just reject this situation, i.e. prevent the migration).
> 
> There are various options here: e.g. a para-virtualised interface to describe the workarounds (but this is hard to do in an OS-agnostic way), or virtual-ID registers describing an idealised environment where no workarounds are required (and only hosts that have no errata affecting a guest would be able to provide this).
> 
My idea is similar with the "idealised environment", but errata workaround still exists.
We do not provide para-virtualised interface, and migration is restricted between platforms that are compatible with baseline.

Baseline should has two aspects: CPU feature and errata. These platforms that are compatible with a specific baseline should have the corresponding CPU feature and errata.

> Given the above complexity and the fact that Armv8.6-A standardises the frequency to 1GHz this didn't seem worth continuing with. So LPT was dropped from the spec and patches to avoid holding up the stolen time support.
> 
> However, if you have a use case which doesn't require such a generic migration (e.g. perhaps old and new platforms are based on the same IP) then it might be worth looking at bring this back. But to make the problem solvable it either needs to be restricted to platforms which are substantially the same (so the errata list will be identical), or there's work to be done in preparation to deal with migrating a guest successfully between hosts with potentially different errata requirements.
> 
> Can you share more details about the hosts that you are interested in migrating between?
Here we have new platform with 1GHz timer, and old platform is 100MHZ, so we want to solve the cross-platform migration firstly.

Thanks,
Keqian
> 
> Thanks,
> 
> Steve
> .
>

[0/9] arm64: Stolen time support

Message

Comments