diff mbox series

[1/9] KVM: arm64: Document PV-time interface

Message ID 20190802145017.42543-2-steven.price@arm.com (mailing list archive)
State New, archived
Headers show
Series arm64: Stolen time support | expand

Commit Message

Steven Price Aug. 2, 2019, 2:50 p.m. UTC
Introduce a paravirtualization interface for KVM/arm64 based on the
"Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.

This only adds the details about "Stolen Time" as the details of "Live
Physical Time" have not been fully agreed.

User space can specify a reserved area of memory for the guest and
inform KVM to populate the memory with information on time that the host
kernel has stolen from the guest.

A hypercall interface is provided for the guest to interrogate the
hypervisor's support for this interface and the location of the shared
memory structures.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 Documentation/virtual/kvm/arm/pvtime.txt | 107 +++++++++++++++++++++++
 1 file changed, 107 insertions(+)
 create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt

Comments

Marc Zyngier Aug. 3, 2019, 11:13 a.m. UTC | #1
On Fri,  2 Aug 2019 15:50:09 +0100
Steven Price <steven.price@arm.com> wrote:

[+Peter for the userspace aspect of things]

Hi Steve,

> Introduce a paravirtualization interface for KVM/arm64 based on the
> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.
> 
> This only adds the details about "Stolen Time" as the details of "Live
> Physical Time" have not been fully agreed.
> 
> User space can specify a reserved area of memory for the guest and
> inform KVM to populate the memory with information on time that the host
> kernel has stolen from the guest.
> 
> A hypercall interface is provided for the guest to interrogate the
> hypervisor's support for this interface and the location of the shared
> memory structures.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  Documentation/virtual/kvm/arm/pvtime.txt | 107 +++++++++++++++++++++++
>  1 file changed, 107 insertions(+)
>  create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
> 
> diff --git a/Documentation/virtual/kvm/arm/pvtime.txt b/Documentation/virtual/kvm/arm/pvtime.txt
> new file mode 100644
> index 000000000000..e6ae9799e1d5
> --- /dev/null
> +++ b/Documentation/virtual/kvm/arm/pvtime.txt
> @@ -0,0 +1,107 @@
> +Paravirtualized time support for arm64
> +======================================
> +
> +Arm specification DEN0057/A defined a standard for paravirtualised time
> +support for Aarch64 guests:

nit: AArch64

> +
> +https://developer.arm.com/docs/den0057/a

Between this file and the above document, which one is authoritative?

> +
> +KVM/Arm64 implements the stolen time part of this specification by providing

nit: KVM/arm64

> +some hypervisor service calls to support a paravirtualized guest obtaining a
> +view of the amount of time stolen from its execution.
> +
> +Two new SMCCC compatible hypercalls are defined:
> +
> +PV_FEATURES 0xC5000020
> +PV_TIME_ST  0xC5000022
> +
> +These are only available in the SMC64/HVC64 calling convention as
> +paravirtualized time is not available to 32 bit Arm guests.
> +
> +PV_FEATURES
> +    Function ID:  (uint32)  : 0xC5000020
> +    PV_func_id:   (uint32)  : Either PV_TIME_LPT or PV_TIME_ST
> +    Return value: (int32)   : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant
> +                              PV-time feature is supported by the hypervisor.

How is PV_FEATURES discovered? Is the intention to make it a generic
ARM-wide PV discovery mechanism, not specific to PV time?

> +
> +PV_TIME_ST
> +    Function ID:  (uint32)  : 0xC5000022
> +    Return value: (int64)   : IPA of the stolen time data structure for this
> +                              (V)CPU. On failure:
> +                              NOT_SUPPORTED (-1)
> +

Is the size implicit? What are the memory attributes? This either needs
documenting here, or point to the right bit to the spec.

> +Stolen Time
> +-----------
> +
> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
> +
> +  Field       | Byte Length | Byte Offset | Description
> +  ----------- | ----------- | ----------- | --------------------------
> +  Revision    |      4      |      0      | Must be 0 for version 0.1
> +  Attributes  |      4      |      4      | Must be 0
> +  Stolen time |      8      |      8      | Stolen time in unsigned
> +              |             |             | nanoseconds indicating how
> +              |             |             | much time this VCPU thread
> +              |             |             | was involuntarily not
> +              |             |             | running on a physical CPU.
> +
> +The structure will be updated by the hypervisor periodically as time is stolen

Is it really periodic? If so, when is the update frequency?

> +from the VCPU. It will be present within a reserved region of the normal
> +memory given to the guest. The guest should not attempt to write into this
> +memory. There is a structure by VCPU of the guest.

What if the vcpu writes to it? Does it get a fault? If there is a
structure per vcpu, what is the layout in memory? How does a vcpu find
its own data structure? Is that the address returned by PV_TIME_ST?

> +
> +User space interface
> +====================
> +
> +User space can request that KVM provide the paravirtualized time interface to
> +a guest by creating a KVM_DEV_TYPE_ARM_PV_TIME device, for example:
> +
> +    struct kvm_create_device pvtime_device = {
> +            .type = KVM_DEV_TYPE_ARM_PV_TIME,
> +            .attr = 0,
> +            .flags = 0,
> +    };
> +
> +    pvtime_fd = ioctl(vm_fd, KVM_CREATE_DEVICE, &pvtime_device);
> +
> +The guest IPA of the structures must be given to KVM. This is the base address

nit: s/guest //

> +of an array of stolen time structures (one for each VCPU). For example:
> +
> +    struct kvm_device_attr st_base = {
> +            .group = KVM_DEV_ARM_PV_TIME_PADDR,
> +            .attr = KVM_DEV_ARM_PV_TIME_ST,
> +            .addr = (u64)(unsigned long)&st_paddr
> +    };
> +
> +    ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &st_base);

So the allocation itself is performed by the kernel? What are the
ordering requirements between creating vcpus and the device? What are
the alignment requirements for the base address?

> +
> +For migration (or save/restore) of a guest it is necessary to save the contents
> +of the shared page(s) and later restore them. KVM_DEV_ARM_PV_TIME_STATE_SIZE
> +provides the size of this data and KVM_DEV_ARM_PV_TIME_STATE allows the state
> +to be read/written.

Is the size variable depending on the number of vcpus?

> +
> +It is also necessary for the physical address to be set identically when
> +restoring.
> +
> +    void *save_state(int fd, u64 attr, u32 *size) {
> +        struct kvm_device_attr get_size = {
> +                .group = KVM_DEV_ARM_PV_TIME_STATE_SIZE,
> +                .attr = attr,
> +                .addr = (u64)(unsigned long)size
> +        };
> +
> +        ioctl(fd, KVM_GET_DEVICE_ATTR, get_size);
> +
> +        void *buffer = malloc(*size);
> +
> +        struct kvm_device_attr get_state = {
> +                .group = KVM_DEV_ARM_PV_TIME_STATE,
> +                .attr = attr,
> +                .addr = (u64)(unsigned long)size
> +        };
> +
> +        ioctl(fd, KVM_GET_DEVICE_ATTR, buffer);
> +    }
> +
> +    void *st_state = save_state(pvtime_fd, KVM_DEV_ARM_PV_TIME_ST, &st_size);
> +

Thanks,

	M.
Zenghui Yu Aug. 5, 2019, 3:23 a.m. UTC | #2
Hi Steven,

On 2019/8/2 22:50, Steven Price wrote:
> Introduce a paravirtualization interface for KVM/arm64 based on the
> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.
> 
> This only adds the details about "Stolen Time" as the details of "Live
> Physical Time" have not been fully agreed.
> 
> User space can specify a reserved area of memory for the guest and
> inform KVM to populate the memory with information on time that the host
> kernel has stolen from the guest.
> 
> A hypercall interface is provided for the guest to interrogate the
> hypervisor's support for this interface and the location of the shared
> memory structures.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>   Documentation/virtual/kvm/arm/pvtime.txt | 107 +++++++++++++++++++++++
>   1 file changed, 107 insertions(+)
>   create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
                                     ^^^^^^^
This directory has been renamed recently, see:

https://patchwork.ozlabs.org/patch/1136104/


Zenghui

> 
> diff --git a/Documentation/virtual/kvm/arm/pvtime.txt b/Documentation/virtual/kvm/arm/pvtime.txt
> new file mode 100644
> index 000000000000..e6ae9799e1d5
> --- /dev/null
> +++ b/Documentation/virtual/kvm/arm/pvtime.txt
> @@ -0,0 +1,107 @@
> +Paravirtualized time support for arm64
> +======================================
> +
> +Arm specification DEN0057/A defined a standard for paravirtualised time
> +support for Aarch64 guests:
> +
> +https://developer.arm.com/docs/den0057/a
> +
> +KVM/Arm64 implements the stolen time part of this specification by providing
> +some hypervisor service calls to support a paravirtualized guest obtaining a
> +view of the amount of time stolen from its execution.
> +
> +Two new SMCCC compatible hypercalls are defined:
> +
> +PV_FEATURES 0xC5000020
> +PV_TIME_ST  0xC5000022
> +
> +These are only available in the SMC64/HVC64 calling convention as
> +paravirtualized time is not available to 32 bit Arm guests.
> +
> +PV_FEATURES
> +    Function ID:  (uint32)  : 0xC5000020
> +    PV_func_id:   (uint32)  : Either PV_TIME_LPT or PV_TIME_ST
> +    Return value: (int32)   : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant
> +                              PV-time feature is supported by the hypervisor.
> +
> +PV_TIME_ST
> +    Function ID:  (uint32)  : 0xC5000022
> +    Return value: (int64)   : IPA of the stolen time data structure for this
> +                              (V)CPU. On failure:
> +                              NOT_SUPPORTED (-1)
> +
> +Stolen Time
> +-----------
> +
> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
> +
> +  Field       | Byte Length | Byte Offset | Description
> +  ----------- | ----------- | ----------- | --------------------------
> +  Revision    |      4      |      0      | Must be 0 for version 0.1
> +  Attributes  |      4      |      4      | Must be 0
> +  Stolen time |      8      |      8      | Stolen time in unsigned
> +              |             |             | nanoseconds indicating how
> +              |             |             | much time this VCPU thread
> +              |             |             | was involuntarily not
> +              |             |             | running on a physical CPU.
> +
> +The structure will be updated by the hypervisor periodically as time is stolen
> +from the VCPU. It will be present within a reserved region of the normal
> +memory given to the guest. The guest should not attempt to write into this
> +memory. There is a structure by VCPU of the guest.
> +
> +User space interface
> +====================
> +
> +User space can request that KVM provide the paravirtualized time interface to
> +a guest by creating a KVM_DEV_TYPE_ARM_PV_TIME device, for example:
> +
> +    struct kvm_create_device pvtime_device = {
> +            .type = KVM_DEV_TYPE_ARM_PV_TIME,
> +            .attr = 0,
> +            .flags = 0,
> +    };
> +
> +    pvtime_fd = ioctl(vm_fd, KVM_CREATE_DEVICE, &pvtime_device);
> +
> +The guest IPA of the structures must be given to KVM. This is the base address
> +of an array of stolen time structures (one for each VCPU). For example:
> +
> +    struct kvm_device_attr st_base = {
> +            .group = KVM_DEV_ARM_PV_TIME_PADDR,
> +            .attr = KVM_DEV_ARM_PV_TIME_ST,
> +            .addr = (u64)(unsigned long)&st_paddr
> +    };
> +
> +    ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &st_base);
> +
> +For migration (or save/restore) of a guest it is necessary to save the contents
> +of the shared page(s) and later restore them. KVM_DEV_ARM_PV_TIME_STATE_SIZE
> +provides the size of this data and KVM_DEV_ARM_PV_TIME_STATE allows the state
> +to be read/written.
> +
> +It is also necessary for the physical address to be set identically when
> +restoring.
> +
> +    void *save_state(int fd, u64 attr, u32 *size) {
> +        struct kvm_device_attr get_size = {
> +                .group = KVM_DEV_ARM_PV_TIME_STATE_SIZE,
> +                .attr = attr,
> +                .addr = (u64)(unsigned long)size
> +        };
> +
> +        ioctl(fd, KVM_GET_DEVICE_ATTR, get_size);
> +
> +        void *buffer = malloc(*size);
> +
> +        struct kvm_device_attr get_state = {
> +                .group = KVM_DEV_ARM_PV_TIME_STATE,
> +                .attr = attr,
> +                .addr = (u64)(unsigned long)size
> +        };
> +
> +        ioctl(fd, KVM_GET_DEVICE_ATTR, buffer);
> +    }
> +
> +    void *st_state = save_state(pvtime_fd, KVM_DEV_ARM_PV_TIME_ST, &st_size);
> +
>
Steven Price Aug. 5, 2019, 1:06 p.m. UTC | #3
On 05/08/2019 04:23, Zenghui Yu wrote:
> Hi Steven,
> 
> On 2019/8/2 22:50, Steven Price wrote:
>> Introduce a paravirtualization interface for KVM/arm64 based on the
>> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.
>>
>> This only adds the details about "Stolen Time" as the details of "Live
>> Physical Time" have not been fully agreed.
>>
>> User space can specify a reserved area of memory for the guest and
>> inform KVM to populate the memory with information on time that the host
>> kernel has stolen from the guest.
>>
>> A hypercall interface is provided for the guest to interrogate the
>> hypervisor's support for this interface and the location of the shared
>> memory structures.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>   Documentation/virtual/kvm/arm/pvtime.txt | 107 +++++++++++++++++++++++
>>   1 file changed, 107 insertions(+)
>>   create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
>                                     ^^^^^^^
> This directory has been renamed recently, see:
> 
> https://patchwork.ozlabs.org/patch/1136104/

Thanks for pointing that out - I'll move it in the next version.

Steve
Steven Price Aug. 5, 2019, 1:06 p.m. UTC | #4
On 03/08/2019 12:13, Marc Zyngier wrote:
> On Fri,  2 Aug 2019 15:50:09 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
> [+Peter for the userspace aspect of things]
> 
> Hi Steve,
> 
>> Introduce a paravirtualization interface for KVM/arm64 based on the
>> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.
>>
>> This only adds the details about "Stolen Time" as the details of "Live
>> Physical Time" have not been fully agreed.
>>
>> User space can specify a reserved area of memory for the guest and
>> inform KVM to populate the memory with information on time that the host
>> kernel has stolen from the guest.
>>
>> A hypercall interface is provided for the guest to interrogate the
>> hypervisor's support for this interface and the location of the shared
>> memory structures.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  Documentation/virtual/kvm/arm/pvtime.txt | 107 +++++++++++++++++++++++
>>  1 file changed, 107 insertions(+)
>>  create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
>>
>> diff --git a/Documentation/virtual/kvm/arm/pvtime.txt b/Documentation/virtual/kvm/arm/pvtime.txt
>> new file mode 100644
>> index 000000000000..e6ae9799e1d5
>> --- /dev/null
>> +++ b/Documentation/virtual/kvm/arm/pvtime.txt
>> @@ -0,0 +1,107 @@
>> +Paravirtualized time support for arm64
>> +======================================
>> +
>> +Arm specification DEN0057/A defined a standard for paravirtualised time
>> +support for Aarch64 guests:
> 
> nit: AArch64
> 
>> +
>> +https://developer.arm.com/docs/den0057/a
> 
> Between this file and the above document, which one is authoritative?

The above document should be authoritative - although I'm still waiting
for the final version to be published. I'm not expecting any changes to
the stolen time part though.

>> +
>> +KVM/Arm64 implements the stolen time part of this specification by providing
> 
> nit: KVM/arm64
> 
>> +some hypervisor service calls to support a paravirtualized guest obtaining a
>> +view of the amount of time stolen from its execution.
>> +
>> +Two new SMCCC compatible hypercalls are defined:
>> +
>> +PV_FEATURES 0xC5000020
>> +PV_TIME_ST  0xC5000022
>> +
>> +These are only available in the SMC64/HVC64 calling convention as
>> +paravirtualized time is not available to 32 bit Arm guests.
>> +
>> +PV_FEATURES
>> +    Function ID:  (uint32)  : 0xC5000020
>> +    PV_func_id:   (uint32)  : Either PV_TIME_LPT or PV_TIME_ST
>> +    Return value: (int32)   : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant
>> +                              PV-time feature is supported by the hypervisor.
> 
> How is PV_FEATURES discovered? Is the intention to make it a generic
> ARM-wide PV discovery mechanism, not specific to PV time?

SMCCC is mandated for PV time. So, assuming the hypervisor supports
SMCCC, the "NOT_SUPPORTED" return is mandated by SMCCC if PV time isn't
supported.

However, we do also use the SMCCC_ARCH_FEATURES mechanism to check the
existence of PV_FEATURES before use. I'll update the document to call
this out.

>> +
>> +PV_TIME_ST
>> +    Function ID:  (uint32)  : 0xC5000022
>> +    Return value: (int64)   : IPA of the stolen time data structure for this
>> +                              (V)CPU. On failure:
>> +                              NOT_SUPPORTED (-1)
>> +
> 
> Is the size implicit? What are the memory attributes? This either needs
> documenting here, or point to the right bit to the spec.

The size is implicit - it's a pointer to the below structure, so the
guest can only rely on the first 16 bytes being valid. The memory
attributes are described in the specification as:

"The calling guest can map the IPA into normal memory with inner and
outer write back caching attributes, in the inner sharable domain"

I'll put those details in this document for completeness.

>> +Stolen Time
>> +-----------
>> +
>> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
>> +
>> +  Field       | Byte Length | Byte Offset | Description
>> +  ----------- | ----------- | ----------- | --------------------------
>> +  Revision    |      4      |      0      | Must be 0 for version 0.1
>> +  Attributes  |      4      |      4      | Must be 0
>> +  Stolen time |      8      |      8      | Stolen time in unsigned
>> +              |             |             | nanoseconds indicating how
>> +              |             |             | much time this VCPU thread
>> +              |             |             | was involuntarily not
>> +              |             |             | running on a physical CPU.
>> +
>> +The structure will be updated by the hypervisor periodically as time is stolen
> 
> Is it really periodic? If so, when is the update frequency?

Hmm, periodic might be the wrong term - there is no guaranteed frequency
of update. The spec says:

"The hypervisor must update this value prior to scheduling a virtual CPU"

I guess that's probably the best description.

>> +from the VCPU. It will be present within a reserved region of the normal
>> +memory given to the guest. The guest should not attempt to write into this
>> +memory. There is a structure by VCPU of the guest.
> 
> What if the vcpu writes to it? Does it get a fault?

From the perspective from the specification this is undefined. A fault
would therefore be acceptable but isn't generated in the implementation
defined here.

> If there is a
> structure per vcpu, what is the layout in memory? How does a vcpu find
> its own data structure? Is that the address returned by PV_TIME_ST?

A call to PV_TIME_ST returns the structure for the calling vCPU - I'll
make that explicit. The layout is therefore defined by the hypervisor
and cannot be relied on by the guest. As below this implementation uses
a simple array of structures.

>> +
>> +User space interface
>> +====================
>> +
>> +User space can request that KVM provide the paravirtualized time interface to
>> +a guest by creating a KVM_DEV_TYPE_ARM_PV_TIME device, for example:
>> +
>> +    struct kvm_create_device pvtime_device = {
>> +            .type = KVM_DEV_TYPE_ARM_PV_TIME,
>> +            .attr = 0,
>> +            .flags = 0,
>> +    };
>> +
>> +    pvtime_fd = ioctl(vm_fd, KVM_CREATE_DEVICE, &pvtime_device);
>> +
>> +The guest IPA of the structures must be given to KVM. This is the base address
> 
> nit: s/guest //
> 
>> +of an array of stolen time structures (one for each VCPU). For example:
>> +
>> +    struct kvm_device_attr st_base = {
>> +            .group = KVM_DEV_ARM_PV_TIME_PADDR,
>> +            .attr = KVM_DEV_ARM_PV_TIME_ST,
>> +            .addr = (u64)(unsigned long)&st_paddr
>> +    };
>> +
>> +    ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &st_base);
> 
> So the allocation itself is performed by the kernel? What are the
> ordering requirements between creating vcpus and the device? What are
> the alignment requirements for the base address?

The base address should be page aligned - I'll spell that out.

There are currently no ordering requirements between creating vcpus and
the device. However...

>> +
>> +For migration (or save/restore) of a guest it is necessary to save the contents
>> +of the shared page(s) and later restore them. KVM_DEV_ARM_PV_TIME_STATE_SIZE
>> +provides the size of this data and KVM_DEV_ARM_PV_TIME_STATE allows the state
>> +to be read/written.
> 
> Is the size variable depending on the number of vcpus?

...yes - so restoring the state after migration must be done after
creating the vcpus. I'll point out that the device should created after.

Thanks for the review,

Steve
Christophe de Dinechin Aug. 5, 2019, 4:40 p.m. UTC | #5
Steven Price writes:

> Introduce a paravirtualization interface for KVM/arm64 based on the
> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.
>
> This only adds the details about "Stolen Time" as the details of "Live
> Physical Time" have not been fully agreed.
>
[...]

> +
> +Stolen Time
> +-----------
> +
> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
> +
> +  Field       | Byte Length | Byte Offset | Description
> +  ----------- | ----------- | ----------- | --------------------------
> +  Revision    |      4      |      0      | Must be 0 for version 0.1
> +  Attributes  |      4      |      4      | Must be 0
> +  Stolen time |      8      |      8      | Stolen time in unsigned
> +              |             |             | nanoseconds indicating how
> +              |             |             | much time this VCPU thread
> +              |             |             | was involuntarily not
> +              |             |             | running on a physical CPU.

I know very little about the topic, but I don't understand how the spec
as proposed allows an accurate reading of the relation between physical
time and stolen time simultaneously. In other words, could you draw
Figure 1 of the spec from within the guest? Or is it a non-objective?

For example, if you read the stolen time before you read CNTVCT_EL0,
isn't it possible for a lengthy event like a migration to occur between
the two reads, causing the stolen time to be obsolete and off by seconds?

--
Cheers,
Christophe de Dinechin (IRC c3d)
Steven Price Aug. 7, 2019, 1:21 p.m. UTC | #6
On 05/08/2019 17:40, Christophe de Dinechin wrote:
> 
> Steven Price writes:
> 
>> Introduce a paravirtualization interface for KVM/arm64 based on the
>> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.
>>
>> This only adds the details about "Stolen Time" as the details of "Live
>> Physical Time" have not been fully agreed.
>>
> [...]
> 
>> +
>> +Stolen Time
>> +-----------
>> +
>> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
>> +
>> +  Field       | Byte Length | Byte Offset | Description
>> +  ----------- | ----------- | ----------- | --------------------------
>> +  Revision    |      4      |      0      | Must be 0 for version 0.1
>> +  Attributes  |      4      |      4      | Must be 0
>> +  Stolen time |      8      |      8      | Stolen time in unsigned
>> +              |             |             | nanoseconds indicating how
>> +              |             |             | much time this VCPU thread
>> +              |             |             | was involuntarily not
>> +              |             |             | running on a physical CPU.
> 
> I know very little about the topic, but I don't understand how the spec
> as proposed allows an accurate reading of the relation between physical
> time and stolen time simultaneously. In other words, could you draw
> Figure 1 of the spec from within the guest? Or is it a non-objective?

Figure 1 is mostly attempting to explain Live Physical Time (LPT), which
is not part of this patch series. But it does touch on stolen time by
the difference between "live physical time" and "virtual time".

I'm not sure what you mean by "from within the guest". From the
perspective of the guest the parts of the diagram where the guest isn't
running don't exist (therefore there are discontinuities in the
"physical time" and "live physical time" lines).

This patch series doesn't attempt to provide the guest with a view of
"physical time" (or LPT) - but it might be able to observe that by
consulting something external (e.g. an NTP server, or an emulated RTC
which reports wall-clock time).

What it does provide is a mechanism for obtaining the difference (as
reported by the host) between "live physical time" and "virtual time" -
this is reported in nanoseconds in the above structure.

> For example, if you read the stolen time before you read CNTVCT_EL0,
> isn't it possible for a lengthy event like a migration to occur between
> the two reads, causing the stolen time to be obsolete and off by seconds?

"Lengthy events" like migration are represented by the "paused" state in
the diagram - i.e. it's the difference between "physical time" and "live
physical time". So stolen time doesn't attempt to represent that.

And yes, there is a race between reading CNTVCT_EL0 and reading stolen
time - but in practice this doesn't really matter. The usual pseudo-code
way of using stolen time is:

  * scheduler captures stolen time from structure and CNTVCT_EL0:
      before_timer = CNTVCT_EL0
      before_stolen = stolen
  * schedule in process
  * process is pre-empted (or blocked in some way)
  * scheduler captures stolen time from structure and CNTVCT_EL0:
      after_timer = CNTVCT_EL0
      after_stolen = stolen
      time = to_nsecs(after_timer - before_timer) -
             (after_stolen - before_stolen)

The scheduler can then charge the process for "time" nanoseconds of
time. This ensures that a process isn't unfairly penalised if the host
doesn't schedule the VCPU while it is supposed to be running.

The race is very small in comparison to the time the process is running,
and in the worst case just means the process is charged slightly more
(or less) than it should be.

I guess if you're really worried about it, you could do a dance like:

	do {
		before = stolen
		timer = CNTVCT_EL0
		after = stolen
	} while (before != after);

But I don't see the need to have such an accurate view of elapsed time
that the VCPU was scheduled. And of course at the moment (without this
series) the guest has no idea about time stolen by the host.

Steve
Steven Price Aug. 7, 2019, 3:26 p.m. UTC | #7
On 07/08/2019 15:28, Christophe de Dinechin wrote:
> 
> 
>> On 7 Aug 2019, at 15:21, Steven Price <steven.price@arm.com
>> <mailto:steven.price@arm.com>> wrote:
>>
>> On 05/08/2019 17:40, Christophe de Dinechin wrote:
>>>
>>> Steven Price writes:
>>>
>>>> Introduce a paravirtualization interface for KVM/arm64 based on the
>>>> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.
>>>>
>>>> This only adds the details about "Stolen Time" as the details of "Live
>>>> Physical Time" have not been fully agreed.
>>>>
>>> [...]
>>>
>>>> +
>>>> +Stolen Time
>>>> +-----------
>>>> +
>>>> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
>>>> +
>>>> +  Field       | Byte Length | Byte Offset | Description
>>>> +  ----------- | ----------- | ----------- | --------------------------
>>>> +  Revision    |      4      |      0      | Must be 0 for version 0.1
>>>> +  Attributes  |      4      |      4      | Must be 0
>>>> +  Stolen time |      8      |      8      | Stolen time in unsigned
>>>> +              |             |             | nanoseconds indicating how
>>>> +              |             |             | much time this VCPU thread
>>>> +              |             |             | was involuntarily not
>>>> +              |             |             | running on a physical CPU.
>>>
>>> I know very little about the topic, but I don't understand how the spec
>>> as proposed allows an accurate reading of the relation between physical
>>> time and stolen time simultaneously. In other words, could you draw
>>> Figure 1 of the spec from within the guest? Or is it a non-objective?
>>
>> Figure 1 is mostly attempting to explain Live Physical Time (LPT), which
>> is not part of this patch series. But it does touch on stolen time by
>> the difference between "live physical time" and "virtual time".
>>
>> I'm not sure what you mean by "from within the guest". From the
>> perspective of the guest the parts of the diagram where the guest isn't
>> running don't exist (therefore there are discontinuities in the
>> "physical time" and "live physical time" lines).
> 
> I meant: If I run code within the guest that attempts to draw Figure 1,
> race conditions may cause the diagram actually drawn by your guest
> program to look completely wrong on occasions.
> 
>> This patch series doesn't attempt to provide the guest with a view of
>> "physical time" (or LPT) - but it might be able to observe that by
>> consulting something external (e.g. an NTP server, or an emulated RTC
>> which reports wall-clock time).
> 
> … with what appear to be like a built-in race condition, as you correctly
> identified. I was wondering if the built-in race condition was deliberate
> and/or necessary, or if it was irrelevant for the planned uses of the value.
> 
>> What it does provide is a mechanism for obtaining the difference (as
>> reported by the host) between "live physical time" and "virtual time" -
>> this is reported in nanoseconds in the above structure.
>>
>>> For example, if you read the stolen time before you read CNTVCT_EL0,
>>> isn't it possible for a lengthy event like a migration to occur between
>>> the two reads, causing the stolen time to be obsolete and off by seconds?
>>
>> "Lengthy events" like migration are represented by the "paused" state in
>> the diagram - i.e. it's the difference between "physical time" and "live
>> physical time". So stolen time doesn't attempt to represent that.
>>
>> And yes, there is a race between reading CNTVCT_EL0 and reading stolen
>> time - but in practice this doesn't really matter. The usual pseudo-code
>> way of using stolen time is:
> 
> I’m assuming this is the guest scheduler you are talking about,

yes

> and I’m assuming virtualization can preempt that code anywhere.
> Maybe that’s where I’m wrong?

You are correct, the guest can be preempted at any point.

> 
> For the sake of the argument, assume there is a 1s pause.
> Not completely unreasonable in a migration scenario.

As I mentioned before, events like migration are not represented by
stolen time. They would be represented by CNTVCT_EL0 appearing to pause
during the migration (so showing a difference between "physical time"
and "live physical time"). The stolen time value would not be incremented.

>>  * scheduler captures stolen time from structure and CNTVCT_EL0:
>>      before_timer = CNTVCT_EL0
> 
> [insert optional 1s pause here, case A]
> 
>>      before_stolen = stolen
>>  * schedule in process
>>  * process is pre-empted (or blocked in some way)
>>  * scheduler captures stolen time from structure and CNTVCT_EL0:
>>      after_timer = CNTVCT_EL0
> 
> [insert optional 1s pause here, case B]
> 
>>      after_stolen = stolen
>>      time = to_nsecs(after_timer - before_timer) -
>>             (after_stolen - before_stolen)
> 
> In case A, time is too big by one second. In case B, it is too small,
> to the point where your code might need to be ready for
> “time” unexpectedly showing up as negative.

So a 1 second pause is unlikely for stolen time - this means that the
VCPU was ready to run, but the host didn't run it for some reason. But
in theory you are correct this could happen. The core code deals with it
like this (update_rq_clock_task):
> 	if (static_key_false((&paravirt_steal_rq_enabled))) {
> 		steal = paravirt_steal_clock(cpu_of(rq));
> 		steal -= rq->prev_steal_time_rq;
> 
> 		if (unlikely(steal > delta))
> 			steal = delta;
> 
> 		rq->prev_steal_time_rq += steal;
> 		delta -= steal;
> 	}

So if (steal > delta) then steal is capped to delta, preventing the
final delta from going negative.

>>
>> The scheduler can then charge the process for "time" nanoseconds of
>> time. This ensures that a process isn't unfairly penalised if the host
>> doesn't schedule the VCPU while it is supposed to be running.
>>
>> The race is very small in comparison to the time the process is running,
>> and in the worst case just means the process is charged slightly more
>> (or less) than it should be.
> 
> At this point, what I don’t understand is why the race would be
> “very small” or why you would only be charged “slightly” more or less?

The window between measuring the time using CNTVCT_EL0 and getting the
stolen time from the hypervisor is pretty short. The amount of time that
is (normally) stolen in one go is also small. So the race is unlikely
and the error when it occurs is (usually) small.

Long events (such as migration or pausing the guest) are not considered
"stolen time" and should be reflected to the guest in other ways.

>> I guess if you're really worried about it, you could do a dance like:
>>
>> do {
>> before = stolen
>> timer = CNTVCT_EL0
>> after = stolen
>> } while (before != after);
> 
> That will work as long as nothing in that loop requires something
> that would cause `stolen` to jump. If there is such a guarantee,
> then that’s even efficient, because in most cases the loop
> would only run once, at the cost of one extra read and one test.

Note that other architectures don't have such loops, so arm64 is just
following the lead of existing architecture.

>> But I don't see the need to have such an accurate view of elapsed time
>> that the VCPU was scheduled. And of course at the moment (without this
>> series) the guest has no idea about time stolen by the host.
> 
> I’m certainly not arguing that exposing stolen time is a bad idea,
> I’m only wondering if the proposed solution is racy, and if so, if
> it is intentional.
> 
> If it’s indeed racy, the problem could be mitigated in a number of
> ways
> 
> a) document your loop or something similar as being the recommended
> way to avoid the race, and then ensure that the loop actually
> will always work as intended. The upside is that it’s just a change in
> some comments or documentation.
> 
> b) having a single interface that exposes multiple times. For example,
> you could have a copy of CNTVCT_EL0 written alongside stolen time,
> and then the scheduler could use that copy for its decision.

That would still be racy - the structure can be updated at any time (as
the host could interrupt the VCPU at any time), so you would still be
left with the problem of reading both atomically - which would mean
going back to the loop. This is the approach that LPT takes and is
documented in the spec.

Also I can't see why you would want to know the CNTVCT_EL0 value at the
point the stolen time was updated, it's much more useful to know the
current CNTVCT_EL0 value.

Ultimately reading the stolen time is always going to be slightly racy
because you are including some of the scheduler's time in the
calculation of how much time the process was running for. The pauses you
describe above are instances where time has been stolen from the
scheduler, but that time is being accounted for/against a user space
process. While the algorithm could be changed so that it's always a
positive for the user space process I'm not sure that's a benefit (it's
probably better that statistically it can go either way).

Steve
diff mbox series

Patch

diff --git a/Documentation/virtual/kvm/arm/pvtime.txt b/Documentation/virtual/kvm/arm/pvtime.txt
new file mode 100644
index 000000000000..e6ae9799e1d5
--- /dev/null
+++ b/Documentation/virtual/kvm/arm/pvtime.txt
@@ -0,0 +1,107 @@ 
+Paravirtualized time support for arm64
+======================================
+
+Arm specification DEN0057/A defined a standard for paravirtualised time
+support for Aarch64 guests:
+
+https://developer.arm.com/docs/den0057/a
+
+KVM/Arm64 implements the stolen time part of this specification by providing
+some hypervisor service calls to support a paravirtualized guest obtaining a
+view of the amount of time stolen from its execution.
+
+Two new SMCCC compatible hypercalls are defined:
+
+PV_FEATURES 0xC5000020
+PV_TIME_ST  0xC5000022
+
+These are only available in the SMC64/HVC64 calling convention as
+paravirtualized time is not available to 32 bit Arm guests.
+
+PV_FEATURES
+    Function ID:  (uint32)  : 0xC5000020
+    PV_func_id:   (uint32)  : Either PV_TIME_LPT or PV_TIME_ST
+    Return value: (int32)   : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant
+                              PV-time feature is supported by the hypervisor.
+
+PV_TIME_ST
+    Function ID:  (uint32)  : 0xC5000022
+    Return value: (int64)   : IPA of the stolen time data structure for this
+                              (V)CPU. On failure:
+                              NOT_SUPPORTED (-1)
+
+Stolen Time
+-----------
+
+The structure pointed to by the PV_TIME_ST hypercall is as follows:
+
+  Field       | Byte Length | Byte Offset | Description
+  ----------- | ----------- | ----------- | --------------------------
+  Revision    |      4      |      0      | Must be 0 for version 0.1
+  Attributes  |      4      |      4      | Must be 0
+  Stolen time |      8      |      8      | Stolen time in unsigned
+              |             |             | nanoseconds indicating how
+              |             |             | much time this VCPU thread
+              |             |             | was involuntarily not
+              |             |             | running on a physical CPU.
+
+The structure will be updated by the hypervisor periodically as time is stolen
+from the VCPU. It will be present within a reserved region of the normal
+memory given to the guest. The guest should not attempt to write into this
+memory. There is a structure by VCPU of the guest.
+
+User space interface
+====================
+
+User space can request that KVM provide the paravirtualized time interface to
+a guest by creating a KVM_DEV_TYPE_ARM_PV_TIME device, for example:
+
+    struct kvm_create_device pvtime_device = {
+            .type = KVM_DEV_TYPE_ARM_PV_TIME,
+            .attr = 0,
+            .flags = 0,
+    };
+
+    pvtime_fd = ioctl(vm_fd, KVM_CREATE_DEVICE, &pvtime_device);
+
+The guest IPA of the structures must be given to KVM. This is the base address
+of an array of stolen time structures (one for each VCPU). For example:
+
+    struct kvm_device_attr st_base = {
+            .group = KVM_DEV_ARM_PV_TIME_PADDR,
+            .attr = KVM_DEV_ARM_PV_TIME_ST,
+            .addr = (u64)(unsigned long)&st_paddr
+    };
+
+    ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &st_base);
+
+For migration (or save/restore) of a guest it is necessary to save the contents
+of the shared page(s) and later restore them. KVM_DEV_ARM_PV_TIME_STATE_SIZE
+provides the size of this data and KVM_DEV_ARM_PV_TIME_STATE allows the state
+to be read/written.
+
+It is also necessary for the physical address to be set identically when
+restoring.
+
+    void *save_state(int fd, u64 attr, u32 *size) {
+        struct kvm_device_attr get_size = {
+                .group = KVM_DEV_ARM_PV_TIME_STATE_SIZE,
+                .attr = attr,
+                .addr = (u64)(unsigned long)size
+        };
+
+        ioctl(fd, KVM_GET_DEVICE_ATTR, get_size);
+
+        void *buffer = malloc(*size);
+
+        struct kvm_device_attr get_state = {
+                .group = KVM_DEV_ARM_PV_TIME_STATE,
+                .attr = attr,
+                .addr = (u64)(unsigned long)size
+        };
+
+        ioctl(fd, KVM_GET_DEVICE_ATTR, buffer);
+    }
+
+    void *st_state = save_state(pvtime_fd, KVM_DEV_ARM_PV_TIME_ST, &st_size);
+