diff mbox series

[06/10] KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration

Message ID 20240418193528.41780-7-dwmw2@infradead.org (mailing list archive)
State New
Headers show
Series Cleaning up the KVM clock mess | expand

Commit Message

David Woodhouse April 18, 2024, 7:34 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

The documentation on TSC migration using KVM_VCPU_TSC_OFFSET is woefully
inadequate. It ignores TSC scaling, and ignores the fact that the host
TSC may differ from one host to the next (and in fact because of the way
the kernel calibrates it, it generally differs from one boot to the next
even on the same hardware).

Add KVM_VCPU_TSC_SCALE to extract the actual scale ratio and frac_bits,
and attempt to document the *awful* process that we're requiring userspace
to follow to merely preserve the TSC across migration.

I may have thrown up in my mouth a little when writing that documentation.
It's an awful API. If we do this, we should be ashamed of ourselves.
(I also haven't tested the documented process yet).

Let's use Simon's KVM_VCPU_TSC_VALUE instead.
https://lore.kernel.org/all/20230202165950.483430-1-sveith@amazon.de/

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 Documentation/virt/kvm/devices/vcpu.rst | 115 ++++++++++++++++++------
 arch/x86/include/uapi/asm/kvm.h         |   6 ++
 arch/x86/kvm/x86.c                      |  15 ++++
 3 files changed, 109 insertions(+), 27 deletions(-)

Comments

Durrant, Paul April 19, 2024, 3:49 p.m. UTC | #1
On 18/04/2024 20:34, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The documentation on TSC migration using KVM_VCPU_TSC_OFFSET is woefully
> inadequate. It ignores TSC scaling, and ignores the fact that the host
> TSC may differ from one host to the next (and in fact because of the way
> the kernel calibrates it, it generally differs from one boot to the next
> even on the same hardware).
> 
> Add KVM_VCPU_TSC_SCALE to extract the actual scale ratio and frac_bits,
> and attempt to document the *awful* process that we're requiring userspace
> to follow to merely preserve the TSC across migration.
> 
> I may have thrown up in my mouth a little when writing that documentation.
> It's an awful API. If we do this, we should be ashamed of ourselves.
> (I also haven't tested the documented process yet).
> 
> Let's use Simon's KVM_VCPU_TSC_VALUE instead.
> https://lore.kernel.org/all/20230202165950.483430-1-sveith@amazon.de/
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   Documentation/virt/kvm/devices/vcpu.rst | 115 ++++++++++++++++++------
>   arch/x86/include/uapi/asm/kvm.h         |   6 ++
>   arch/x86/kvm/x86.c                      |  15 ++++
>   3 files changed, 109 insertions(+), 27 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>
David Woodhouse April 19, 2024, 3:53 p.m. UTC | #2
On Fri, 2024-04-19 at 16:49 +0100, Paul Durrant wrote:
> On 18/04/2024 20:34, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > The documentation on TSC migration using KVM_VCPU_TSC_OFFSET is woefully
> > inadequate. It ignores TSC scaling, and ignores the fact that the host
> > TSC may differ from one host to the next (and in fact because of the way
> > the kernel calibrates it, it generally differs from one boot to the next
> > even on the same hardware).
> > 
> > Add KVM_VCPU_TSC_SCALE to extract the actual scale ratio and frac_bits,
> > and attempt to document the *awful* process that we're requiring userspace
> > to follow to merely preserve the TSC across migration.
> > 
> > I may have thrown up in my mouth a little when writing that documentation.
> > It's an awful API. If we do this, we should be ashamed of ourselves.
> > (I also haven't tested the documented process yet).
> > 
> > Let's use Simon's KVM_VCPU_TSC_VALUE instead.
> > https://lore.kernel.org/all/20230202165950.483430-1-sveith@amazon.de/

Oops, I meant to change that commit message. I've changed my mind, as
KVM_VCPU_TSC_VALUE sets the TSC using the KVM clock as a reference...
which is utterly backwards, since the KVM clock is *defined* in terms
of the TSC.

Will do so before posting it again.

May also include the "loop over KVM_GET_CLOCK until tai_offset didn't
change" part to fix the leap seconbd problem.

I think we do need to implement a better way to get { TAI, host TSC }
than abusing KVM_GET_CLOCK when we don't even *care* about the KVM
clock. We need a way to migrate the arch counters on platforms other
than x86 too. But I'm trying to leave that part for later. At least on
x86 we *do* have a way to migrate the TSC accurately, even if it's
awful.

> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > ---
> >   Documentation/virt/kvm/devices/vcpu.rst | 115 ++++++++++++++++++------
> >   arch/x86/include/uapi/asm/kvm.h         |   6 ++
> >   arch/x86/kvm/x86.c                      |  15 ++++
> >   3 files changed, 109 insertions(+), 27 deletions(-)
> > 
> 
> Reviewed-by: Paul Durrant <paul@xen.org>
>
diff mbox series

Patch

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 31f14ec4a65b..2a054443118d 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -216,52 +216,113 @@  Returns:
 Specifies the guest's TSC offset relative to the host's TSC. The guest's
 TSC is then derived by the following equation:
 
-  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
+  guest_tsc = (( host_tsc * tsc_scale_ratio ) >> tsc_scale_bits ) + KVM_VCPU_TSC_OFFSET
+
+The values of tsc_scale_ratio and tsc_scale_bits can be obtained using
+the KVM_VCPU_TSC_SCALE attribute.
 
 This attribute is useful to adjust the guest's TSC on live migration,
 so that the TSC counts the time during which the VM was paused. The
-following describes a possible algorithm to use for this purpose.
+following describes a possible algorithm to use for this purpose,
 
 From the source VMM process:
 
-1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
+1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (host_tsc_src),
    kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
-   (host_src).
+   (time_src) at a given moment (Tsrc).
+
+2. For each vCPU[i]:
+
+   a. Read the KVM_VCPU_TSC_OFFSET attribute to record the guest TSC offset
+      (ofs_src[i]).
 
-2. Read the KVM_VCPU_TSC_OFFSET attribute for every vCPU to record the
-   guest TSC offset (ofs_src[i]).
+   b. Read the KVM_VCPU_TSC_SCALE attribute to record the guest TSC scaling
+      ratio (ratio_src[i], frac_bits_src[i]).
 
-3. Invoke the KVM_GET_TSC_KHZ ioctl to record the frequency of the
-   guest's TSC (freq).
+   c. Use host_tsc_src and the scaling/offset factors to calculate this
+      vCPU's TSC at time Tsrc:
+      tsc_src[i] = (( host_tsc_src * ratio_src[i] ) >> frac_bits_src[i] ) + ofs_src[i]
+
+3. Invoke the KVM_GET_CLOCK_GUEST ioctl on the boot vCPU to return the KVM
+   clock as a function of the guest TSC (pvti_src).  (This ioctl not succeed
+   if the host and guest TSCs are not consistent and well-behaved.)
 
 From the destination VMM process:
 
-4. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
-   kvmclock (guest_src) and CLOCK_REALTIME (host_src) in their respective
+4. Before creating the vCPUs, invoke the KVM_SET_TSC_KHZ ioctl on the VM, to
+   set the scaled frequency of the guest's TSC (freq).
+
+5. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
+   kvmclock (guest_src) and CLOCK_REALTIME (time_src) in their respective
    fields.  Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
    structure.
 
-   KVM will advance the VM's kvmclock to account for elapsed time since
-   recording the clock values.  Note that this will cause problems in
+   KVM will restore the VM's kvmclock, accounting for elapsed time since
+   the clock values were recorded.  Note that this will cause problems in
    the guest (e.g., timeouts) unless CLOCK_REALTIME is synchronized
    between the source and destination, and a reasonably short time passes
-   between the source pausing the VMs and the destination executing
-   steps 4-7.
+   between the source pausing the VMs and the destination resuming them.
+   Due to the KVM_[SG]ET_CLOCK API using CLOCK_REALTIME instead of
+   CLOCK_TAI, leap seconds during the migration may also introduce errors.
+
+6. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (host_tsc_dst) and
+   host CLOCK_REALTIME nanoseconds (time_dst) at a given moment (Tdst).
+
+7. Calculate the number of nanoseconds elapsed between Tsrc and Tdst:
+   ΔT = time_dst - time_src
+
+8. As each vCPU[i] is created:
+
+   a. Read the KVM_VCPU_TSC_SCALE attribute to record the guest TSC scaling
+      ratio (ratio_dst[i], frac_bits_dst[i]).
+
+   b. Calculate the intended guest TSC value at time Tdst:
+      tsc_dst[i] = tsc_tsc[i] + (ΔT * freq[i])
+
+   c. Use host_tsc_dst and the scaling/offset factors to calculate this vCPU's
+      TSC at time Tsrc without taking offsetting into account:
+      raw_dst[i] = (( host_tsc_dst * ratio_dst[i] ) >> frac_bits_dst[i] )
+
+   d. Calculate ofs_src[i] = tsc_dst[i] + raw_dst[i] and set the resulting
+      offset using the KVM_VCPU_TSC_OFFSET attrribute.
 
-5. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_dest) and
-   kvmclock nanoseconds (guest_dest).
+9. If pvti_src was provided, invoke the KVM_SET_CLOCK_GUEST ioctl on the boot
+   vCPU to restore the KVM clock as a precise function of the guest TSC. If
+   pvti_src was not provided by the source, or the ioctl fails on the
+   destination, the KVM clock is operating in its less precise mode where it
+   is defined in terms of the host CLOCK_MONOTONIC_RAW, so the value
+   previously set in step 5 is as accurate as it can be.
+
+4.2 ATTRIBUTE: KVM_VCPU_TSC_SCALE
+
+:Parameters: 64-bit fixed point TSC scale factor
+
+Returns:
+
+	 ======= ======================================
+	 -EFAULT Error reading the provided parameter
+		 address.
+	 -ENXIO  Attribute not supported
+	 -EINVAL Invalid request to write the attribute
+	 ======= ======================================
+
+This read-only attribute reports the guest's TSC scaling factor, in the form
+of a fixed-point number represented by the following structure:
+
+    struct kvm_vcpu_tsc_scale {
+	    __u64	tsc_ratio;
+	    __u64	tsc_frac_bits;
+    };
 
-6. Adjust the guest TSC offsets for every vCPU to account for (1) time
-   elapsed since recording state and (2) difference in TSCs between the
-   source and destination machine:
 
-   ofs_dst[i] = ofs_src[i] -
-     (guest_src - guest_dest) * freq +
-     (tsc_src - tsc_dest)
+The tsc_frac_bits field indicate the location of the fixed point, such that
+host TSC values are converted to guest TSC using the formula:
 
-   ("ofs[i] + tsc - guest * freq" is the guest TSC value corresponding to
-   a time of 0 in kvmclock.  The above formula ensures that it is the
-   same on the destination as it was on the source).
+    guest_tsc = ( ( host_tsc * tsc_ratio ) >> tsc_frac_bits) + offset
 
-7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
-   respective value derived in the previous step.
+Userspace can use this to precisely calculate the guest TSC from the host
+TSC at any given moment. This is needed for accurate migration of guests,
+as described in the documentation for the KVM_VCPU_TSC_OFFSET attribute.
+In conjunction with the KVM_GET_CLOCK_GUEST ioctl, it also provides a way
+for userspace to quickly calculate the KVM clock for a guest, to use as a
+time reference for hypercalls or emulation of other timer devices.
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 72ad5ace118d..fe7c98907818 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -864,6 +864,12 @@  struct kvm_hyperv_eventfd {
 /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+#define   KVM_VCPU_TSC_SCALE  1 /* attribute for TSC scaling factor */
+
+struct kvm_vcpu_tsc_scale {
+	__u64 tsc_ratio;
+	__u64 tsc_frac_bits;
+};
 
 /* x86-specific KVM_EXIT_HYPERCALL flags. */
 #define KVM_EXIT_HYPERCALL_LONG_MODE	_BITULL(0)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 42abce7b4fc9..00a7c1188dec 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5715,6 +5715,7 @@  static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu,
 
 	switch (attr->attr) {
 	case KVM_VCPU_TSC_OFFSET:
+	case KVM_VCPU_TSC_SCALE:
 		r = 0;
 		break;
 	default:
@@ -5737,6 +5738,17 @@  static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
 			break;
 		r = 0;
 		break;
+	case KVM_VCPU_TSC_SCALE: {
+		struct kvm_vcpu_tsc_scale scale;
+
+		scale.tsc_ratio = vcpu->arch.l1_tsc_scaling_ratio;
+		scale.tsc_frac_bits = kvm_caps.tsc_scaling_ratio_frac_bits;
+		r = -EFAULT;
+		if (copy_to_user(uaddr, &scale, sizeof(scale)))
+			break;
+		r = 0;
+		break;
+	}
 	default:
 		r = -ENXIO;
 	}
@@ -5777,6 +5789,9 @@  static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
 		r = 0;
 		break;
 	}
+	case KVM_VCPU_TSC_SCALE:
+		r = -EINVAL; /* Read only */
+		break;
 	default:
 		r = -ENXIO;
 	}