diff mbox series

[v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute

Message ID 20230202165950.483430-1-sveith@amazon.de (mailing list archive)
State New, archived
Headers show
Series [v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute | expand

Commit Message

Veith, Simon Feb. 2, 2023, 4:59 p.m. UTC
The Time Stamp Counter (TSC) value register can be set to an absolute
value using the KVM_SET_MSRS ioctl. Since this is a per-vCPU register,
and vCPUs are often managed by separate threads, setting a uniform TSC
value across all vCPUs is challenging: After liveupdate or live
migration, the TSC value may need to be adjusted to account for the
incurred downtime.

Ideally, we want such an adjustment to happen uniformly across all
vCPUs; however, if we compute the offset centrally, the TSC value may
become out of date due to scheduling delays by the time that each vCPU
thread gets around to issuing KVM_SET_MSRS. Such delays can lead to
unaccounted pause time (the guest observes that its system clock has
fallen behind the NTP reference time).

To avoid such inaccuracies from the use of KVM_SET_MSRS, there is an
alternative attribute KVM_VCPU_TSC_OFFSET which, rather than setting an
absolute TSC value, defines it in terms of the offset to be applied
relative to the host's TSC value. Using this attribute, the TSC can be
adjusted reliably by userspace, but only if TSC scaling remains
unchanged, i.e., in the case of liveupdate on the same host, and not
when live migrating an instance between hosts with different TSC scaling
parameters.

In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
preserve the TSC value and apply a known offset would require
duplicating the TSC scaling computations in userspace to account for
frequency differences between source and destination TSCs.

Hence, if userspace wants to set the TSC to some known value without
having to deal with TSC scaling, and while also being resilient against
scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_VALUE are
suitable options.

Add a new TSC attribute KVM_VCPU_TSC_VALUE that allows for setting the
TSC value in a way that is unaffected by scheduling delays, handling TSC
scaling internally.

Add an optional, KVM clock based time reference argument to
kvm_synchronize_tsc(). This argument, if present, is understood to mean
"the TSC value being written was valid at this corresponding KVM clock
time point".

Userspace provides a struct kvm_vcpu_tsc_value consisting of a matched
pair of ( guest TSC value, KVM clock value ). The TSC value that will
ultimately be written is adjusted to account for the time which has
elapsed since the given KVM clock time point.

In order to allow userspace to retrieve an accurate time reference
atomically, without being affected by scheduling delays between
KVM_GET_CLOCK and KVM_GET_MSRS, the KVM_GET_DEVICE_ATTR implementation
for this attribute uses get_kvmclock() internally and returns a struct
kvm_vcpu_tsc_value with both values in one go. If get_kvmclock()
supports the KVM_CLOCK_HOST_TSC flag, the two will be based on one and
the same host TSC reading.

Co-developed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Simon Veith <sveith@amazon.de>
---
V2:
 - Squashed into a single change
 - Added justification for introducing a new interface
 - Added missing Signed-off-by
 - Re-worded comment

 Documentation/virt/kvm/devices/vcpu.rst | 22 +++++++++
 arch/x86/include/uapi/asm/kvm.h         |  7 +++
 arch/x86/kvm/x86.c                      | 63 +++++++++++++++++++++++--
 tools/arch/x86/include/uapi/asm/kvm.h   |  7 +++
 4 files changed, 94 insertions(+), 5 deletions(-)


base-commit: 9f266ccaa2f5228bfe67ad58a94ca4e0109b954a

Comments

Sean Christopherson March 15, 2023, 7:57 p.m. UTC | #1
Please don't send vN+1 In-Reply-To vN, the threading messes up lore and other
tooling, and becomes really problematic for humans when N gets large.

On Thu, Feb 02, 2023, Simon Veith wrote:
> In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
> preserve the TSC value and apply a known offset would require
> duplicating the TSC scaling computations in userspace to account for
> frequency differences between source and destination TSCs.
> 
> Hence, if userspace wants to set the TSC to some known value without
> having to deal with TSC scaling, and while also being resilient against
> scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_VALUE are
> suitable options.

Requiring userspace to handle certain aspects of TSC scaling doesn't seem
particularly onerous, at least not relative to all the other time insanity.  In
other words, why should KVM take on more complexity and a mostly-redundant uAPI?
David Woodhouse March 23, 2023, 11:26 p.m. UTC | #2
On Wed, 2023-03-15 at 12:57 -0700, Sean Christopherson wrote:
> 
> On Thu, Feb 02, 2023, Simon Veith wrote:
> > In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
> > preserve the TSC value and apply a known offset would require
> > duplicating the TSC scaling computations in userspace to account for
> > frequency differences between source and destination TSCs.
> > 
> > Hence, if userspace wants to set the TSC to some known value without
> > having to deal with TSC scaling, and while also being resilient against
> > scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_VALUE are
> > suitable options.
> 
> Requiring userspace to handle certain aspects of TSC scaling doesn't seem
> particularly onerous, at least not relative to all the other time insanity.  In
> other words, why should KVM take on more complexity and a mostly-redundant uAPI?

Because it offers userspace a simple and easy to use API. You get the
time information, you put the time information.

Now, if you want to *document* the actual calculations that userspace
would be forced to do just to preserve the existing relationship
between the guest TSC and the time of day / host TSC (for LM / LU
respectively).... well, you can try, but it exceeded my "if it needs
documenting this much, fix it first" threshold.

Does userspace even *have* what it needs to do this *right*?
Paolo Bonzini March 24, 2023, 11:22 a.m. UTC | #3
On 3/15/23 20:57, Sean Christopherson wrote:
>> In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
>> preserve the TSC value and apply a known offset would require
>> duplicating the TSC scaling computations in userspace to account for
>> frequency differences between source and destination TSCs.
>>
>> Hence, if userspace wants to set the TSC to some known value without
>> having to deal with TSC scaling, and while also being resilient against
>> scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_VALUE are
>> suitable options.
>
> Requiring userspace to handle certain aspects of TSC scaling doesn't seem
> particularly onerous, at least not relative to all the other time insanity.  In
> other words, why should KVM take on more complexity and a mostly-redundant uAPI?

Yeah, it seems like the problem is that KVM_GET_CLOCK return host 
unscaled TSC units (which was done because the guest TSC frequency is at 
least in theory per-CPU, and KVM_GET_CLOCK is a vm ioctl)?

Perhaps it's more important (uAPI-wise) for KVM to return the precise 
guest/host TSC ratio via a vcpu device attribute?

Paolo
David Woodhouse March 24, 2023, 1:08 p.m. UTC | #4
On Fri, 2023-03-24 at 12:22 +0100, Paolo Bonzini wrote:
> On 3/15/23 20:57, Sean Christopherson wrote:
> > > In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
> > > preserve the TSC value and apply a known offset would require
> > > duplicating the TSC scaling computations in userspace to account for
> > > frequency differences between source and destination TSCs.
> > > 
> > > Hence, if userspace wants to set the TSC to some known value without
> > > having to deal with TSC scaling, and while also being resilient against
> > > scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_VALUE are
> > > suitable options.
> > 
> > Requiring userspace to handle certain aspects of TSC scaling doesn't seem
> > particularly onerous, at least not relative to all the other time insanity.  In
> > other words, why should KVM take on more complexity and a mostly-redundant uAPI?
> 
> Yeah, it seems like the problem is that KVM_GET_CLOCK return host 
> unscaled TSC units (which was done because the guest TSC frequency is at 
> least in theory per-CPU, and KVM_GET_CLOCK is a vm ioctl)?
> 
> Perhaps it's more important (uAPI-wise) for KVM to return the precise
> guest/host TSC ratio via a vcpu device attribute?
> 

My criteria for this are that in the case of a live update (serialize
guest, kexec, resume guest on precisely the same hardware a few
milliseconds of steal time later), the guest clocks (KVM_CLOCK, TSC,
etc) shall be *precisely* the same as before with no slop. The same
offset, the same scaling. And ideally precisely the same values
advertised in the pvclock data to the guest.

In the case of live migration, we can't be cycle-accurate because of
course we're limited to the accuracy of the NTP sync between hosts. But
that is the *only* inaccuracy we shall incur, because we can express
clocks in terms of each other, e.g. <KVM clock was X at time of day Y>
and then e.g. <TSC was W at KVM clock X> is unchanged from before.

It's OK to expect userspace to do some calculation, as long as we never
expect userspace to magically perform calculations based on some
concept of "now", and to call kernel APIs, without any actual time
elapsing while it does so.

I said it's OK to expect userspace to do *some* calculation. But that
should be clearly documented, *and* when we document it, that
documentation shouldn't codify too much of the kernel's internal
relationships between clocks, and shouldn't make us ashamed to be
kernel engineers. 

We tried documenting it, in
https://lore.kernel.org/all/20220316045308.2313184-1-oupton@google.com/

I don't quite know how to summarise that thread, other than "it's too
broken; let's fix it first and *then* document it".

But if it can be done, I'm happy for someone to fix the documentation
in a way which describes how to meet the above criteria using the
existing kernel APIs. And then perhaps we can make a decision about
just how ashamed of ourselves we should be, and whether we want to
provide a better, easier API for userspace to use.
David Woodhouse Sept. 13, 2023, 2:08 p.m. UTC | #5
On Fri, 2023-03-24 at 12:22 +0100, Paolo Bonzini wrote:
> On 3/15/23 20:57, Sean Christopherson wrote:
> > > In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
> > > preserve the TSC value and apply a known offset would require
> > > duplicating the TSC scaling computations in userspace to account for
> > > frequency differences between source and destination TSCs.
> > > 
> > > Hence, if userspace wants to set the TSC to some known value without
> > > having to deal with TSC scaling, and while also being resilient against
> > > scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_VALUE are
> > > suitable options.
> > 
> > Requiring userspace to handle certain aspects of TSC scaling doesn't seem
> > particularly onerous, at least not relative to all the other time insanity.  In
> > other words, why should KVM take on more complexity and a mostly-redundant uAPI?
> 
> Yeah, it seems like the problem is that KVM_GET_CLOCK return host 
> unscaled TSC units (which was done because the guest TSC frequency is at 
> least in theory per-CPU, and KVM_GET_CLOCK is a vm ioctl)?
> 
> Perhaps it's more important (uAPI-wise) for KVM to return the precise
> guest/host TSC ratio via a vcpu device attribute?

Perhaps. Although that really does suck as a userspace experience. See
the patch I just sent.
diff mbox series

Patch

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 31f14ec4a65b..240a3646947c 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -265,3 +265,25 @@  From the destination VMM process:
 
 7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
    respective value derived in the previous step.
+
+4.2 ATTRIBUTE: KVM_VCPU_TSC_VALUE
+
+:Parameters: kvm_device_attr.addr points to a struct kvm_vcpu_tsc_value
+
+Returns:
+
+	 ======= ======================================
+	 -EFAULT Error reading/writing the provided
+		 parameter address.
+	 -ENXIO  Attribute not supported
+	 ======= ======================================
+
+Gets or sets a matched pair of guest TSC value and KVM clock time point.
+
+When setting the TSC value through this attribute, a corresponding KVM clock
+reference time point (as retrieved by KVM_GET_CLOCK in the clock field) must be
+provided.
+
+The actual TSC value written will be adjusted based on the time that has
+elapsed since the provided reference time point, taking TSC scaling into
+account.
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index e48deab8901d..f99bdb959b54 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -528,5 +528,12 @@  struct kvm_pmu_event_filter {
 /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+#define   KVM_VCPU_TSC_VALUE 1 /* attribute for the TSC value */
+
+/* for KVM_VCPU_TSC_VALUE */
+struct kvm_vcpu_tsc_value {
+	__u64 tsc_val;
+	__u64 kvm_ns;
+};
 
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index da4bbd043a7b..b174200c909b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2655,7 +2655,7 @@  static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm_track_tsc_matching(vcpu);
 }
 
-static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data, u64 *kvm_ns)
 {
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
@@ -2664,12 +2664,24 @@  static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 	bool synchronizing = false;
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
-	offset = kvm_compute_l1_tsc_offset(vcpu, data);
 	ns = get_kvmclock_base_ns();
+
+	if (kvm_ns) {
+		/*
+		 * kvm_ns is the KVM clock reference time point at which this
+		 * TSC value was correct. Use this time point to compensate for
+		 * any delays that have been incurred since that TSC value was
+		 * valid.
+		 */
+		s64 delta_ns = ns + vcpu->kvm->arch.kvmclock_offset - *kvm_ns;
+		data += nsec_to_cycles(vcpu, (u64)delta_ns);
+	}
+
+	offset = kvm_compute_l1_tsc_offset(vcpu, data);
 	elapsed = ns - kvm->arch.last_tsc_nsec;
 
 	if (vcpu->arch.virtual_tsc_khz) {
-		if (data == 0) {
+		if (data == 0 && !kvm_ns) {
 			/*
 			 * detection of vcpu initialization -- need to sync
 			 * with other vCPUs. This particularly helps to keep
@@ -3672,7 +3684,7 @@  int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_TSC:
 		if (msr_info->host_initiated) {
-			kvm_synchronize_tsc(vcpu, data);
+			kvm_synchronize_tsc(vcpu, data, NULL);
 		} else {
 			u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
 			adjust_tsc_offset_guest(vcpu, adj);
@@ -5375,6 +5387,7 @@  static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu,
 
 	switch (attr->attr) {
 	case KVM_VCPU_TSC_OFFSET:
+	case KVM_VCPU_TSC_VALUE:
 		r = 0;
 		break;
 	default:
@@ -5400,6 +5413,32 @@  static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
 			break;
 		r = 0;
 		break;
+	case KVM_VCPU_TSC_VALUE: {
+		struct kvm_vcpu_tsc_value __user *tsc_value_arg;
+		struct kvm_vcpu_tsc_value tsc_value;
+		struct kvm_clock_data kvm_clock;
+		u64 host_tsc, guest_tsc, ratio, offset;
+
+		get_kvmclock(vcpu->kvm, &kvm_clock);
+		if (kvm_clock.flags & KVM_CLOCK_HOST_TSC)
+			host_tsc = kvm_clock.host_tsc;
+		else
+			host_tsc = rdtsc();
+
+		ratio = vcpu->arch.l1_tsc_scaling_ratio;
+		offset = vcpu->arch.l1_tsc_offset;
+		guest_tsc = kvm_scale_tsc(host_tsc, ratio) + offset;
+
+		tsc_value.kvm_ns = kvm_clock.clock;
+		tsc_value.tsc_val = guest_tsc;
+
+		tsc_value_arg = (struct kvm_vcpu_tsc_value __user *)uaddr;
+		r = -EFAULT;
+		if (copy_to_user(tsc_value_arg, &tsc_value, sizeof(tsc_value)))
+			break;
+		r = 0;
+		break;
+	}
 	default:
 		r = -ENXIO;
 	}
@@ -5442,6 +5481,20 @@  static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
 		r = 0;
 		break;
 	}
+	case KVM_VCPU_TSC_VALUE: {
+		struct kvm_vcpu_tsc_value __user *tsc_value_arg;
+		struct kvm_vcpu_tsc_value tsc_value;
+
+		tsc_value_arg = (struct kvm_vcpu_tsc_value __user *)uaddr;
+		r = -EFAULT;
+		if (copy_from_user(&tsc_value, tsc_value_arg, sizeof(tsc_value)))
+			break;
+
+		kvm_synchronize_tsc(vcpu, tsc_value.tsc_val, &tsc_value.kvm_ns);
+
+		r = 0;
+		break;
+	}
 	default:
 		r = -ENXIO;
 	}
@@ -11668,7 +11721,7 @@  void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 	if (mutex_lock_killable(&vcpu->mutex))
 		return;
 	vcpu_load(vcpu);
-	kvm_synchronize_tsc(vcpu, 0);
+	kvm_synchronize_tsc(vcpu, 0, NULL);
 	vcpu_put(vcpu);
 
 	/* poll control enabled by default */
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index e48deab8901d..f99bdb959b54 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -528,5 +528,12 @@  struct kvm_pmu_event_filter {
 /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+#define   KVM_VCPU_TSC_VALUE 1 /* attribute for the TSC value */
+
+/* for KVM_VCPU_TSC_VALUE */
+struct kvm_vcpu_tsc_value {
+	__u64 tsc_val;
+	__u64 kvm_ns;
+};
 
 #endif /* _ASM_X86_KVM_H */