
KVM: x86/xen: improve accuracy of Xen timers

Message ID 96da7273adfff2a346de9a4a27ce064f6fe0d0a1.camel@infradead.org (mailing list archive)
State New, archived
Series KVM: x86/xen: improve accuracy of Xen timers

Commit Message

David Woodhouse Oct. 30, 2023, 3:50 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

A test program such as http://david.woodhou.se/timerlat.c confirms user
reports that timers are increasingly inaccurate as the lifetime of a
guest increases. Reporting the actual delay observed when asking for
100µs of sleep, it starts off OK on a newly-launched guest but gets
worse over time, giving incorrect sleep times:

root@ip-10-0-193-21:~# ./timerlat -c -n 5
00000000 latency 103243/100000 (3.2430%)
00000001 latency 103243/100000 (3.2430%)
00000002 latency 103242/100000 (3.2420%)
00000003 latency 103245/100000 (3.2450%)
00000004 latency 103245/100000 (3.2450%)

The biggest problem is that get_kvmclock_ns() returns inaccurate values
when the guest TSC is scaled. The guest sees a TSC value scaled from the
host TSC by a mul/shift conversion (hopefully done in hardware). The
guest then converts that guest TSC value into nanoseconds using the
mul/shift conversion given to it by the KVM pvclock information.

But get_kvmclock_ns() performs only a single conversion directly from
host TSC to nanoseconds, giving a different result. A test program at
http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
over a day.
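
The effect of the double scaling can be illustrated with a small standalone
program (a sketch, not kernel code: the frequencies are made up and the
fixed-point helper only mirrors the shape of the pvclock mul/shift
conversion):

#include <stdint.h>
#include <stdio.h>

/* (delta << shift) * mul >> 32, the same shape as the guest pvclock math */
static uint64_t scale(uint64_t delta, uint32_t mul, int shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* Build a pvclock-style multiplier for "ticks at hz -> ns", paired with shift = -1 */
static uint32_t ns_mul_for(uint64_t hz)
{
	return (uint32_t)(((2ULL * 1000000000ULL) << 32) / hz);
}

int main(void)
{
	const uint64_t host_hz = 2900000000ULL, guest_hz = 2600000000ULL;
	const uint64_t host_tsc_delta = 24ULL * 3600 * host_hz;	/* ~one day */

	/* Host TSC scaled to guest TSC by a 32.32 fixed-point ratio */
	uint32_t tsc_ratio_mul = (uint32_t)((guest_hz << 32) / host_hz);
	uint64_t guest_tsc = scale(host_tsc_delta, tsc_ratio_mul, 0);

	/* What the guest computes from its own pvclock mul/shift */
	uint64_t two_step = scale(guest_tsc, ns_mul_for(guest_hz), -1);
	/* What a single host-TSC -> ns conversion computes */
	uint64_t one_step = scale(host_tsc_delta, ns_mul_for(host_hz), -1);

	printf("guest's own calculation: %llu ns\n", (unsigned long long)two_step);
	printf("single host conversion:  %llu ns\n", (unsigned long long)one_step);
	printf("difference after a day:  %lld ns\n", (long long)(two_step - one_step));
	return 0;
}

Each mul/shift step truncates differently, so the two results diverge as the
TSC delta grows; that is the systemic error avoided by always converting via
the guest TSC.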

It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
that. The actual guest hv_clock is per-CPU, and *theoretically* each
vCPU could be running at a *different* frequency. But this patch is
needed anyway because...

The other issue with Xen timers was that the code would snapshot the
host CLOCK_MONOTONIC at some point in time, and then... after a few
interrupts may have occurred, some preemption perhaps... would also read
the guest's kvmclock. Then it would proceed under the false assumption
that those two happened at the *same* time. Any time which *actually*
elapsed between reading the two clocks was introduced as inaccuracies
in the time at which the timer fired.

Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
host TSC just *once*, then use the returned TSC value to calculate the
kvmclock (making sure to do that the way the guest would instead of
making the same mistake get_kvmclock_ns() does).

Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
timers still have to use CLOCK_MONOTONIC. In practice the difference
between the two won't matter over the timescales involved, as the
*absolute* values don't matter; just the delta.

This does mean a new variant of kvm_get_time_and_clockread() is needed;
called kvm_get_monotonic_and_clockread() because that's what it does.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c |  30 ++++++++++++
 arch/x86/kvm/x86.h |   1 +
 arch/x86/kvm/xen.c | 111 +++++++++++++++++++++++++++++++--------------
 3 files changed, 109 insertions(+), 33 deletions(-)

Comments

Paul Durrant Oct. 31, 2023, 10:42 a.m. UTC | #1
On 30/10/2023 15:50, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> A test program such as http://david.woodhou.se/timerlat.c confirms user
> reports that timers are increasingly inaccurate as the lifetime of a
> guest increases. Reporting the actual delay observed when asking for
> 100µs of sleep, it starts off OK on a newly-launched guest but gets
> worse over time, giving incorrect sleep times:
> 
> root@ip-10-0-193-21:~# ./timerlat -c -n 5
> 00000000 latency 103243/100000 (3.2430%)
> 00000001 latency 103243/100000 (3.2430%)
> 00000002 latency 103242/100000 (3.2420%)
> 00000003 latency 103245/100000 (3.2450%)
> 00000004 latency 103245/100000 (3.2450%)
> 
> The biggest problem is that get_kvmclock_ns() returns inaccurate values
> when the guest TSC is scaled. The guest sees a TSC value scaled from the
> host TSC by a mul/shift conversion (hopefully done in hardware). The
> guest then converts that guest TSC value into nanoseconds using the
> mul/shift conversion given to it by the KVM pvclock information.
> 
> But get_kvmclock_ns() performs only a single conversion directly from
> host TSC to nanoseconds, giving a different result. A test program at
> http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
> over a day.
> 
> It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
> that. The actual guest hv_clock is per-CPU, and *theoretically* each
> vCPU could be running at a *different* frequency. But this patch is
> needed anyway because...
> 
> The other issue with Xen timers was that the code would snapshot the
> host CLOCK_MONOTONIC at some point in time, and then... after a few
> interrupts may have occurred, some preemption perhaps... would also read
> the guest's kvmclock. Then it would proceed under the false assumption
> that those two happened at the *same* time. Any time which *actually*
> elapsed between reading the two clocks was introduced as inaccuracies
> in the time at which the timer fired.
> 
> Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
> host TSC just *once*, then use the returned TSC value to calculate the
> kvmclock (making sure to do that the way the guest would instead of
> making the same mistake get_kvmclock_ns() does).
> 
> Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
> timers still have to use CLOCK_MONOTONIC. In practice the difference
> between the two won't matter over the timescales involved, as the
> *absolute* values don't matter; just the delta.
> 
> This does mean a new variant of kvm_get_time_and_clockread() is needed;
> called kvm_get_monotonic_and_clockread() because that's what it does.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/kvm/x86.c |  30 ++++++++++++
>   arch/x86/kvm/x86.h |   1 +
>   arch/x86/kvm/xen.c | 111 +++++++++++++++++++++++++++++++--------------
>   3 files changed, 109 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 41cce5031126..aeede83d65dc 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2863,6 +2863,25 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
>   	return mode;
>   }
>   
> +static int do_monotonic(s64 *t, u64 *tsc_timestamp)
> +{
> +	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
> +	unsigned long seq;
> +	int mode;
> +	u64 ns;
> +
> +	do {
> +		seq = read_seqcount_begin(&gtod->seq);
> +		ns = gtod->clock.base_cycles;
> +		ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
> +		ns >>= gtod->clock.shift;
> +		ns += ktime_to_ns(ktime_add(gtod->clock.offset, gtod->offs_boot));
> +	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
> +	*t = ns;
> +
> +	return mode;
> +}
> +
>   static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
>   {
>   	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
> @@ -2895,6 +2914,17 @@ static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
>   						      tsc_timestamp));
>   }
>   
> +/* returns true if host is using TSC based clocksource */
> +bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
> +{
> +	/* checked again under seqlock below */
> +	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
> +		return false;
> +
> +	return gtod_is_based_on_tsc(do_monotonic(kernel_ns,
> +						 tsc_timestamp));
> +}
> +
>   /* returns true if host is using TSC based clocksource */
>   static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
>   					   u64 *tsc_timestamp)
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 1e7be1f6ab29..c08c6f729965 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -293,6 +293,7 @@ static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
>   void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
>   
>   u64 get_kvmclock_ns(struct kvm *kvm);
> +bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp);
>   
>   int kvm_read_guest_virt(struct kvm_vcpu *vcpu,
>   	gva_t addr, void *val, unsigned int bytes,
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 0ea6016ad132..00a1e924a717 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -24,6 +24,7 @@
>   #include <xen/interface/sched.h>
>   
>   #include <asm/xen/cpuid.h>
> +#include <asm/pvclock.h>
>   
>   #include "cpuid.h"
>   #include "trace.h"
> @@ -144,17 +145,87 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
>   	return HRTIMER_NORESTART;
>   }
>   
> -static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
> +static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,
> +				bool linux_wa)
>   {
> +	uint64_t guest_now;
> +	int64_t kernel_now, delta;
> +
> +	 /*
> +	 * The guest provides the requested timeout in absolute nanoseconds
> +	 * of the KVM clock — as *it* sees it, based on the scaled TSC and
> +	 * the pvclock information provided by KVM.
> +	 *
> +	 * The kernel doesn't support hrtimers based on CLOCK_MONOTONIC_RAW
> +	 * so use CLOCK_MONOTONIC. In the timescales covered by timers, the
> +	 * difference won't matter much as there is no cumulative effect.
> +	 *
> +	 * Calculate the time for some arbitrary point in time around "now"
> +	 * in terms of both kvmclock and CLOCK_MONOTONIC. Calculate the
> +	 * delta between the kvmclock "now" value and the guest's requested
> +	 * timeout, apply the "Linux workaround" described below, and add
> +	 * the resulting delta to the CLOCK_MONOTONIC "now" value, to get
> +	 * the absolute CLOCK_MONOTONIC time at which the timer should
> +	 * fire.
> +	 */
> +	if (vcpu->kvm->arch.use_master_clock &&
> +	    static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
> +		uint64_t host_tsc, guest_tsc;
> +
> +		if (!IS_ENABLED(CONFIG_64BIT) ||
> +		    !kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) {
> +			/*
> +			 * Don't fall back to get_kvmclock_ns() because it's
> +			 * broken; it has a systemic error in its results
> +			 * because it scales directly from host TSC to
> +			 * nanoseconds, and doesn't scale first to guest TSC
> +			 * and *then* to nanoseconds as the guest does.
> +			 *
> +			 * There is a small error introduced here because time
> +			 * continues to elapse between the ktime_get() and the
> +			 * subsequent rdtsc(). But not the systemic drift due
> +			 * to get_kvmclock_ns().
> +			 */
> +			kernel_now = ktime_get(); /* This is CLOCK_MONOTONIC */
> +			host_tsc = rdtsc();
> +		}
> +
> +		/* Calculate the guest kvmclock as the guest would do it. */
> +		guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> +		guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock, guest_tsc);
> +	} else {
> +		/* Without CONSTANT_TSC, get_kvmclock_ns() is the only option */
> +		guest_now = get_kvmclock_ns(vcpu->kvm);
> +		kernel_now = ktime_get();
> +	}
> +
> +	delta = guest_abs - guest_now;
> +
> +	/* Xen has a 'Linux workaround' in do_set_timer_op() which
> +	 * checks for negative absolute timeout values (caused by
> +	 * integer overflow), and for values about 13 days in the
> +	 * future (2^50ns) which would be caused by jiffies
> +	 * overflow. For those cases, it sets the timeout 100ms in
> +	 * the future (not *too* soon, since if a guest really did
> +	 * set a long timeout on purpose we don't want to keep
> +	 * churning CPU time by waking it up).
> +	 */
> +	if (linux_wa) {
> +		if ((unlikely((int64_t)guest_abs < 0 ||
> +			      (delta > 0 && (uint32_t) (delta >> 50) != 0)))) {
> +			delta = 100 * NSEC_PER_MSEC;
> +			guest_abs = guest_now + delta;
> +		}
> +	}
> +
>   	atomic_set(&vcpu->arch.xen.timer_pending, 0);
>   	vcpu->arch.xen.timer_expires = guest_abs;
>   
> -	if (delta_ns <= 0) {
> +	if (delta <= 0) {
>   		xen_timer_callback(&vcpu->arch.xen.timer);
>   	} else {
> -		ktime_t ktime_now = ktime_get();
>   		hrtimer_start(&vcpu->arch.xen.timer,
> -			      ktime_add_ns(ktime_now, delta_ns),
> +			      ktime_add_ns(kernel_now, delta),
>   			      HRTIMER_MODE_ABS_HARD);
>   	}
>   }
> @@ -923,8 +994,7 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
>   		/* Start the timer if the new value has a valid vector+expiry. */
>   		if (data->u.timer.port && data->u.timer.expires_ns)
>   			kvm_xen_start_timer(vcpu, data->u.timer.expires_ns,
> -					    data->u.timer.expires_ns -
> -					    get_kvmclock_ns(vcpu->kvm));
> +					    false);

There is no documented ordering requirement on setting 
KVM_XEN_VCPU_ATTR_TYPE_TIMER versus KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO or 
KVM_XEN_ATTR_TYPE_SHARED_INFO but kvm_xen_start_timer() now needs the 
vCPU's pvclock to be valid. Should actually starting the timer not be 
deferred until then? (Or simply add a check here and have the attribute 
setting fail if the pvclock is not valid).

   Paul

>   
>   		r = 0;
>   		break;
> @@ -1340,7 +1410,6 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
>   {
>   	struct vcpu_set_singleshot_timer oneshot;
>   	struct x86_exception e;
> -	s64 delta;
>   
>   	if (!kvm_xen_timer_enabled(vcpu))
>   		return false;
> @@ -1374,13 +1443,7 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
>   			return true;
>   		}
>   
> -		delta = oneshot.timeout_abs_ns - get_kvmclock_ns(vcpu->kvm);
> -		if ((oneshot.flags & VCPU_SSHOTTMR_future) && delta < 0) {
> -			*r = -ETIME;
> -			return true;
> -		}
> -
> -		kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, delta);
> +		kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, false);
>   		*r = 0;
>   		return true;
>   
> @@ -1404,25 +1467,7 @@ static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout,
>   		return false;
>   
>   	if (timeout) {
> -		uint64_t guest_now = get_kvmclock_ns(vcpu->kvm);
> -		int64_t delta = timeout - guest_now;
> -
> -		/* Xen has a 'Linux workaround' in do_set_timer_op() which
> -		 * checks for negative absolute timeout values (caused by
> -		 * integer overflow), and for values about 13 days in the
> -		 * future (2^50ns) which would be caused by jiffies
> -		 * overflow. For those cases, it sets the timeout 100ms in
> -		 * the future (not *too* soon, since if a guest really did
> -		 * set a long timeout on purpose we don't want to keep
> -		 * churning CPU time by waking it up).
> -		 */
> -		if (unlikely((int64_t)timeout < 0 ||
> -			     (delta > 0 && (uint32_t) (delta >> 50) != 0))) {
> -			delta = 100 * NSEC_PER_MSEC;
> -			timeout = guest_now + delta;
> -		}
> -
> -		kvm_xen_start_timer(vcpu, timeout, delta);
> +		kvm_xen_start_timer(vcpu, timeout, true);
>   	} else {
>   		kvm_xen_stop_timer(vcpu);
>   	}
David Woodhouse Oct. 31, 2023, 11:42 a.m. UTC | #2
On 31 October 2023 10:42:42 GMT, Paul Durrant <xadimgnik@gmail.com> wrote:
>There is no documented ordering requirement on setting KVM_XEN_VCPU_ATTR_TYPE_TIMER versus KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO or KVM_XEN_ATTR_TYPE_SHARED_INFO but kvm_xen_start_timer() now needs the vCPU's pvclock to be valid. Should actually starting the timer not be deferred until then? (Or simply add a check here and have the attribute setting fail if the pvclock is not valid).


There are no such dependencies and I don't want there to be. That would be the *epitome* of what my "if it needs documenting, fix it first" mantra is intended to correct.

The fact that this broke on migration because the hv_clock isn't set up yet, as we saw in our overnight testing, is just a bug. In my tree I've fixed it thus:

index 63531173dad1..e3d2d63eef34 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -182,7 +182,7 @@ static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,
 	 * the absolute CLOCK_MONOTONIC time at which the timer should
 	 * fire.
 	 */
-	if (vcpu->kvm->arch.use_master_clock &&
+	if (vcpu->arch.hv_clock.version && vcpu->kvm->arch.use_master_clock &&
 	    static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
 		uint64_t host_tsc, guest_tsc;

@@ -206,9 +206,23 @@ static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,

 		/* Calculate the guest kvmclock as the guest would do it. */
 		guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
-		guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock, guest_tsc);
+		guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock,
+						  guest_tsc);
 	} else {
-		/* Without CONSTANT_TSC, get_kvmclock_ns() is the only option */
+		/*
+		 * Without CONSTANT_TSC, get_kvmclock_ns() is the only option.
+		 *
+		 * Also if the guest PV clock hasn't been set up yet, as is
+		 * likely to be the case during migration when the vCPU has
+		 * not been run yet. It would be possible to calculate the
+		 * scaling factors properly in that case but there's not much
+		 * point in doing so. The get_kvmclock_ns() drift accumulates
+		 * over time, so it's OK to use it at startup. Besides, on
+		 * migration there's going to be a little bit of skew in the
+		 * precise moment at which timers fire anyway. Often they'll
+		 * be in the "past" by the time the VM is running again after
+		 * migration.
+		 */
 		guest_now = get_kvmclock_ns(vcpu->kvm);
 		kernel_now = ktime_get();
 	}
--
2.41.0

We *could* reset the timer when the vCPU starts to run and handles the KVM_REQ_CLOCK_UPDATE event, but I don't want to for two reasons.

Firstly, we just don't need that complexity. This approach is OK, as the newly-added comment says. And we do need to fix get_kvmclock_ns() anyway, so it should work fine. Most of this patch will still be useful as it uses a single TSC read and we *do* need to do that part even after all the kvmclock brokenness is fixed. But the complexity on KVM_REQ_CLOCK_UPDATE isn't needed in the long term.

Secondly, it's also the wrong thing to do in the general case. Let's say KVM does its thing and snaps the kvmclock backwards in time on a KVM_REQ_CLOCK_UPDATE... do we really want to reinterpret existing timers against the new kvmclock? They were best left alone, I think.
Paul Durrant Oct. 31, 2023, 12:06 p.m. UTC | #3
On 31/10/2023 11:42, David Woodhouse wrote:
> 
> 
> On 31 October 2023 10:42:42 GMT, Paul Durrant <xadimgnik@gmail.com> wrote:
>> There is no documented ordering requirement on setting KVM_XEN_VCPU_ATTR_TYPE_TIMER versus KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO or KVM_XEN_ATTR_TYPE_SHARED_INFO but kvm_xen_start_timer() now needs the vCPU's pvclock to be valid. Should actually starting the timer not be deferred until then? (Or simply add a check here and have the attribute setting fail if the pvclock is not valid).
> 
> 
> There are no such dependencies and I don't want there to be. That would be the *epitome* of what my "if it needs documenting, fix it first" mantra is intended to correct.
> 
> The fact that this broke on migration because the hv_clock isn't set up yet, as we saw in our overnight testing, is just a bug. In my tree I've fixed it thus:
> 
> index 63531173dad1..e3d2d63eef34 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -182,7 +182,7 @@ static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,
>  	 * the absolute CLOCK_MONOTONIC time at which the timer should
>  	 * fire.
>  	 */
> -	if (vcpu->kvm->arch.use_master_clock &&
> +	if (vcpu->arch.hv_clock.version && vcpu->kvm->arch.use_master_clock &&
>  	    static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
>  		uint64_t host_tsc, guest_tsc;
> 
> @@ -206,9 +206,23 @@ static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,
> 
>  		/* Calculate the guest kvmclock as the guest would do it. */
>  		guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> -		guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock, guest_tsc);
> +		guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock,
> +						  guest_tsc);
>  	} else {
> -		/* Without CONSTANT_TSC, get_kvmclock_ns() is the only option */
> +		/*
> +		 * Without CONSTANT_TSC, get_kvmclock_ns() is the only option.
> +		 *
> +		 * Also if the guest PV clock hasn't been set up yet, as is
> +		 * likely to be the case during migration when the vCPU has
> +		 * not been run yet. It would be possible to calculate the
> +		 * scaling factors properly in that case but there's not much
> +		 * point in doing so. The get_kvmclock_ns() drift accumulates
> +		 * over time, so it's OK to use it at startup. Besides, on
> +		 * migration there's going to be a little bit of skew in the
> +		 * precise moment at which timers fire anyway. Often they'll
> +		 * be in the "past" by the time the VM is running again after
> +		 * migration.
> +		 */
>  		guest_now = get_kvmclock_ns(vcpu->kvm);
>  		kernel_now = ktime_get();
>  	}
> --
> 2.41.0
> 
> We *could* reset the timer when the vCPU starts to run and handles the KVM_REQ_CLOCK_UPDATE event, but I don't want to for two reasons.
> 
> Firstly, we just don't need that complexity. This approach is OK, as the newly-added comment says. And we do need to fix get_kvmclock_ns() anyway, so it should work fine. Most of this patch will still be useful as it uses a single TSC read and we *do* need to do that part even after all the kvmclock brokenness is fixed. But the complexity on KVM_REQ_CLOCK_UPDATE isn't needed in the long term.
> 
> Secondly, it's also the wrong thing to do in the general case. Let's say KVM does its thing and snaps the kvmclock backwards in time on a KVM_REQ_CLOCK_UPDATE... do we really want to reinterpret existing timers against the new kvmclock? They were best left alone, I think.

Do we not want to do exactly that? If the master clock is changed, why 
would we not want to re-interpret the guest's idea of time? That update 
will be visible to the guest when it re-reads the PV clock source.

   Paul
David Woodhouse Oct. 31, 2023, 12:11 p.m. UTC | #4
On 31 October 2023 12:06:17 GMT, Paul Durrant <xadimgnik@gmail.com> wrote:
>On 31/10/2023 11:42, David Woodhouse wrote:
>> Secondly, it's also the wrong thing to do in the general case. Let's say KVM does its thing and snaps the kvmclock backwards in time on a KVM_REQ_CLOCK_UPDATE... do we really want to reinterpret existing timers against the new kvmclock? They were best left alone, I think.
>
>Do we not want to do exactly that? If the master clock is changed, why would we not want to re-interpret the guest's idea of time? That update will be visible to the guest when it re-reads the PV clock source.

Well no, because the guest set that timer *before* we yanked the clock from under it, and probably wants it interpreted in the time scale which was in force at the time it was set.

But more to the point, KVM shouldn't be doing that! We need to *fix* the kvmclock brokenness, not design further band-aids around it.

As I said, this patch stands even *after* we fix kvmclock, because it handles the timer delta calculation from a single TSC read.

But overengineering a timer reset on KVM_REQ_CLOCK_UPDATE would not.
Paul Durrant Oct. 31, 2023, 12:22 p.m. UTC | #5
On 31/10/2023 12:11, David Woodhouse wrote:
> 
> 
> On 31 October 2023 12:06:17 GMT, Paul Durrant <xadimgnik@gmail.com> wrote:
>> On 31/10/2023 11:42, David Woodhouse wrote:
>>> Secondly, it's also the wrong thing to do in the general case. Let's say KVM does its thing and snaps the kvmclock backwards in time on a KVM_REQ_CLOCK_UPDATE... do we really want to reinterpret existing timers against the new kvmclock? They were best left alone, I think.
>>
>> Do we not want to do exactly that? If the master clock is changed, why would we not want to re-interpret the guest's idea of time? That update will be visible to the guest when it re-reads the PV clock source.
> 
> Well no, because the guest set that timer *before* we yanked the clock from under it, and probably wants it interpreted in the time scale which was in force at the time it was set.
> 
> But more to the point, KVM shouldn't be doing that! We need to *fix* the kvmclock brokenness, not design further band-aids around it.

Ok, fair enough.

> 
> As I said, this patch stands even *after* we fix kvmclock, because it handles the timer delta calculation from a single TSC read.
> 
> But overengineering a timer reset on KVM_REQ_CLOCK_UPDATE would not.

I'm not sure what you intend to do to kvmclock, so not sure whether we'll 
still need the __pvclock_read_cycles(&vcpu->arch.hv_clock, guest_tsc) 
but this patch (with the extra check on validity of hv_clock) does fix 
the drift so...

Reviewed-by: Paul Durrant <paul@xen.org>
David Woodhouse Oct. 31, 2023, 4:34 p.m. UTC | #6
On Tue, 2023-10-31 at 12:22 +0000, Paul Durrant wrote:
> 
> > 
> > As I said, this patch stands even *after* we fix kvmclock, because
> > it handles the timer delta calculation from a single TSC read.
> > 
> > But overengineering a timer reset on KVM_REQ_CLOCK_UPDATE would not.
> 
> I'm not sure what you intend to do to kvmclock, so not sure whether we'll 
> still need the __pvclock_read_cycles(&vcpu->arch.hv_clock, guest_tsc)
> but this patch (with the extra check on validity of hv_clock) does fix 
> the drift so...
> 
> Reviewed-by: Paul Durrant <paul@xen.org>

Ta. And no, I'm not quite sure what I'm going to do with kvmclock for
the general case yet. The more I look at it, the more I realise how
broken it is.

Last week I thought it was just about the way KVM 'jumps' the kvmclock
and yanks it back to the host's CLOCK_MONOTONIC_RAW, but I thought KVM
at *least* managed to do it right between those times. But no, this
patch is addressing the fact that even *without* those clock jumps, KVM
doesn't manage to calculate the guest clock the same way that it tells
the guest to... and thus doesn't get the same results :)

I think it involves get_kvmclock_ns() using the frequency given in the
KVM-wide (not per-vCPU) KVM_SET_TSC_KHZ ioctl, and scaling via that to
get the guest clock. That should match, without having to have a
specific vCPU's hv_clock to play with. And maybe we can also have a
get_kvmclock_ns_at() which takes a host TSC value, and the timer code
from this patch can use that instead of using __pvclock_read_cycles()
directly.

That's probably the easy part. Fixing the 'resync' to
CLOCK_MONOTONIC_RAW, while keeping things working for the now-
considered-pathological !CONSTANT_TSC case, will be slightly more fun.
As well as suspend etc. in the CONSTANT_TSC case, of course.

And replacing that stupid KVM_CLOCK_REALTIME with something that uses
CLOCK_TAI. Or maybe just making it export the tai_offset at the same
moment?
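
For illustration, the calculation this patch makes from a single host TSC
read could be wrapped in a tiny helper of roughly this shape, using only
calls that already appear in the patch (kvm_read_l1_tsc() and
__pvclock_read_cycles()). The name is made up; the get_kvmclock_ns_at()
idea above would differ in using the KVM-wide scaling information from
KVM_SET_TSC_KHZ rather than a specific vCPU's hv_clock:

/* Hypothetical helper, not part of the posted series. */
static u64 kvm_xen_clock_ns_at(struct kvm_vcpu *vcpu, u64 host_tsc)
{
	/* Host TSC -> guest TSC, as the hardware scaling would do it... */
	u64 guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);

	/* ...then guest TSC -> nanoseconds with the guest's own pvclock info. */
	return __pvclock_read_cycles(&vcpu->arch.hv_clock, guest_tsc);
}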
David Woodhouse Oct. 31, 2023, 7:45 p.m. UTC | #7
On Mon, 2023-10-30 at 15:50 +0000, David Woodhouse wrote:
> 
> +static int do_monotonic(s64 *t, u64 *tsc_timestamp)
> +{
> +	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
> +	unsigned long seq;
> +	int mode;
> +	u64 ns;
> +
> +	do {
> +		seq = read_seqcount_begin(&gtod->seq);
> +		ns = gtod->clock.base_cycles;
> +		ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
> +		ns >>= gtod->clock.shift;
> +		ns += ktime_to_ns(ktime_add(gtod->clock.offset,
> gtod->offs_boot));
> +	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
> +	*t = ns;
> +
> +	return mode;
> +}
> +

Hrm, that's basically cargo-culted from do_monotonic_raw() immediately
above it. Should it be adding gtod->offs_boot?

Empirically the answer would appear to be 'no'. When gtod->offs_boot is
non-zero, I see kvm_get_monotonic_and_clockread() returning values
which are precisely that far in advance of what ktime_get() reports.
David Woodhouse Oct. 31, 2023, 10:38 p.m. UTC | #8
On Tue, 2023-10-31 at 19:45 +0000, David Woodhouse wrote:
> On Mon, 2023-10-30 at 15:50 +0000, David Woodhouse wrote:
> > 
> > +static int do_monotonic(s64 *t, u64 *tsc_timestamp)
> > +{
> > +	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
> > +	unsigned long seq;
> > +	int mode;
> > +	u64 ns;
> > +
> > +	do {
> > +		seq = read_seqcount_begin(&gtod->seq);
> > +		ns = gtod->clock.base_cycles;
> > +		ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
> > +		ns >>= gtod->clock.shift;
> > +		ns += ktime_to_ns(ktime_add(gtod->clock.offset, gtod->offs_boot));
> > +	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
> > +	*t = ns;
> > +
> > +	return mode;
> > +}
> > +
> 
> Hrm, that's basically cargo-culted from do_monotonic_raw() immediately
> above it. Should it be adding gtod->offs_boot?
> 
> Empirically the answer would appear to be 'no'. When gtod->offs_boot is
> non-zero, I see kvm_get_monotonic_and_clockread() returning values
> which are precisely that far in advance of what ktime_get() reports.


.... because the do_monotonic_raw() function, despite the simple
clarity of its name... doesn't actually return the CLOCK_MONOTONIC_RAW
time. Of course it doesn't. Why would a function with that name return
the MONOTONIC_RAW clock?

It actually returns the same as get_kvmclock_base_ns(), which is

	/* Count up from boot time, but with the frequency of the raw clock.  */
	return ktime_to_ns(ktime_add(ktime_get_raw(), pvclock_gtod_data.offs_boot));

I feel that Grey's Law is starting to apply to this clock stuff. This
is starting to be indistinguishable from malice ;)
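
If that's the case, the CLOCK_MONOTONIC variant presumably shouldn't add
offs_boot at all. A sketch of what do_monotonic() would then look like —
the same loop as posted, minus the offs_boot addition; not a patch that was
sent in this thread:

static int do_monotonic(s64 *t, u64 *tsc_timestamp)
{
	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
	unsigned long seq;
	int mode;
	u64 ns;

	do {
		seq = read_seqcount_begin(&gtod->seq);
		ns = gtod->clock.base_cycles;
		ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
		ns >>= gtod->clock.shift;
		/* No offs_boot: adding it gives boot time, not CLOCK_MONOTONIC */
		ns += ktime_to_ns(gtod->clock.offset);
	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
	*t = ns;

	return mode;
}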

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 41cce5031126..aeede83d65dc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2863,6 +2863,25 @@  static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
 	return mode;
 }
 
+static int do_monotonic(s64 *t, u64 *tsc_timestamp)
+{
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+	unsigned long seq;
+	int mode;
+	u64 ns;
+
+	do {
+		seq = read_seqcount_begin(&gtod->seq);
+		ns = gtod->clock.base_cycles;
+		ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
+		ns >>= gtod->clock.shift;
+		ns += ktime_to_ns(ktime_add(gtod->clock.offset, gtod->offs_boot));
+	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
+	*t = ns;
+
+	return mode;
+}
+
 static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
 {
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
@@ -2895,6 +2914,17 @@  static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
 						      tsc_timestamp));
 }
 
+/* returns true if host is using TSC based clocksource */
+bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
+{
+	/* checked again under seqlock below */
+	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
+		return false;
+
+	return gtod_is_based_on_tsc(do_monotonic(kernel_ns,
+						 tsc_timestamp));
+}
+
 /* returns true if host is using TSC based clocksource */
 static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
 					   u64 *tsc_timestamp)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 1e7be1f6ab29..c08c6f729965 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -293,6 +293,7 @@  static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
 void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
 
 u64 get_kvmclock_ns(struct kvm *kvm);
+bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp);
 
 int kvm_read_guest_virt(struct kvm_vcpu *vcpu,
 	gva_t addr, void *val, unsigned int bytes,
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 0ea6016ad132..00a1e924a717 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -24,6 +24,7 @@ 
 #include <xen/interface/sched.h>
 
 #include <asm/xen/cpuid.h>
+#include <asm/pvclock.h>
 
 #include "cpuid.h"
 #include "trace.h"
@@ -144,17 +145,87 @@  static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
+static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,
+				bool linux_wa)
 {
+	uint64_t guest_now;
+	int64_t kernel_now, delta;
+
+	 /*
+	 * The guest provides the requested timeout in absolute nanoseconds
+	 * of the KVM clock — as *it* sees it, based on the scaled TSC and
+	 * the pvclock information provided by KVM.
+	 *
+	 * The kernel doesn't support hrtimers based on CLOCK_MONOTONIC_RAW
+	 * so use CLOCK_MONOTONIC. In the timescales covered by timers, the
+	 * difference won't matter much as there is no cumulative effect.
+	 *
+	 * Calculate the time for some arbitrary point in time around "now"
+	 * in terms of both kvmclock and CLOCK_MONOTONIC. Calculate the
+	 * delta between the kvmclock "now" value and the guest's requested
+	 * timeout, apply the "Linux workaround" described below, and add
+	 * the resulting delta to the CLOCK_MONOTONIC "now" value, to get
+	 * the absolute CLOCK_MONOTONIC time at which the timer should
+	 * fire.
+	 */
+	if (vcpu->kvm->arch.use_master_clock &&
+	    static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+		uint64_t host_tsc, guest_tsc;
+
+		if (!IS_ENABLED(CONFIG_64BIT) ||
+		    !kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) {
+			/*
+			 * Don't fall back to get_kvmclock_ns() because it's
+			 * broken; it has a systemic error in its results
+			 * because it scales directly from host TSC to
+			 * nanoseconds, and doesn't scale first to guest TSC
+			 * and *then* to nanoseconds as the guest does.
+			 *
+			 * There is a small error introduced here because time
+			 * continues to elapse between the ktime_get() and the
+			 * subsequent rdtsc(). But not the systemic drift due
+			 * to get_kvmclock_ns().
+			 */
+			kernel_now = ktime_get(); /* This is CLOCK_MONOTONIC */
+			host_tsc = rdtsc();
+		}
+
+		/* Calculate the guest kvmclock as the guest would do it. */
+		guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+		guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock, guest_tsc);
+	} else {
+		/* Without CONSTANT_TSC, get_kvmclock_ns() is the only option */
+		guest_now = get_kvmclock_ns(vcpu->kvm);
+		kernel_now = ktime_get();
+	}
+
+	delta = guest_abs - guest_now;
+
+	/* Xen has a 'Linux workaround' in do_set_timer_op() which
+	 * checks for negative absolute timeout values (caused by
+	 * integer overflow), and for values about 13 days in the
+	 * future (2^50ns) which would be caused by jiffies
+	 * overflow. For those cases, it sets the timeout 100ms in
+	 * the future (not *too* soon, since if a guest really did
+	 * set a long timeout on purpose we don't want to keep
+	 * churning CPU time by waking it up).
+	 */
+	if (linux_wa) {
+		if ((unlikely((int64_t)guest_abs < 0 ||
+			      (delta > 0 && (uint32_t) (delta >> 50) != 0)))) {
+			delta = 100 * NSEC_PER_MSEC;
+			guest_abs = guest_now + delta;
+		}
+	}
+
 	atomic_set(&vcpu->arch.xen.timer_pending, 0);
 	vcpu->arch.xen.timer_expires = guest_abs;
 
-	if (delta_ns <= 0) {
+	if (delta <= 0) {
 		xen_timer_callback(&vcpu->arch.xen.timer);
 	} else {
-		ktime_t ktime_now = ktime_get();
 		hrtimer_start(&vcpu->arch.xen.timer,
-			      ktime_add_ns(ktime_now, delta_ns),
+			      ktime_add_ns(kernel_now, delta),
 			      HRTIMER_MODE_ABS_HARD);
 	}
 }
@@ -923,8 +994,7 @@  int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 		/* Start the timer if the new value has a valid vector+expiry. */
 		if (data->u.timer.port && data->u.timer.expires_ns)
 			kvm_xen_start_timer(vcpu, data->u.timer.expires_ns,
-					    data->u.timer.expires_ns -
-					    get_kvmclock_ns(vcpu->kvm));
+					    false);
 
 		r = 0;
 		break;
@@ -1340,7 +1410,6 @@  static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
 {
 	struct vcpu_set_singleshot_timer oneshot;
 	struct x86_exception e;
-	s64 delta;
 
 	if (!kvm_xen_timer_enabled(vcpu))
 		return false;
@@ -1374,13 +1443,7 @@  static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
 			return true;
 		}
 
-		delta = oneshot.timeout_abs_ns - get_kvmclock_ns(vcpu->kvm);
-		if ((oneshot.flags & VCPU_SSHOTTMR_future) && delta < 0) {
-			*r = -ETIME;
-			return true;
-		}
-
-		kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, delta);
+		kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, false);
 		*r = 0;
 		return true;
 
@@ -1404,25 +1467,7 @@  static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout,
 		return false;
 
 	if (timeout) {
-		uint64_t guest_now = get_kvmclock_ns(vcpu->kvm);
-		int64_t delta = timeout - guest_now;
-
-		/* Xen has a 'Linux workaround' in do_set_timer_op() which
-		 * checks for negative absolute timeout values (caused by
-		 * integer overflow), and for values about 13 days in the
-		 * future (2^50ns) which would be caused by jiffies
-		 * overflow. For those cases, it sets the timeout 100ms in
-		 * the future (not *too* soon, since if a guest really did
-		 * set a long timeout on purpose we don't want to keep
-		 * churning CPU time by waking it up).
-		 */
-		if (unlikely((int64_t)timeout < 0 ||
-			     (delta > 0 && (uint32_t) (delta >> 50) != 0))) {
-			delta = 100 * NSEC_PER_MSEC;
-			timeout = guest_now + delta;
-		}
-
-		kvm_xen_start_timer(vcpu, timeout, delta);
+		kvm_xen_start_timer(vcpu, timeout, true);
 	} else {
 		kvm_xen_stop_timer(vcpu);
 	}