
[v2] KVM: x86/xen: improve accuracy of Xen timers

Message ID 74f32bfae7243a78d0e74b1ba3a2d1ea4a4a7518.camel@infradead.org (mailing list archive)
State New, archived
Series [v2] KVM: x86/xen: improve accuracy of Xen timers

Commit Message

David Woodhouse Oct. 31, 2023, 11:13 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

A test program such as http://david.woodhou.se/timerlat.c confirms user
reports that timers are increasingly inaccurate as the lifetime of a
guest increases. Reporting the actual delay observed when asking for
100µs of sleep, it starts off OK on a newly-launched guest but gets
worse over time, giving incorrect sleep times:

root@ip-10-0-193-21:~# ./timerlat -c -n 5
00000000 latency 103243/100000 (3.2430%)
00000001 latency 103243/100000 (3.2430%)
00000002 latency 103242/100000 (3.2420%)
00000003 latency 103245/100000 (3.2450%)
00000004 latency 103245/100000 (3.2450%)

The biggest problem is that get_kvmclock_ns() returns inaccurate values
when the guest TSC is scaled. The guest sees a TSC value scaled from the
host TSC by a mul/shift conversion (hopefully done in hardware). The
guest then converts that guest TSC value into nanoseconds using the
mul/shift conversion given to it by the KVM pvclock information.

But get_kvmclock_ns() performs only a single conversion directly from
host TSC to nanoseconds, giving a different result. A test program at
http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
over a day.

It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
that. The actual guest hv_clock is per-CPU, and *theoretically* each
vCPU could be running at a *different* frequency. But this patch is
needed anyway because...

The other issue with Xen timers was that the code would snapshot the
host CLOCK_MONOTONIC at some point in time, and then... after a few
interrupts may have occurred, some preemption perhaps... would also read
the guest's kvmclock. Then it would proceed under the false assumption
that those two happened at the *same* time. Any time which *actually*
elapsed between reading the two clocks was introduced as inaccuracies
in the time at which the timer fired.

Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
host TSC just *once*, then use the returned TSC value to calculate the
kvmclock (making sure to do that the way the guest would instead of
making the same mistake get_kvmclock_ns() does).

Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
timers still have to use CLOCK_MONOTONIC. In practice the difference
between the two won't matter over the timescales involved, as the
*absolute* values don't matter; just the delta.

This does mean a new variant of kvm_get_time_and_clockread() is needed;
called kvm_get_monotonic_and_clockread() because that's what it does.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>

---
v2: 
  • Fall back to get_kvmclock_ns() if vcpu->arch.hv_clock isn't set up
    yet, with a big comment explaining why that's actually OK.
  • Fix do_monotonic() *not* to add the boot time offset.
  • Rename do_monotonic_raw() → do_kvmclock_base() and add a comment
    to make it clear that it *does* add the boot time offset. That
    was just left as a bear trap for the unwary developer, wasn't it?

 arch/x86/kvm/x86.c |  61 +++++++++++++++++++++--
 arch/x86/kvm/x86.h |   1 +
 arch/x86/kvm/xen.c | 121 ++++++++++++++++++++++++++++++++++-----------
 3 files changed, 149 insertions(+), 34 deletions(-)

Comments

Dongli Zhang Nov. 7, 2023, 1:44 a.m. UTC | #1
Hi David,

On 10/31/23 16:13, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> A test program such as http://david.woodhou.se/timerlat.c confirms user
> reports that timers are increasingly inaccurate as the lifetime of a
> guest increases. Reporting the actual delay observed when asking for
> 100µs of sleep, it starts off OK on a newly-launched guest but gets
> worse over time, giving incorrect sleep times:
> 
> root@ip-10-0-193-21:~# ./timerlat -c -n 5
> 00000000 latency 103243/100000 (3.2430%)
> 00000001 latency 103243/100000 (3.2430%)
> 00000002 latency 103242/100000 (3.2420%)
> 00000003 latency 103245/100000 (3.2450%)
> 00000004 latency 103245/100000 (3.2450%)
> 
> The biggest problem is that get_kvmclock_ns() returns inaccurate values
> when the guest TSC is scaled. The guest sees a TSC value scaled from the
> host TSC by a mul/shift conversion (hopefully done in hardware). The
> guest then converts that guest TSC value into nanoseconds using the
> mul/shift conversion given to it by the KVM pvclock information.
> 
> But get_kvmclock_ns() performs only a single conversion directly from
> host TSC to nanoseconds, giving a different result. A test program at
> http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
> over a day.
> 
> It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
> that. The actual guest hv_clock is per-CPU, and *theoretically* each
> vCPU could be running at a *different* frequency. But this patch is
> needed anyway because...
> 
> The other issue with Xen timers was that the code would snapshot the
> host CLOCK_MONOTONIC at some point in time, and then... after a few
> interrupts may have occurred, some preemption perhaps... would also read
> the guest's kvmclock. Then it would proceed under the false assumption
> that those two happened at the *same* time. Any time which *actually*
> elapsed between reading the two clocks was introduced as inaccuracies
> in the time at which the timer fired.
> 
> Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
> host TSC just *once*, then use the returned TSC value to calculate the
> kvmclock (making sure to do that the way the guest would instead of
> making the same mistake get_kvmclock_ns() does).
> 
> Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
> timers still have to use CLOCK_MONOTONIC. In practice the difference
> between the two won't matter over the timescales involved, as the
> *absolute* values don't matter; just the delta.
> 
> This does mean a new variant of kvm_get_time_and_clockread() is needed;
> called kvm_get_monotonic_and_clockread() because that's what it does.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> Reviewed-by: Paul Durrant <paul@xen.org>
> 
> ---
> v2: 
>   • Fall back to get_kvmclock_ns() if vcpu->arch.hv_clock isn't set up
>     yet, with a big comment explaining why that's actually OK.
>   • Fix do_monotonic() *not* to add the boot time offset.
>   • Rename do_monotonic_raw() → do_kvmclock_base() and add a comment
>     to make it clear that it *does* add the boot time offset. That
>     was just left as a bear trap for the unwary developer, wasn't it?
> 
>  arch/x86/kvm/x86.c |  61 +++++++++++++++++++++--
>  arch/x86/kvm/x86.h |   1 +
>  arch/x86/kvm/xen.c | 121 ++++++++++++++++++++++++++++++++++-----------
>  3 files changed, 149 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6eaab714d90a..e479637af42c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2844,7 +2844,11 @@ static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
>  	return v * clock->mult;
>  }
>  
> -static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
> +/*
> + * As with get_kvmclock_base_ns(), this counts from boot time, at the
> + * frequency of CLOCK_MONOTONIC_RAW (hence adding gtos->offs_boot).
> + */
> +static int do_kvmclock_base(s64 *t, u64 *tsc_timestamp)
>  {
>  	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
>  	unsigned long seq;
> @@ -2863,6 +2867,29 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
>  	return mode;
>  }
>  
> +/*
> + * This calculates CLOCK_MONOTONIC at the time of the TSC snapshot, with
> + * no boot time offset.
> + */
> +static int do_monotonic(s64 *t, u64 *tsc_timestamp)
> +{
> +	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
> +	unsigned long seq;
> +	int mode;
> +	u64 ns;
> +
> +	do {
> +		seq = read_seqcount_begin(&gtod->seq);
> +		ns = gtod->clock.base_cycles;
> +		ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
> +		ns >>= gtod->clock.shift;
> +		ns += ktime_to_ns(gtod->clock.offset);
> +	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
> +	*t = ns;
> +
> +	return mode;
> +}
> +
>  static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
>  {
>  	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
> @@ -2884,18 +2911,42 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
>  	return mode;
>  }
>  
> -/* returns true if host is using TSC based clocksource */
> +/*
> + * Calculates the kvmclock_base_ns (CLOCK_MONOTONIC_RAW + boot time) and
> + * reports the TSC value from which it did so. Returns true if host is
> + * using TSC based clocksource.
> + */
>  static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
>  {
>  	/* checked again under seqlock below */
>  	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
>  		return false;
>  
> -	return gtod_is_based_on_tsc(do_monotonic_raw(kernel_ns,
> -						      tsc_timestamp));
> +	return gtod_is_based_on_tsc(do_kvmclock_base(kernel_ns,
> +						     tsc_timestamp));
>  }
>  
> -/* returns true if host is using TSC based clocksource */
> +/*
> + * Calculates CLOCK_MONOTONIC and reports the TSC value from which it did
> + * so. Returns true if host is using TSC based clocksource.
> + */
> +bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
> +{
> +	/* checked again under seqlock below */
> +	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
> +		return false;
> +
> +	return gtod_is_based_on_tsc(do_monotonic(kernel_ns,
> +						 tsc_timestamp));
> +}
> +
> +/*
> + * Calculates CLOCK_REALTIME and reports the TSC value from which it did
> + * so. Returns true if host is using TSC based clocksource.
> + *
> + * DO NOT USE this for anything related to migration. You want CLOCK_TAI
> + * for that.
> + */
>  static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
>  					   u64 *tsc_timestamp)
>  {
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 1e7be1f6ab29..c08c6f729965 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -293,6 +293,7 @@ static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
>  void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
>  
>  u64 get_kvmclock_ns(struct kvm *kvm);
> +bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp);
>  
>  int kvm_read_guest_virt(struct kvm_vcpu *vcpu,
>  	gva_t addr, void *val, unsigned int bytes,
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 751d9a984668..e3d2d63eef34 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -24,6 +24,7 @@
>  #include <xen/interface/sched.h>
>  
>  #include <asm/xen/cpuid.h>
> +#include <asm/pvclock.h>
>  
>  #include "cpuid.h"
>  #include "trace.h"
> @@ -158,8 +159,93 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
>  	return HRTIMER_NORESTART;
>  }
>  
> -static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
> +static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,
> +				bool linux_wa)
>  {
> +	uint64_t guest_now;
> +	int64_t kernel_now, delta;
> +
> +	/*
> +	 * The guest provides the requested timeout in absolute nanoseconds
> +	 * of the KVM clock — as *it* sees it, based on the scaled TSC and
> +	 * the pvclock information provided by KVM.
> +	 *
> +	 * The kernel doesn't support hrtimers based on CLOCK_MONOTONIC_RAW
> +	 * so use CLOCK_MONOTONIC. In the timescales covered by timers, the
> +	 * difference won't matter much as there is no cumulative effect.
> +	 *
> +	 * Calculate the time for some arbitrary point in time around "now"
> +	 * in terms of both kvmclock and CLOCK_MONOTONIC. Calculate the
> +	 * delta between the kvmclock "now" value and the guest's requested
> +	 * timeout, apply the "Linux workaround" described below, and add
> +	 * the resulting delta to the CLOCK_MONOTONIC "now" value, to get
> +	 * the absolute CLOCK_MONOTONIC time at which the timer should
> +	 * fire.
> +	 */
> +	if (vcpu->arch.hv_clock.version && vcpu->kvm->arch.use_master_clock &&
> +	    static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {

Is there any reason to use both vcpu->kvm->arch.use_master_clock and
X86_FEATURE_CONSTANT_TSC?

I think even __get_kvmclock() would not require both cases at the same time?

 3071         if (ka->use_master_clock &&
 3072             (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {

> +		uint64_t host_tsc, guest_tsc;
> +
> +		if (!IS_ENABLED(CONFIG_64BIT) ||
> +		    !kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) {
> +			/*
> +			 * Don't fall back to get_kvmclock_ns() because it's
> +			 * broken; it has a systemic error in its results
> +			 * because it scales directly from host TSC to
> +			 * nanoseconds, and doesn't scale first to guest TSC
> +			 * and *then* to nanoseconds as the guest does.
> +			 *
> +			 * There is a small error introduced here because time
> +			 * continues to elapse between the ktime_get() and the
> +			 * subsequent rdtsc(). But not the systemic drift due
> +			 * to get_kvmclock_ns().
> +			 */
> +			kernel_now = ktime_get(); /* This is CLOCK_MONOTONIC */
> +			host_tsc = rdtsc();
> +		}
> +
> +		/* Calculate the guest kvmclock as the guest would do it. */
> +		guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> +		guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock,
> +						  guest_tsc);
> +	} else {
> +		/*
> +		 * Without CONSTANT_TSC, get_kvmclock_ns() is the only option.
> +		 *
> +		 * Also if the guest PV clock hasn't been set up yet, as is
> +		 * likely to be the case during migration when the vCPU has
> +		 * not been run yet. It would be possible to calculate the
> +		 * scaling factors properly in that case but there's not much
> +		 * point in doing so. The get_kvmclock_ns() drift accumulates
> +		 * over time, so it's OK to use it at startup. Besides, on
> +		 * migration there's going to be a little bit of skew in the
> +		 * precise moment at which timers fire anyway. Often they'll
> +		 * be in the "past" by the time the VM is running again after
> +		 * migration.
> +		 */
> +		guest_now = get_kvmclock_ns(vcpu->kvm);
> +		kernel_now = ktime_get();

1. Can I assume the issue is still there if we fall into the "else" case? That
is, the increasing inaccuracy as the VM has been up for a longer and longer time?

If that is the case, which may be better?

(1) get_kvmclock_ns(), or
(2) always get_kvmclock_base_ns() + ka->kvmclock_offset, when pvclock is not
enabled, regardless of whether the master clock is used. At least, the inaccuracy
is not going to increase over guest time?


2. I see 3 scenarios here:

(1) vcpu->arch.hv_clock.version and master clock is used.

In this case, the bugfix looks good.

(2) The master clock is used. However, pv clock is not enabled.

In this case, the bug is not resolved? ... even though the master clock is used.

(3) The master clock is not used.

We fall into get_kvmclock_base_ns() + ka->kvmclock_offset. The behavior is not
changed. This looks good.


Just from my own point of view: as this patch involves relatively complex changes,
I would suggest resolving the issue rather than using a temporary solution :)

(I see you mentioned that you will be back with get_kvmclock_ns())


Based on your bug fix, I see the below cases:

If master clock is not used:
    get_kvmclock_base_ns() + ka->kvmclock_offset

If master clock is used:
    If pvclock is enabled:
        use the &vcpu->arch.hv_clock to get current guest time
    Else
        create a temporary hv_clock, based on masterclock.


Regarding the temporary solution, I would suggest creating a new API to
encapsulate and fulfill the above scenarios, if we are not going to touch
__get_kvmclock() at this point.


Thank you very much!

Dongli Zhang


> +	}
> +
> +	delta = guest_abs - guest_now;
> +
> +	/* Xen has a 'Linux workaround' in do_set_timer_op() which
> +	 * checks for negative absolute timeout values (caused by
> +	 * integer overflow), and for values about 13 days in the
> +	 * future (2^50ns) which would be caused by jiffies
> +	 * overflow. For those cases, it sets the timeout 100ms in
> +	 * the future (not *too* soon, since if a guest really did
> +	 * set a long timeout on purpose we don't want to keep
> +	 * churning CPU time by waking it up).
> +	 */
> +	if (linux_wa) {
> +		if ((unlikely((int64_t)guest_abs < 0 ||
> +			      (delta > 0 && (uint32_t) (delta >> 50) != 0)))) {
> +			delta = 100 * NSEC_PER_MSEC;
> +			guest_abs = guest_now + delta;
> +		}
> +	}
> +
>  	/*
>  	 * Avoid races with the old timer firing. Checking timer_expires
>  	 * to avoid calling hrtimer_cancel() will only have false positives
> @@ -171,12 +257,11 @@ static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_
>  	atomic_set(&vcpu->arch.xen.timer_pending, 0);
>  	vcpu->arch.xen.timer_expires = guest_abs;
>  
> -	if (delta_ns <= 0) {
> +	if (delta <= 0) {
>  		xen_timer_callback(&vcpu->arch.xen.timer);
>  	} else {
> -		ktime_t ktime_now = ktime_get();
>  		hrtimer_start(&vcpu->arch.xen.timer,
> -			      ktime_add_ns(ktime_now, delta_ns),
> +			      ktime_add_ns(kernel_now, delta),
>  			      HRTIMER_MODE_ABS_HARD);
>  	}
>  }
> @@ -945,8 +1030,7 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
>  		/* Start the timer if the new value has a valid vector+expiry. */
>  		if (data->u.timer.port && data->u.timer.expires_ns)
>  			kvm_xen_start_timer(vcpu, data->u.timer.expires_ns,
> -					    data->u.timer.expires_ns -
> -					    get_kvmclock_ns(vcpu->kvm));
> +					    false);
>  
>  		r = 0;
>  		break;
> @@ -1389,7 +1473,6 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
>  {
>  	struct vcpu_set_singleshot_timer oneshot;
>  	struct x86_exception e;
> -	s64 delta;
>  
>  	if (!kvm_xen_timer_enabled(vcpu))
>  		return false;
> @@ -1423,9 +1506,7 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
>  			return true;
>  		}
>  
> -		/* A delta <= 0 results in an immediate callback, which is what we want */
> -		delta = oneshot.timeout_abs_ns - get_kvmclock_ns(vcpu->kvm);
> -		kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, delta);
> +		kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, false);
>  		*r = 0;
>  		return true;
>  
> @@ -1449,25 +1530,7 @@ static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout,
>  		return false;
>  
>  	if (timeout) {
> -		uint64_t guest_now = get_kvmclock_ns(vcpu->kvm);
> -		int64_t delta = timeout - guest_now;
> -
> -		/* Xen has a 'Linux workaround' in do_set_timer_op() which
> -		 * checks for negative absolute timeout values (caused by
> -		 * integer overflow), and for values about 13 days in the
> -		 * future (2^50ns) which would be caused by jiffies
> -		 * overflow. For those cases, it sets the timeout 100ms in
> -		 * the future (not *too* soon, since if a guest really did
> -		 * set a long timeout on purpose we don't want to keep
> -		 * churning CPU time by waking it up).
> -		 */
> -		if (unlikely((int64_t)timeout < 0 ||
> -			     (delta > 0 && (uint32_t) (delta >> 50) != 0))) {
> -			delta = 100 * NSEC_PER_MSEC;
> -			timeout = guest_now + delta;
> -		}
> -
> -		kvm_xen_start_timer(vcpu, timeout, delta);
> +		kvm_xen_start_timer(vcpu, timeout, true);
>  	} else {
>  		kvm_xen_stop_timer(vcpu);
>  	}
David Woodhouse Nov. 7, 2023, 8:17 a.m. UTC | #2
On Mon, 2023-11-06 at 17:44 -0800, Dongli Zhang wrote:
> > +       if (vcpu->arch.hv_clock.version && vcpu->kvm->arch.use_master_clock &&
> > +           static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
> 
> If there any reason to use both vcpu->kvm->arch.use_master_clock and
> X86_FEATURE_CONSTANT_TSC?

Er, paranoia? I'll recheck.

> I think even __get_kvmclock() would not require both cases at the same time?
> 
>  3071         if (ka->use_master_clock &&
>  3072             (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {
> 

But it does. That requires ka->use_master_clock (line 3071) AND that we
know the current CPU's TSC frequency (line 3072).

My code insists on the CONSTANT_TSC form of "knowing the current CPU's
TSC frequency" because even with a get_cpu(), it's not clear the guest
*was* running on this vCPU when it did its calculations. So I don't
want to go anywhere near the !CONSTANT_TSC case; it can use the
fallback.


> > +       } else {
> > +               /*
> > +                * Without CONSTANT_TSC, get_kvmclock_ns() is the only option.
> > +                *
> > +                * Also if the guest PV clock hasn't been set up yet, as is
> > +                * likely to be the case during migration when the vCPU has
> > +                * not been run yet. It would be possible to calculate the
> > +                * scaling factors properly in that case but there's not much
> > +                * point in doing so. The get_kvmclock_ns() drift accumulates
> > +                * over time, so it's OK to use it at startup. Besides, on
> > +                * migration there's going to be a little bit of skew in the
> > +                * precise moment at which timers fire anyway. Often they'll
> > +                * be in the "past" by the time the VM is running again after
> > +                * migration.
> > +                */
> > +               guest_now = get_kvmclock_ns(vcpu->kvm);
> > +               kernel_now = ktime_get();
> 
> 1. Can I assume the issue is still there if we fall into the "else" case? That
> is, the increasing inaccuracy as the VM has been up for longer and longer time?
> 
> If that is the case, which may be better?
> 
> (1) get_kvmclock_ns(), or
> (2) always get_kvmclock_base_ns() + ka->kvmclock_offset, when pvclock is not
> enabled, regardless whether master clock is used. At least, the inaccurary is
> not going to increase over guest time?

No, those are both wrong, and drifting further away over time. They are
each *differently* wrong, which is why periodically clamping (1) to (2)
is also broken, as previously discussed. I know you've got a patch to
do that clamping more *often* which would slightly reduce the pain
because the kvmclock wouldn't jump backwards so *far* each time... but
it's still wrong to do so at all (in either direction).

> 
> 2. I see 3 scenarios here:
> 
> (1) vcpu->arch.hv_clock.version and master clock is used.
> 
> In this case, the bugfix looks good.
> 
> (2) The master clock is used. However, pv clock is not enabled.
> 
> In this case, the bug is not resolved? ... even the master clock is used.

Under Xen the PV clock is always enabled. It's in the vcpu_info
structure which is required for any kind of event channel setup.

> 
> (3) The master clock is not used.
> 
> We fall into get_kvmclock_base_ns() + ka->kvmclock_offset. The behavior is not
> changed. This looks good.
> 
> 
> Just from my own point: as this patch involves relatively complex changes, I
> would suggest resolve the issue, but not use a temporary solution :)
> 

This is the conversation I had with Paul on Tuesday, when he wanted me
to fix up this "(3) / behaviour is not changed" case. And yes, I argued
that we *don't* want a temporary solution for this case. Because yes:

> (I see you mentioned that you will be back with get_kvmclock_ns())

We need to fix get_kvmclock_ns() anyway. The systemic drift *and* the
fact that we periodically clamp it to a different clock and make it
jump. I was working on the former and have something half-done but was
preempted by the realisation that the QEMU soft freeze is today, and I
needed to flush my QEMU patch queue.

But even once we fix get_kvmclock_ns(), *this* patch stands. Because it
*also* addresses the "now" problem, where we get the time by one clock
... and then some time passes ... and we get the time by another clock,
and subtract one from the other as if they were the *same* time.

Using kvm_get_monotonic_and_clockread() gives us a single TSC read
corresponding to the CLOCK_MONOTONIC time, from which we can calculate
the kvmclock time. We just *happen* to calculate it correctly here,
unlike anywhere else in KVM.

> Based on your bug fix, I see the below cases:
> 
> If master clock is not used:
>     get_kvmclock_base_ns() + ka->kvmclock_offset
> 
> If master clock is used:
>     If pvclock is enabled:
>         use the &vcpu->arch.hv_clock to get current guest time
>     Else
>         create a temporary hv_clock, based on masterclock.

I don't want to do that last 'else' clause yet, because that feels like
a temporary workaround. It should be enough to call get_kvmclock_ns(),
once we fix it.
Dongli Zhang Nov. 7, 2023, 11:07 p.m. UTC | #3
Hi David,

On 11/7/23 00:17, David Woodhouse wrote:
> On Mon, 2023-11-06 at 17:44 -0800, Dongli Zhang wrote:
>>> +       if (vcpu->arch.hv_clock.version && vcpu->kvm->arch.use_master_clock &&
>>> +           static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
>>
>> If there any reason to use both vcpu->kvm->arch.use_master_clock and
>> X86_FEATURE_CONSTANT_TSC?
> 
> Er, paranoia? I'll recheck.
> 
>> I think even __get_kvmclock() would not require both cases at the same time?
>>
>>  3071         if (ka->use_master_clock &&
>>  3072             (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {
>>
> 
> But it does. That requires ka->use_master_clock (line 3071) AND that we
> know the current CPU's TSC frequency (line 3072).
> 
> My code insists on the CONSTANT_TSC form of "knowing the current CPU's
> TSC frequency" because even with a get_cpu(), it's not clear the guest
> *was* running on this vCPU when it did its calculations. So I don't
> want to go anywhere near the !CONSTANT_TSC case; it can use the
> fallback.
> 
> 
>>> +       } else {
>>> +               /*
>>> +                * Without CONSTANT_TSC, get_kvmclock_ns() is the only option.
>>> +                *
>>> +                * Also if the guest PV clock hasn't been set up yet, as is
>>> +                * likely to be the case during migration when the vCPU has
>>> +                * not been run yet. It would be possible to calculate the
>>> +                * scaling factors properly in that case but there's not much
>>> +                * point in doing so. The get_kvmclock_ns() drift accumulates
>>> +                * over time, so it's OK to use it at startup. Besides, on
>>> +                * migration there's going to be a little bit of skew in the
>>> +                * precise moment at which timers fire anyway. Often they'll
>>> +                * be in the "past" by the time the VM is running again after
>>> +                * migration.
>>> +                */
>>> +               guest_now = get_kvmclock_ns(vcpu->kvm);
>>> +               kernel_now = ktime_get();
>>
>> 1. Can I assume the issue is still there if we fall into the "else" case? That
>> is, the increasing inaccuracy as the VM has been up for longer and longer time?
>>
>> If that is the case, which may be better?
>>
>> (1) get_kvmclock_ns(), or
>> (2) always get_kvmclock_base_ns() + ka->kvmclock_offset, when pvclock is not
>> enabled, regardless whether master clock is used. At least, the inaccurary is
>> not going to increase over guest time?
> 
> No, those are both wrong, and drifting further away over time. They are
> each *differently* wrong, which is why periodically clamping (1) to (2)
> is also broken, as previously discussed. I know you've got a patch to
> do that clamping more *often* which would slightly reduce the pain
> because the kvmclock wouldn't jump backwards so *far* each time... but
> it's still wrong to do so at all (in either direction).
> 
>>
>> 2. I see 3 scenarios here:
>>
>> (1) vcpu->arch.hv_clock.version and master clock is used.
>>
>> In this case, the bugfix looks good.
>>
>> (2) The master clock is used. However, pv clock is not enabled.
>>
>> In this case, the bug is not resolved? ... even the master clock is used.
> 
> Under Xen the PV clock is always enabled. It's in the vcpu_info
> structure which is required for any kind of event channel setup.
> 
>>
>> (3) The master clock is not used.
>>
>> We fall into get_kvmclock_base_ns() + ka->kvmclock_offset. The behavior is not
>> changed. This looks good.
>>
>>
>> Just from my own point: as this patch involves relatively complex changes, I
>> would suggest resolve the issue, but not use a temporary solution :)
>>
> 
> This is the conversation I had with Paul on Tuesday, when he wanted me
> to fix up this "(3) / behaviour is not changed" case. And yes, I argued
> that we *don't* want a temporary solution for this case. Because yes:
> 
>> (I see you mentioned that you will be back with get_kvmclock_ns())
> 
> We need to fix get_kvmclock_ns() anyway. The systemic drift *and* the
> fact that we periodically clamp it to a different clock and make it
> jump. I was working on the former and have something half-done but was
> preempted by the realisation that the QEMU soft freeze is today, and I
> needed to flush my QEMU patch queue.
> 
> But even once we fix get_kvmclock_ns(), *this* patch stands. Because it
> *also* addresses the "now" problem, where we get the time by one clock
> ... and then some time passes ... and we get the time by another clock,
> and subtract one from the other as if they were the *same* time.
> 
> Using kvm_get_monotonic_and_clockread() gives us a single TSC read
> corresponding to the CLOCK_MONOTONIC time, from which we can calculate
> the kvmclock time. We just *happen* to calculate it correctly here,
> unlike anywhere else in KVM.
> 
>> Based on your bug fix, I see the below cases:
>>
>> If master clock is not used:
>>     get_kvmclock_base_ns() + ka->kvmclock_offset
>>
>> If master clock is used:
>>     If pvclock is enabled:
>>         use the &vcpu->arch.hv_clock to get current guest time
>>     Else
>>         create a temporary hv_clock, based on masterclock.
> 
> I don't want to do that last 'else' clause yet, because that feels like
> a temporary workaround. It should be enough to call get_kvmclock_ns(),
> once we fix it.
> 
> 
> 

Thank you very much for the detailed explanation.

I agree it is important to resolve the "now" problem. I guess the KVM lapic
deadline timer has the "now" problem as well.


I just noticed my question missed a key prerequisite:

Would you mind helping explain the time domain of the "oneshot.timeout_abs_ns"?

While it is an absolute nanosecond value on the VM side, which time domain
is it based on?

1. Is oneshot.timeout_abs_ns based on the xen pvclock (freq=NSEC_PER_SEC)?

2. Is oneshot.timeout_abs_ns based on tsc from VM side?

3. Is oneshot.timeout_abs_ns based on monotonic/raw clock at VM side?

4. Or is it based on the wallclock?

I think the OS does not have a concept of nanoseconds. It is derived from a
clocksource.



If it is based on pvclock, is it based on the pvclock from a specific vCPU, as
both the pvclock and the timer are per-vCPU?


E.g., in the KVM lapic deadline timer, all values are based on (1) the
tsc value, and (2) the current vCPU.


1949 static void start_sw_tscdeadline(struct kvm_lapic *apic)
1950 {
1951         struct kvm_timer *ktimer = &apic->lapic_timer;
1952         u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
1953         u64 ns = 0;
1954         ktime_t expire;
1955         struct kvm_vcpu *vcpu = apic->vcpu;
1956         unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
1957         unsigned long flags;
1958         ktime_t now;
1959
1960         if (unlikely(!tscdeadline || !this_tsc_khz))
1961                 return;
1962
1963         local_irq_save(flags);
1964
1965         now = ktime_get();
1966         guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
1967
1968         ns = (tscdeadline - guest_tsc) * 1000000ULL;
1969         do_div(ns, this_tsc_khz);


Sorry if I made the question very confusing. The core question is: where is the
abs nanosecond value from, i.e. from which clocksource is it derived? What will
happen if the Xen VM uses HPET as the clocksource, while using the xen timer as
the clock event?

Regardless of the clocksource, a KVM VM may always use the current vCPU's tsc in
the lapic deadline timer.

Thank you very much!

Dongli Zhang
David Woodhouse Nov. 7, 2023, 11:24 p.m. UTC | #4
On Tue, 2023-11-07 at 15:07 -0800, Dongli Zhang wrote:
> Thank you very much for the detailed explanation.
> 
> I agree it is important to resolve the "now" problem. I guess the KVM lapic
> deadline timer has the "now" problem as well.

I think so. And quite gratuitously so, since it just does:

	now = ktime_get();
	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());


Couldn't that trivially be changed to kvm_get_monotonic_and_clockread()?
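
Something like this, perhaps? (An untested sketch only, reusing the
kvm_get_monotonic_and_clockread() helper added by this patch, and falling
back to the existing pair of reads when the clocksource isn't TSC-based.)

	u64 host_tsc;
	s64 kernel_ns;

	if (IS_ENABLED(CONFIG_64BIT) &&
	    kvm_get_monotonic_and_clockread(&kernel_ns, &host_tsc)) {
		/* One TSC read, and the CLOCK_MONOTONIC value derived from it */
		now = ns_to_ktime(kernel_ns);
		guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
	} else {
		/* Clocksource isn't TSC-based; keep the two separate reads */
		now = ktime_get();
		guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
	}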

Thankfully, it's defined in the time domain of the guest TSC, not the
kvmclock, so it doesn't suffer the same drift issue as the Xen timer.

> I just notice my question missed a key prerequisite:
> 
> Would you mind helping explain the time domain of the "oneshot.timeout_abs_ns"?
> 
> While it is the absolute nanosecond value at the VM side, on which time domain
> it is based?

It's the kvmclock. Xen offers a Xen PV clock to its guests using
*precisely* the same pvclock structure as KVM does.


> 1. Is oneshot.timeout_abs_ns based on the xen pvclock (freq=NSEC_PER_SEC)?
> 
> 2. Is oneshot.timeout_abs_ns based on tsc from VM side?
> 
> 3. Is oneshot.timeout_abs_ns based on monotonic/raw clock at VM side?
> 
> 4. Or it is based on wallclock?
> 
> I think the OS does not have a concept of nanoseconds. It is derived from a
> clocksource.

It's the kvmclock.

The guest derives it from the guest TSC using the pvclock information
(mul/shift/offset) that KVM provides to the guest.
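
Roughly like this (a simplified sketch of the pvclock maths; the real
__pvclock_read_cycles() uses a widening multiply so the scaling can't
overflow, but the structure is the same):

	u64 pvclock_to_ns(const struct pvclock_vcpu_time_info *src, u64 guest_tsc)
	{
		u64 delta = guest_tsc - src->tsc_timestamp;

		if (src->tsc_shift >= 0)
			delta <<= src->tsc_shift;
		else
			delta >>= -src->tsc_shift;

		/* 32.32 fixed-point multiply by tsc_to_system_mul */
		return src->system_time + ((delta * src->tsc_to_system_mul) >> 32);
	}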

The kvm_setup_guest_pvclock() function is potentially called *three*
times from kvm_guest_time_update(). Once for the KVM pv time MSR, once
for the pvclock structure in the Xen vcpu_info, and finally for the
pvclock structure which Xen makes available to userspace for vDSO
timekeeping.

> If it is based on pvclock, is it based on the pvclock from a specific vCPU, as
> both pvclock and timer are per-vCPU.

Yes, it is per-vCPU. Although in the sane case the TSCs on all vCPUs
will match and the mul/shift/offset provided by KVM won't actually
differ. Even in the insane case where guest TSCs are out of sync,
surely the pvclock information will differ only in order to ensure that
the *result* in nanoseconds does not?

I conveniently ducked this question in my patch by only supporting the
CONSTANT_TSC case, and not the case where we happen to know the
(potentially different) TSC frequencies on all the different pCPUs and
vCPUs.


> 
> E.g., according to the KVM lapic deadline timer, all values are based on (1) the
> tsc value, (2)on the current vCPU.
> 
> 
> 1949 static void start_sw_tscdeadline(struct kvm_lapic *apic)
> 1950 {
> 1951         struct kvm_timer *ktimer = &apic->lapic_timer;
> 1952         u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
> 1953         u64 ns = 0;
> 1954         ktime_t expire;
> 1955         struct kvm_vcpu *vcpu = apic->vcpu;
> 1956         unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
> 1957         unsigned long flags;
> 1958         ktime_t now;
> 1959
> 1960         if (unlikely(!tscdeadline || !this_tsc_khz))
> 1961                 return;
> 1962
> 1963         local_irq_save(flags);
> 1964
> 1965         now = ktime_get();
> 1966         guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> 1967
> 1968         ns = (tscdeadline - guest_tsc) * 1000000ULL;
> 1969         do_div(ns, this_tsc_khz);
> 
> 
> Sorry if I make the question very confusing. The core question is: where and
> from which clocksource the abs nanosecond value is from? What will happen if the
> Xen VM uses HPET as clocksource, while xen timer as clock event?

If the guest uses HPET as clocksource and Xen timer as clockevents,
then keeping itself in sync is the *guest's* problem. The Xen timer is
defined in terms of nanoseconds since guest start, as provided in the
pvclock information described above. Hope that helps!
Dongli Zhang Nov. 8, 2023, 1:43 a.m. UTC | #5
Hi David,

On 11/7/23 15:24, David Woodhouse wrote:
> On Tue, 2023-11-07 at 15:07 -0800, Dongli Zhang wrote:
>> Thank you very much for the detailed explanation.
>>
>> I agree it is important to resolve the "now" problem. I guess the KVM lapic
>> deadline timer has the "now" problem as well.
> 
> I think so. And quite gratuitously so, since it just does:
> 
> 	now = ktime_get();
> 	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> 
> 
> Couldn't that trivially be changed to kvm_get_monotonic_and_clockread()?

The core idea is to always capture the pair of (tsc, ns) at exactly the same
time point.

I have no idea how much accuracy it can improve, considering the extra costs to
inject the timer interrupt into the vCPU.

> 
> Thankfully, it's defined in the time domain of the guest TSC, not the
> kvmclock, so it doesn't suffer the same drift issue as the Xen timer.
> 
>> I just notice my question missed a key prerequisite:
>>
>> Would you mind helping explain the time domain of the "oneshot.timeout_abs_ns"?
>>
>> While it is the absolute nanosecond value at the VM side, on which time domain
>> it is based?
> 
> It's the kvmclock. Xen offers as Xen PV clock to its guests using
> *precisely* the same pvclock structure as KVM does. 
> 
> 
>> 1. Is oneshot.timeout_abs_ns based on the xen pvclock (freq=NSEC_PER_SEC)?
>>
>> 2. Is oneshot.timeout_abs_ns based on tsc from VM side?
>>
>> 3. Is oneshot.timeout_abs_ns based on monotonic/raw clock at VM side?
>>
>> 4. Or it is based on wallclock?
>>
>> I think the OS does not have a concept of nanoseconds. It is derived from a
>> clocksource.
> 
> It's the kvmclock.

Thank you very much!

Both the xen clock and the kvm clock are pvclocks based on the same equations.

> 
> The guest derives it from the guest TSC using the pvclock information
> (mul/shift/offset) that KVM provides to the guest.
> 
> The kvm_setup_guest_pvclock() function is potentially called *three*
> times from kvm_guest_time_update(). Once for the KVM pv time MSR, once
> for the pvclock structure in the Xen vcpu_info, and finally for the
> pvclock structure which Xen makes available to userspace for vDSO
> timekeeping.
> 
>> If it is based on pvclock, is it based on the pvclock from a specific vCPU, as
>> both pvclock and timer are per-vCPU.
> 
> Yes, it is per-vCPU. Although in the sane case the TSCs on all vCPUs
> will match and the mul/shift/offset provided by KVM won't actually
> differ. Even in the insane case where guest TSCs are out of sync,
> surely the pvclock information will differ only in order to ensure that
> the *result* in nanoseconds does not?
> 
> I conveniently ducked this question in my patch by only supporting the
> CONSTANT_TSC case, and not the case where we happen to know the
> (potentially different) TSC frequencies on all the different pCPUs and
> vCPUs.

This is also my question: why support only the CONSTANT_TSC case?

For the lapic timer case:

The timer is always calculated based on the *current* vCPU's tsc virtualization,
regardless of CONSTANT_TSC or not.

For the xen timer case:

Why not always calculate the expiry based on the *current* vCPU's time
virtualization? That is, why not always use the current vCPU's hv_clock,
regardless of CONSTANT_TSC/masterclock?

That is: kvm lapic method with kvm_get_monotonic_and_clockread().

> 
> 
>>
>> E.g., according to the KVM lapic deadline timer, all values are based on (1) the
>> tsc value, (2)on the current vCPU.
>>
>>
>> 1949 static void start_sw_tscdeadline(struct kvm_lapic *apic)
>> 1950 {
>> 1951         struct kvm_timer *ktimer = &apic->lapic_timer;
>> 1952         u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
>> 1953         u64 ns = 0;
>> 1954         ktime_t expire;
>> 1955         struct kvm_vcpu *vcpu = apic->vcpu;
>> 1956         unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
>> 1957         unsigned long flags;
>> 1958         ktime_t now;
>> 1959
>> 1960         if (unlikely(!tscdeadline || !this_tsc_khz))
>> 1961                 return;
>> 1962
>> 1963         local_irq_save(flags);
>> 1964
>> 1965         now = ktime_get();
>> 1966         guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>> 1967
>> 1968         ns = (tscdeadline - guest_tsc) * 1000000ULL;
>> 1969         do_div(ns, this_tsc_khz);
>>
>>
>> Sorry if I make the question very confusing. The core question is: where and
>> from which clocksource the abs nanosecond value is from? What will happen if the
>> Xen VM uses HPET as clocksource, while xen timer as clock event?
> 
> If the guest uses HPET as clocksource and Xen timer as clockevents,
> then keeping itself in sync is the *guest's* problem. The Xen timer is
> defined in terms of nanoseconds since guest start, as provided in the
> pvclock information described above. Hope that helps!
>

The "in terms of nanoseconds since guest start" refers to *one* global value.
Should we use wallclock when we are referring to a global value shared by all vCPUs?


Based on the following piece of code, I do not think we can assume all vCPUs
have the same pvclock at the same time point: lines 104-108, when
PVCLOCK_TSC_STABLE_BIT is not set.


 67 static __always_inline
 68 u64 __pvclock_clocksource_read(struct pvclock_vcpu_time_info *src, bool dowd)
 69 {
 70         unsigned version;
 71         u64 ret;
 72         u64 last;
 73         u8 flags;
 74
 75         do {
 76                 version = pvclock_read_begin(src);
 77                 ret = __pvclock_read_cycles(src, rdtsc_ordered());
 78                 flags = src->flags;
 79         } while (pvclock_read_retry(src, version));
... ...
104         last = raw_atomic64_read(&last_value);
105         do {
106                 if (ret <= last)
107                         return last;
108         } while (!raw_atomic64_try_cmpxchg(&last_value, &last, ret));
109
110         return ret;
111 }


That's why I would appreciate a definition of the abs nanoseconds used by the xen
timer (e.g., derived from pvclock). If it is per-vCPU, we may not be able to use
it as a global "nanoseconds since guest start" when PVCLOCK_TSC_STABLE_BIT
is not set.


Thank you very much!

Dongli Zhang
David Woodhouse Nov. 8, 2023, 3:25 p.m. UTC | #6
On Tue, 2023-11-07 at 17:43 -0800, Dongli Zhang wrote:
> Hi David,
> 
> On 11/7/23 15:24, David Woodhouse wrote:
> > On Tue, 2023-11-07 at 15:07 -0800, Dongli Zhang wrote:
> > > Thank you very much for the detailed explanation.
> > > 
> > > I agree it is important to resolve the "now" problem. I guess the KVM lapic
> > > deadline timer has the "now" problem as well.
> > 
> > I think so. And quite gratuitously so, since it just does:
> > 
> >         now = ktime_get();
> >         guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > 
> > 
> > Couldn't that trivially be changed to kvm_get_monotonic_and_clockread()?
> 
> The core idea is to always capture the pair of (tsc, ns) at exactly the same
> time point.
> 
> I have no idea how much accuracy it can improve, considering the extra costs to
> inject the timer interrupt into the vCPU.

Right. It's probably in the noise most of the time, unless you're
unlucky enough to get preempted between the two TSC reads which are
supposed to be happening "at the same time".
> 

> > I conveniently ducked this question in my patch by only supporting the
> > CONSTANT_TSC case, and not the case where we happen to know the
> > (potentially different) TSC frequencies on all the different pCPUs and
> > vCPUs.
> 
> This is also my question that why to support only the CONSTANT_TSC case.
> 
> For the lapic timer case:
> 
> The timer is always calculated based on the *current* vCPU's tsc virtualization,
> regardless CONSTANT_TSC or not.
> 
> For the xen timer case:
> 
> Why not always calculate the expire based on the *current* vCPU's time
> virtualization? That is, why not always use the current vCPU's hv_clock,
> regardless CONSTANT_TSC/masteclock?

The simple answer is because I wasn't sure it would work correctly in
all cases, and didn't *care* enough about the non-CONSTANT_TSC case to
prove it to myself.

Let's think about it...

In the non-CONSTANT_TSC case, each physical CPU can have a different
TSC frequency, yes? And KVM has a cpufreq notifier which triggers when
the TSC frequency changes, and makes a KVM_REQ_CLOCK_UPDATE request to any vCPU
running on the affected pCPU. With an associated IPI to ensure the vCPU
exits guest mode and will process the update before executing any
further guest code.

If a vCPU had *previously* been running on the affected pCPU but wasn't
running when the notifier happened, then kvm_arch_vcpu_load() will make
a KVM_REQ_GLOBAL_CLOCK_UPDATE request, which will immediately update
the vCPU in question, and then trigger a deferred KVM_REQ_CLOCK_UPDATE
for the others.

So the vCPU itself, in guest mode, is always going to see *correct*
pvclock information corresponding to the pCPU it is running on at the
time.

(I *believe* the way this works is that when a vCPU runs on a pCPU
which has a TSC frequency lower than the vCPU should have, it runs in
'always catchup' mode. Where the TSC offset is bumped *every* time the
vCPU enters guest mode, so the TSC is about right on every entry, might
seem to run a little slow if the vCPU does a tight loop of rdtsc, but
will catch up again on next vmexit/entry?)

But we aren't talking about the vCPU running in guest mode. The code in
kvm_xen_start_timer() and in start_sw_tscdeadline() is running in the
host kernel. How can we be sure that it's running on the *same*
physical CPU that the vCPU was previously running on, and thus how can
we be sure that the vcpu->arch.hv_clock is valid with respect to a
rdtsc on the current pCPU? I don't know that we can know that.

As far as I can tell, the code in start_sw_tscdeadline() makes no
attempt to do the 'catchup' thing, and just converts the pCPU's TSC to
guest TSC using kvm_read_l1_tsc() — which uses a multiplier that's set
once and *never* recalculated when the host TSC frequency changes.

On the whole, now that I *have* thought about it, I'm even more convinced I
was right in the first place that I didn't want to know :)

I think I stand by my original decision that the Xen timer code in the
non-CONSTANT_TSC case can just use get_kvmclock_ns(). The "now" problem
is going to be in the noise if the TSC isn't constant anyway, and we
need to fix the drift and jumps of get_kvmclock_ns() *anyway* rather
than adding a temporary special case for the Xen timers.

> That is: kvm lapic method with kvm_get_monotonic_and_clockread().
> 
> > 
> > 
> > > 
> > > E.g., according to the KVM lapic deadline timer, all values are based on (1) the
> > > tsc value, (2)on the current vCPU.
> > > 
> > > 
> > > 1949 static void start_sw_tscdeadline(struct kvm_lapic *apic)
> > > 1950 {
> > > 1951         struct kvm_timer *ktimer = &apic->lapic_timer;
> > > 1952         u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
> > > 1953         u64 ns = 0;
> > > 1954         ktime_t expire;
> > > 1955         struct kvm_vcpu *vcpu = apic->vcpu;
> > > 1956         unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
> > > 1957         unsigned long flags;
> > > 1958         ktime_t now;
> > > 1959
> > > 1960         if (unlikely(!tscdeadline || !this_tsc_khz))
> > > 1961                 return;
> > > 1962
> > > 1963         local_irq_save(flags);
> > > 1964
> > > 1965         now = ktime_get();
> > > 1966         guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > > 1967
> > > 1968         ns = (tscdeadline - guest_tsc) * 1000000ULL;
> > > 1969         do_div(ns, this_tsc_khz);
> > > 
> > > 
> > > Sorry if I make the question very confusing. The core question is: where and
> > > from which clocksource the abs nanosecond value is from? What will happen if the
> > > Xen VM uses HPET as clocksource, while xen timer as clock event?
> > 
> > If the guest uses HPET as clocksource and Xen timer as clockevents,
> > then keeping itself in sync is the *guest's* problem. The Xen timer is
> > defined in terms of nanoseconds since guest start, as provided in the
> > pvclock information described above. Hope that helps!
> > 
> 
> The "in terms of nanoseconds since guest start" refers to *one* global value.
> Should we use wallclock when we are referring to a global value shared by all vCPUs?
> 
> 
> Based on the following piece of code, I do not think we may assume all vCPUs
> have the same pvclock at the same time point: line 104-108, when
> PVCLOCK_TSC_STABLE_BIT is not set.
> 

The *result* of calculating the pvclock should be the same on all vCPUs
at any given moment in time.

The precise *calculation* may differ, depending on the frequency of the
TSC for that particular vCPU and the last time the pvclock information
was created for that vCPU.


> 
>  67 static __always_inline
>  68 u64 __pvclock_clocksource_read(struct pvclock_vcpu_time_info *src, bool dowd)
>  69 {
>  70         unsigned version;
>  71         u64 ret;
>  72         u64 last;
>  73         u8 flags;
>  74
>  75         do {
>  76                 version = pvclock_read_begin(src);
>  77                 ret = __pvclock_read_cycles(src, rdtsc_ordered());
>  78                 flags = src->flags;
>  79         } while (pvclock_read_retry(src, version));
> ... ...
> 104         last = raw_atomic64_read(&last_value);
> 105         do {
> 106                 if (ret <= last)
> 107                         return last;
> 108         } while (!raw_atomic64_try_cmpxchg(&last_value, &last, ret));
> 109
> 110         return ret;
> 111 }
> 
> 
> That's why I appreciate a definition of the abs nanoseconds used by the xen
> timer (e.g., derived from pvclock). If it is per-vCPU, we may not use it for a
> global "in terms of nanoseconds since guest start", when PVCLOCK_TSC_STABLE_BIT
> is not set.

It is only per-vCPU if the vCPUs have *different* TSC frequencies.
That's because of the scaling; the guest calculates the nanoseconds
from the *guest* TSC of course, scaled according to the pvclock
information given to the guest by KVM.

As discussed and demonstrated by http://david.woodhou.se/tsdrift.c , if
KVM scales directly to nanoseconds from the *host* TSC at its known
frequency, that introduces a systemic drift between what the guest
calculates, and what KVM calculates — even in the CONSTANT_TSC case.

How do we reconcile the two? Well, it makes no sense for the definition
of the pvclock to be something that the guest *cannot* calculate, so
obviously KVM must do the same calculations the guest does; scale to
the guest TSC (kvm_read_l1_tsc()) and then apply the same pvclock
information from vcpu->arch.hv_clock to get the nanoseconds.
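
Put another way (an illustration only, not the actual code; 'host_hv_clock'
here stands for the mul/shift pair that __get_kvmclock() builds for the
*host* TSC frequency):

	/* One conversion: host TSC -> ns. What get_kvmclock_ns() does today. */
	kvm_ns    = __pvclock_read_cycles(&host_hv_clock, host_tsc);

	/* Two conversions: host TSC -> guest TSC -> ns. What the guest does. */
	guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
	guest_ns  = __pvclock_read_cycles(&vcpu->arch.hv_clock, guest_tsc);

	/*
	 * Each mul/shift stage is only an approximation of a frequency
	 * ratio, and the approximations don't compose exactly, so kvm_ns
	 * and guest_ns drift apart linearly as host_tsc grows.
	 */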

In the sane world where the guest vCPUs all have the *same* TSC
frequency, that's fine. The kvmclock isn't *really* per-vCPU because
they're all the same. 

If the VMM sets up different vCPUs to have different TSC frequencies
then yes, their kvmclock will drift slightly apart over time. That
might be the *one* case where I will accept that the guest pvclock
might ever change, even in the CONSTANT_TSC environment (without host
suspend or any other nonsense).

In that patch (the one I started typing on Monday and *still* haven't got
round to finishing because other things keep catching fire), I'm using the
*KVM-wide* guest TSC frequency as the definition for the kvmclock.
Dongli Zhang Nov. 8, 2023, 6:22 p.m. UTC | #7
Hi David,

On 11/8/23 07:25, David Woodhouse wrote:
> On Tue, 2023-11-07 at 17:43 -0800, Dongli Zhang wrote:
>> Hi David,
>>
>> On 11/7/23 15:24, David Woodhouse wrote:
>>> On Tue, 2023-11-07 at 15:07 -0800, Dongli Zhang wrote:
>>>> Thank you very much for the detailed explanation.
>>>>
>>>> I agree it is important to resolve the "now" problem. I guess the KVM lapic
>>>> deadline timer has the "now" problem as well.
>>>
>>> I think so. And quite gratuitously so, since it just does:
>>>
>>>         now = ktime_get();
>>>         guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>>>
>>>
>>> Couldn't that trivially be changed to kvm_get_monotonic_and_clockread()?
>>
>> The core idea is to always capture the pair of (tsc, ns) at exactly the same
>> time point.
>>
>> I have no idea how much accuracy it can improve, considering the extra costs to
>> inject the timer interrupt into the vCPU.
> 
> Right. It's probably in the noise most of the time, unless you're
> unlucky enough to get preempted between the two TSC reads which are
> supposed to be happening "at the same time".
>>
> 
>>> I conveniently ducked this question in my patch by only supporting the
>>> CONSTANT_TSC case, and not the case where we happen to know the
>>> (potentially different) TSC frequencies on all the different pCPUs and
>>> vCPUs.
>>
>> This is also my question that why to support only the CONSTANT_TSC case.
>>
>> For the lapic timer case:
>>
>> The timer is always calculated based on the *current* vCPU's tsc virtualization,
>> regardless CONSTANT_TSC or not.
>>
>> For the xen timer case:
>>
>> Why not always calculate the expire based on the *current* vCPU's time
>> virtualization? That is, why not always use the current vCPU's hv_clock,
>> regardless CONSTANT_TSC/masteclock?
> 
> The simple answer is because I wasn't sure it would work correctly in
> all cases, and didn't *care* enough about the non-CONSTANT_TSC case to
> prove it to myself.
> 
> Let's think about it...
> 
> In the non-CONSTANT_TSC case, each physical CPU can have a different
> TSC frequency, yes? And KVM has a cpufreq notifier which triggers when
> the TSC changes, and make a KVM_REQ_CLOCK_UPDATE request to any vCPU
> running on the affected pCPU. With an associated IPI to ensure the vCPU
> exits guest mode and will processes the update before executing any
> further guest code.
> 
> If a vCPU had *previously* been running on the affected pCPU but wasn't
> running when the notifier happened, then kvm_arch_vcpu_load() will make
> a KVM_REQ_GLOBAL_CLOCK_UPDATE request, which will immediately update
> the vCPU in question, and then trigger a deferred KVM_REQ_CLOCK_UPDATE
> for the others.
> 
> So the vCPU itself, in guest mode, is always going to see *correct*
> pvclock information corresponding to the pCPU it is running on at the
> time.
> 
> (I *believe* the way this works is that when a vCPU runs on a pCPU
> which has a TSC frequency lower than the vCPU should have, it runs in
> 'always catchup' mode. Where the TSC offset is bumped *every* time the
> vCPU enters guest mode, so the TSC is about right on every entry, might
> seem to run a little slow if the vCPU does a tight loop of rdtsc, but
> will catch up again on next vmexit/entry?)
> 
> But we aren't talking about the vCPU running in guest mode. The code in
> kvm_xen_start_timer() and in start_sw_tscdeadline() is running in the
> host kernel. How can we be sure that it's running on the *same*
> physical CPU that the vCPU was previously running on, and thus how can
> we be sure that the vcpu->arch.hv_clock is valid with respect to a
> rdtsc on the current pCPU? I don't know that we can know that.
> 
> As far as I can tell, the code in start_sw_tscdeadline() makes no
> attempt to do the 'catchup' thing, and just converts the pCPU's TSC to
> guest TSC using kvm_read_l1_tsc() — which uses a multiplier that's set
> once and *never* recalculated when the host TSC frequency changes.
> 
> On the whole, now I *have* thought about it, I'm even more convinced I
> was right in the first place that I didn't want to know :)
> 
> I think I stand by my original decision that the Xen timer code in the
> non-CONSTANT_TSC case can just use get_kvmclock_ns(). The "now" problem
> is going to be in the noise if the TSC isn't constant anyway, and we
> need to fix the drift and jumps of get_kvmclock_ns() *anyway* rather
> than adding a temporary special case for the Xen timers.
> 
>> That is: kvm lapic method with kvm_get_monotonic_and_clockread().
>>
>>>
>>>
>>>>
>>>> E.g., according to the KVM lapic deadline timer, all values are based on (1) the
>>>> tsc value, (2)on the current vCPU.
>>>>
>>>>
>>>> 1949 static void start_sw_tscdeadline(struct kvm_lapic *apic)
>>>> 1950 {
>>>> 1951         struct kvm_timer *ktimer = &apic->lapic_timer;
>>>> 1952         u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
>>>> 1953         u64 ns = 0;
>>>> 1954         ktime_t expire;
>>>> 1955         struct kvm_vcpu *vcpu = apic->vcpu;
>>>> 1956         unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
>>>> 1957         unsigned long flags;
>>>> 1958         ktime_t now;
>>>> 1959
>>>> 1960         if (unlikely(!tscdeadline || !this_tsc_khz))
>>>> 1961                 return;
>>>> 1962
>>>> 1963         local_irq_save(flags);
>>>> 1964
>>>> 1965         now = ktime_get();
>>>> 1966         guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>>>> 1967
>>>> 1968         ns = (tscdeadline - guest_tsc) * 1000000ULL;
>>>> 1969         do_div(ns, this_tsc_khz);
>>>>
>>>>
>>>> Sorry if I made the question very confusing. The core question is: where does
>>>> the absolute nanosecond value come from, and from which clocksource? What will
>>>> happen if the Xen VM uses HPET as its clocksource, but the Xen timer as its
>>>> clock event device?
>>>
>>> If the guest uses HPET as clocksource and Xen timer as clockevents,
>>> then keeping itself in sync is the *guest's* problem. The Xen timer is
>>> defined in terms of nanoseconds since guest start, as provided in the
>>> pvclock information described above. Hope that helps!
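>>>
>>> As an illustration (a hand-written sketch, not taken from any real
>>> guest; pvclock_now_ns, delta_ns and cpu are assumed local variables),
>>> a guest arms the singleshot timer with an absolute value in exactly
>>> those kvmclock nanoseconds:
>>>
>>> 	/* Sketch: fire 'delta_ns' after the pvclock "now" value */
>>> 	struct vcpu_set_singleshot_timer single = {
>>> 		.timeout_abs_ns = pvclock_now_ns + delta_ns,
>>> 		.flags = 0,
>>> 	};
>>> 	HYPERVISOR_vcpu_op(VCPUOP_set_singleshot_timer, cpu, &single);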
>>>
>>
>> The "in terms of nanoseconds since guest start" refers to *one* global value.
>> Should we use wallclock when we are referring to a global value shared by all vCPUs?
>>
>>
>> Based on the following piece of code, I do not think we can assume all vCPUs
>> have the same pvclock value at the same point in time: lines 104-108, when
>> PVCLOCK_TSC_STABLE_BIT is not set.
>>
> 
> The *result* of calculating the pvclock should be the same on all vCPUs
> at any given moment in time.
> 
> The precise *calculation* may differ, depending on the frequency of the
> TSC for that particular vCPU and the last time the pvclock information
> was created for that vCPU.
> 
> 
>>
>>  67 static __always_inline
>>  68 u64 __pvclock_clocksource_read(struct pvclock_vcpu_time_info *src, bool dowd)
>>  69 {
>>  70         unsigned version;
>>  71         u64 ret;
>>  72         u64 last;
>>  73         u8 flags;
>>  74
>>  75         do {
>>  76                 version = pvclock_read_begin(src);
>>  77                 ret = __pvclock_read_cycles(src, rdtsc_ordered());
>>  78                 flags = src->flags;
>>  79         } while (pvclock_read_retry(src, version));
>> ... ...
>> 104         last = raw_atomic64_read(&last_value);
>> 105         do {
>> 106                 if (ret <= last)
>> 107                         return last;
>> 108         } while (!raw_atomic64_try_cmpxchg(&last_value, &last, ret));
>> 109
>> 110         return ret;
>> 111 }
>>
>>
>> That's why I would appreciate a definition of the absolute nanoseconds used by
>> the Xen timer (e.g., derived from the pvclock). If it is per-vCPU, we may not
>> be able to use it as a global "nanoseconds since guest start" when
>> PVCLOCK_TSC_STABLE_BIT is not set.
> 
> It is only per-vCPU if the vCPUs have *different* TSC frequencies.
> That's because of the scaling; the guest calculates the nanoseconds
> from the *guest* TSC of course, scaled according to the pvclock
> information given to the guest by KVM.
> 
> As discussed and demonstrated by http://david.woodhou.se/tsdrift.c , if
> KVM scales directly to nanoseconds from the *host* TSC at its known
> frequency, that introduces a systemic drift between what the guest
> calculates, and what KVM calculates — even in the CONSTANT_TSC case.
> 
> How do we reconcile the two? Well, it makes no sense for the definition
> of the pvclock to be something that the guest *cannot* calculate, so
> obviously KVM must do the same calculation the guest does: scale to
> the guest TSC (kvm_read_l1_tsc()) and then apply the same pvclock
> information from vcpu->arch.hv_clock to get the nanoseconds.
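> 
> (A rough sketch of that guest-side calculation, not the exact kernel
> code; mul_u64_u32_shr() here stands in for the real pvclock_scale_delta()
> helper, and the function name is made up:
> 
> 	static u64 guest_kvmclock_ns(const struct pvclock_vcpu_time_info *src,
> 				     u64 guest_tsc)
> 	{
> 		u64 delta = guest_tsc - src->tsc_timestamp;
> 
> 		/* Apply the mul/shift that KVM advertised in the pvclock info */
> 		if (src->tsc_shift < 0)
> 			delta >>= -src->tsc_shift;
> 		else
> 			delta <<= src->tsc_shift;
> 
> 		return src->system_time +
> 		       mul_u64_u32_shr(delta, src->tsc_to_system_mul, 32);
> 	}
> 
> so the result depends only on the guest TSC and the pvclock parameters
> KVM last wrote for that vCPU.)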
> 
> In the sane world where the guest vCPUs all have the *same* TSC
> frequency, that's fine. The kvmclock isn't *really* per-vCPU because
> they're all the same. 
> 
> If the VMM sets up different vCPUs to have different TSC frequencies
> then yes, their kvmclock will drift slightly apart over time. That
> might be the *one* case where I will accept that the guest pvclock
> might ever change, even in the CONSTANT_TSC environment (without host
> suspend or any other nonsense).
> 
> In that patch, which I started typing on Monday and *still* haven't got
> round to finishing because other things keep catching fire, I'm using
> the *KVM-wide* guest TSC frequency as the definition for the kvmclock.
> 
> 

Thank you very much for the explanation.

I understand you may use different methods to obtain the 'expire' value in
different cases.

Maybe add some comments to the KVM Xen timer emulation code? E.g.:

- When the TSC is reliable, follow the standard/protocol that the Xen timer is
based on the per-vCPU pvclock: that is, always scale host_tsc with
kvm_read_l1_tsc().

- However, sometimes the TSC is not reliable. In that case, use the legacy
method get_kvmclock_ns().

This may help developers understand the standard/protocol used by the Xen timer.
The core idea will be: the implementation tries to follow the Xen timer
nanoseconds definition (per-vCPU pvclock), and may fall back to a legacy
solution in special cases, in order to improve the accuracy.

TBH, I had never thought about what the definition of a nanosecond is in the
Xen timer (even though I used to work on Xen, and am still working on some Xen
issues).

Thank you very much!

Dongli Zhang
David Woodhouse Nov. 8, 2023, 7:14 p.m. UTC | #8
On Wed, 2023-11-08 at 10:22 -0800, Dongli Zhang wrote:
> 
> Thank you very much for the explanation.
> 
> I understand you may use different methods to obtain the 'expire' value in
> different cases.
> 
> Maybe add some comments to the KVM Xen timer emulation code? E.g.:
> 
> - When the TSC is reliable, follow the standard/protocol that the Xen timer is
> based on the per-vCPU pvclock: that is, always scale host_tsc with
> kvm_read_l1_tsc().
> 
> - However, sometimes the TSC is not reliable. In that case, use the legacy
> method get_kvmclock_ns().

After the patch we're discussing here, kvm_xen_start_timer() is *more*
comment than code. I think it covers both of the points you mention
above, and more.


static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,
				bool linux_wa)
{
	uint64_t guest_now;
	int64_t kernel_now, delta;

	/*
	 * The guest provides the requested timeout in absolute nanoseconds
	 * of the KVM clock — as *it* sees it, based on the scaled TSC and
	 * the pvclock information provided by KVM.
	 *
	 * The kernel doesn't support hrtimers based on CLOCK_MONOTONIC_RAW
	 * so use CLOCK_MONOTONIC. In the timescales covered by timers, the
	 * difference won't matter much as there is no cumulative effect.
	 *
	 * Calculate the time for some arbitrary point in time around "now"
	 * in terms of both kvmclock and CLOCK_MONOTONIC. Calculate the
	 * delta between the kvmclock "now" value and the guest's requested
	 * timeout, apply the "Linux workaround" described below, and add
	 * the resulting delta to the CLOCK_MONOTONIC "now" value, to get
	 * the absolute CLOCK_MONOTONIC time at which the timer should
	 * fire.
	 */
	if (vcpu->arch.hv_clock.version && vcpu->kvm->arch.use_master_clock &&
	    static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
		uint64_t host_tsc, guest_tsc;

		if (!IS_ENABLED(CONFIG_64BIT) ||
		    !kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) {
			/*
			 * Don't fall back to get_kvmclock_ns() because it's
			 * broken; it has a systemic error in its results
			 * because it scales directly from host TSC to
			 * nanoseconds, and doesn't scale first to guest TSC
			 * and *then* to nanoseconds as the guest does.
			 *
			 * There is a small error introduced here because time
			 * continues to elapse between the ktime_get() and the
			 * subsequent rdtsc(). But not the systemic drift due
			 * to get_kvmclock_ns().
			 */
			kernel_now = ktime_get(); /* This is CLOCK_MONOTONIC */
			host_tsc = rdtsc();
		}

		/* Calculate the guest kvmclock as the guest would do it. */
		guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
		guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock,
						  guest_tsc);
	} else {
		/*
		 * Without CONSTANT_TSC, get_kvmclock_ns() is the only option.
		 *
		 * Also if the guest PV clock hasn't been set up yet, as is
		 * likely to be the case during migration when the vCPU has
		 * not been run yet. It would be possible to calculate the
		 * scaling factors properly in that case but there's not much
		 * point in doing so. The get_kvmclock_ns() drift accumulates
		 * over time, so it's OK to use it at startup. Besides, on
		 * migration there's going to be a little bit of skew in the
		 * precise moment at which timers fire anyway. Often they'll
		 * be in the "past" by the time the VM is running again after
		 * migration.
		 */
		guest_now = get_kvmclock_ns(vcpu->kvm);
		kernel_now = ktime_get();
	}

	delta = guest_abs - guest_now;

	/* Xen has a 'Linux workaround' in do_set_timer_op() which
	 * checks for negative absolute timeout values (caused by
	 * integer overflow), and for values about 13 days in the
	 * future (2^50ns) which would be caused by jiffies
	 * overflow. For those cases, it sets the timeout 100ms in
	 * the future (not *too* soon, since if a guest really did
	 * set a long timeout on purpose we don't want to keep
	 * churning CPU time by waking it up).
	 */
	if (linux_wa) {
		if ((unlikely((int64_t)guest_abs < 0 ||
			      (delta > 0 && (uint32_t) (delta >> 50) != 0)))) {
			delta = 100 * NSEC_PER_MSEC;
			guest_abs = guest_now + delta;
		}
	}

	/*
	 * Avoid races with the old timer firing. Checking timer_expires
	 * to avoid calling hrtimer_cancel() will only have false positives
	 * so is fine.
	 */
	if (vcpu->arch.xen.timer_expires)
		hrtimer_cancel(&vcpu->arch.xen.timer);

	atomic_set(&vcpu->arch.xen.timer_pending, 0);
	vcpu->arch.xen.timer_expires = guest_abs;

	if (delta <= 0) {
		xen_timer_callback(&vcpu->arch.xen.timer);
	} else {
		hrtimer_start(&vcpu->arch.xen.timer,
			      ktime_add_ns(kernel_now, delta),
			      HRTIMER_MODE_ABS_HARD);
	}
}



> This may help developers understand the standard/protocol used by the Xen timer.
> The core idea will be: the implementation tries to follow the Xen timer
> nanoseconds definition (per-vCPU pvclock), and may fall back to a legacy
> solution in special cases, in order to improve the accuracy.

This isn't special to the Xen timers. We really are just using the
kvmclock for this. It's the same thing.

> TBH, I had never thought about what the definition of a nanosecond is in the
> Xen timer (even though I used to work on Xen, and am still working on some Xen
> issues).

You never had to think about it before because it was never quite so
catastrophically broken as it is under KVM. Nanoseconds were
nanoseconds, and we didn't have the problems that come from scaling and
calculating them three *different* ways and periodically clamping the
guest-visible one to a fourth. :)
diff mbox series

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6eaab714d90a..e479637af42c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2844,7 +2844,11 @@  static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
 	return v * clock->mult;
 }
 
-static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
+/*
+ * As with get_kvmclock_base_ns(), this counts from boot time, at the
+ * frequency of CLOCK_MONOTONIC_RAW (hence adding gtod->offs_boot).
+ */
+static int do_kvmclock_base(s64 *t, u64 *tsc_timestamp)
 {
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 	unsigned long seq;
@@ -2863,6 +2867,29 @@  static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
 	return mode;
 }
 
+/*
+ * This calculates CLOCK_MONOTONIC at the time of the TSC snapshot, with
+ * no boot time offset.
+ */
+static int do_monotonic(s64 *t, u64 *tsc_timestamp)
+{
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+	unsigned long seq;
+	int mode;
+	u64 ns;
+
+	do {
+		seq = read_seqcount_begin(&gtod->seq);
+		ns = gtod->clock.base_cycles;
+		ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
+		ns >>= gtod->clock.shift;
+		ns += ktime_to_ns(gtod->clock.offset);
+	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
+	*t = ns;
+
+	return mode;
+}
+
 static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
 {
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
@@ -2884,18 +2911,42 @@  static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
 	return mode;
 }
 
-/* returns true if host is using TSC based clocksource */
+/*
+ * Calculates the kvmclock_base_ns (CLOCK_MONOTONIC_RAW + boot time) and
+ * reports the TSC value from which it did so. Returns true if host is
+ * using TSC based clocksource.
+ */
 static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
 {
 	/* checked again under seqlock below */
 	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
 		return false;
 
-	return gtod_is_based_on_tsc(do_monotonic_raw(kernel_ns,
-						      tsc_timestamp));
+	return gtod_is_based_on_tsc(do_kvmclock_base(kernel_ns,
+						     tsc_timestamp));
 }
 
-/* returns true if host is using TSC based clocksource */
+/*
+ * Calculates CLOCK_MONOTONIC and reports the TSC value from which it did
+ * so. Returns true if host is using TSC based clocksource.
+ */
+bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
+{
+	/* checked again under seqlock below */
+	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
+		return false;
+
+	return gtod_is_based_on_tsc(do_monotonic(kernel_ns,
+						 tsc_timestamp));
+}
+
+/*
+ * Calculates CLOCK_REALTIME and reports the TSC value from which it did
+ * so. Returns true if host is using TSC based clocksource.
+ *
+ * DO NOT USE this for anything related to migration. You want CLOCK_TAI
+ * for that.
+ */
 static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
 					   u64 *tsc_timestamp)
 {
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 1e7be1f6ab29..c08c6f729965 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -293,6 +293,7 @@  static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
 void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
 
 u64 get_kvmclock_ns(struct kvm *kvm);
+bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp);
 
 int kvm_read_guest_virt(struct kvm_vcpu *vcpu,
 	gva_t addr, void *val, unsigned int bytes,
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 751d9a984668..e3d2d63eef34 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -24,6 +24,7 @@ 
 #include <xen/interface/sched.h>
 
 #include <asm/xen/cpuid.h>
+#include <asm/pvclock.h>
 
 #include "cpuid.h"
 #include "trace.h"
@@ -158,8 +159,93 @@  static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
+static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,
+				bool linux_wa)
 {
+	uint64_t guest_now;
+	int64_t kernel_now, delta;
+
+	/*
+	 * The guest provides the requested timeout in absolute nanoseconds
+	 * of the KVM clock — as *it* sees it, based on the scaled TSC and
+	 * the pvclock information provided by KVM.
+	 *
+	 * The kernel doesn't support hrtimers based on CLOCK_MONOTONIC_RAW
+	 * so use CLOCK_MONOTONIC. In the timescales covered by timers, the
+	 * difference won't matter much as there is no cumulative effect.
+	 *
+	 * Calculate the time for some arbitrary point in time around "now"
+	 * in terms of both kvmclock and CLOCK_MONOTONIC. Calculate the
+	 * delta between the kvmclock "now" value and the guest's requested
+	 * timeout, apply the "Linux workaround" described below, and add
+	 * the resulting delta to the CLOCK_MONOTONIC "now" value, to get
+	 * the absolute CLOCK_MONOTONIC time at which the timer should
+	 * fire.
+	 */
+	if (vcpu->arch.hv_clock.version && vcpu->kvm->arch.use_master_clock &&
+	    static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+		uint64_t host_tsc, guest_tsc;
+
+		if (!IS_ENABLED(CONFIG_64BIT) ||
+		    !kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) {
+			/*
+			 * Don't fall back to get_kvmclock_ns() because it's
+			 * broken; it has a systemic error in its results
+			 * because it scales directly from host TSC to
+			 * nanoseconds, and doesn't scale first to guest TSC
+			 * and *then* to nanoseconds as the guest does.
+			 *
+			 * There is a small error introduced here because time
+			 * continues to elapse between the ktime_get() and the
+			 * subsequent rdtsc(). But not the systemic drift due
+			 * to get_kvmclock_ns().
+			 */
+			kernel_now = ktime_get(); /* This is CLOCK_MONOTONIC */
+			host_tsc = rdtsc();
+		}
+
+		/* Calculate the guest kvmclock as the guest would do it. */
+		guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+		guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock,
+						  guest_tsc);
+	} else {
+		/*
+		 * Without CONSTANT_TSC, get_kvmclock_ns() is the only option.
+		 *
+		 * Also if the guest PV clock hasn't been set up yet, as is
+		 * likely to be the case during migration when the vCPU has
+		 * not been run yet. It would be possible to calculate the
+		 * scaling factors properly in that case but there's not much
+		 * point in doing so. The get_kvmclock_ns() drift accumulates
+		 * over time, so it's OK to use it at startup. Besides, on
+		 * migration there's going to be a little bit of skew in the
+		 * precise moment at which timers fire anyway. Often they'll
+		 * be in the "past" by the time the VM is running again after
+		 * migration.
+		 */
+		guest_now = get_kvmclock_ns(vcpu->kvm);
+		kernel_now = ktime_get();
+	}
+
+	delta = guest_abs - guest_now;
+
+	/* Xen has a 'Linux workaround' in do_set_timer_op() which
+	 * checks for negative absolute timeout values (caused by
+	 * integer overflow), and for values about 13 days in the
+	 * future (2^50ns) which would be caused by jiffies
+	 * overflow. For those cases, it sets the timeout 100ms in
+	 * the future (not *too* soon, since if a guest really did
+	 * set a long timeout on purpose we don't want to keep
+	 * churning CPU time by waking it up).
+	 */
+	if (linux_wa) {
+		if ((unlikely((int64_t)guest_abs < 0 ||
+			      (delta > 0 && (uint32_t) (delta >> 50) != 0)))) {
+			delta = 100 * NSEC_PER_MSEC;
+			guest_abs = guest_now + delta;
+		}
+	}
+
 	/*
 	 * Avoid races with the old timer firing. Checking timer_expires
 	 * to avoid calling hrtimer_cancel() will only have false positives
@@ -171,12 +257,11 @@  static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_
 	atomic_set(&vcpu->arch.xen.timer_pending, 0);
 	vcpu->arch.xen.timer_expires = guest_abs;
 
-	if (delta_ns <= 0) {
+	if (delta <= 0) {
 		xen_timer_callback(&vcpu->arch.xen.timer);
 	} else {
-		ktime_t ktime_now = ktime_get();
 		hrtimer_start(&vcpu->arch.xen.timer,
-			      ktime_add_ns(ktime_now, delta_ns),
+			      ktime_add_ns(kernel_now, delta),
 			      HRTIMER_MODE_ABS_HARD);
 	}
 }
@@ -945,8 +1030,7 @@  int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 		/* Start the timer if the new value has a valid vector+expiry. */
 		if (data->u.timer.port && data->u.timer.expires_ns)
 			kvm_xen_start_timer(vcpu, data->u.timer.expires_ns,
-					    data->u.timer.expires_ns -
-					    get_kvmclock_ns(vcpu->kvm));
+					    false);
 
 		r = 0;
 		break;
@@ -1389,7 +1473,6 @@  static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
 {
 	struct vcpu_set_singleshot_timer oneshot;
 	struct x86_exception e;
-	s64 delta;
 
 	if (!kvm_xen_timer_enabled(vcpu))
 		return false;
@@ -1423,9 +1506,7 @@  static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
 			return true;
 		}
 
-		/* A delta <= 0 results in an immediate callback, which is what we want */
-		delta = oneshot.timeout_abs_ns - get_kvmclock_ns(vcpu->kvm);
-		kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, delta);
+		kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, false);
 		*r = 0;
 		return true;
 
@@ -1449,25 +1530,7 @@  static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout,
 		return false;
 
 	if (timeout) {
-		uint64_t guest_now = get_kvmclock_ns(vcpu->kvm);
-		int64_t delta = timeout - guest_now;
-
-		/* Xen has a 'Linux workaround' in do_set_timer_op() which
-		 * checks for negative absolute timeout values (caused by
-		 * integer overflow), and for values about 13 days in the
-		 * future (2^50ns) which would be caused by jiffies
-		 * overflow. For those cases, it sets the timeout 100ms in
-		 * the future (not *too* soon, since if a guest really did
-		 * set a long timeout on purpose we don't want to keep
-		 * churning CPU time by waking it up).
-		 */
-		if (unlikely((int64_t)timeout < 0 ||
-			     (delta > 0 && (uint32_t) (delta >> 50) != 0))) {
-			delta = 100 * NSEC_PER_MSEC;
-			timeout = guest_now + delta;
-		}
-
-		kvm_xen_start_timer(vcpu, timeout, delta);
+		kvm_xen_start_timer(vcpu, timeout, true);
 	} else {
 		kvm_xen_stop_timer(vcpu);
 	}