From patchwork Thu Dec 14 16:54:48 2023
X-Patchwork-Submitter: David Woodhouse
X-Patchwork-Id: 13493280
Subject: [PATCH v3] KVM: x86/xen: improve accuracy of Xen timers
From: David Woodhouse
To: kvm@vger.kernel.org, linux-kernel
Cc: Paul Durrant, Sean Christopherson, Paolo Bonzini, Thomas Gleixner,
 Ingo Molnar, Borislav Petkov, Dave Hansen, x86@kernel.org, "H. Peter Anvin"
Date: Thu, 14 Dec 2023 16:54:48 +0000
User-Agent: Evolution 3.44.4-0ubuntu2

From: David Woodhouse

A test program such as http://david.woodhou.se/timerlat.c confirms user
reports that timers are increasingly inaccurate as the lifetime of a
guest increases. Reporting the actual delay observed when asking for
100µs of sleep, it starts off OK on a newly-launched guest but gets
worse over time, giving incorrect sleep times:

root@ip-10-0-193-21:~# ./timerlat -c -n 5
00000000 latency 103243/100000 (3.2430%)
00000001 latency 103243/100000 (3.2430%)
00000002 latency 103242/100000 (3.2420%)
00000003 latency 103245/100000 (3.2450%)
00000004 latency 103245/100000 (3.2450%)

The biggest problem is that get_kvmclock_ns() returns inaccurate values
when the guest TSC is scaled. The guest sees a TSC value scaled from the
host TSC by a mul/shift conversion (hopefully done in hardware). The
guest then converts that guest TSC value into nanoseconds using the
mul/shift conversion given to it by the KVM pvclock information. But
get_kvmclock_ns() performs only a single conversion directly from host
TSC to nanoseconds, giving a different result. A test program at
http://david.woodhou.se/tsdrift.c demonstrates the cumulative error over
a day.
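To make the arithmetic concrete, here is a minimal user-space sketch (for
illustration only, not part of the patch; the frequencies and the scale()
and calc_scale() helpers are invented, and only mimic the shape of the
pvclock mul/shift fixed-point conversions rather than KVM's real code).
It compares a single direct host-TSC-to-nanoseconds conversion, which is
roughly what get_kvmclock_ns() does, with scaling to guest TSC first and
then to nanoseconds, which is what the guest does:

#include <stdint.h>
#include <stdio.h>

/* pvclock-style conversion: pre-shift the tick count, then apply a 32.32
 * fixed-point multiplier. Both steps truncate. */
static uint64_t scale(uint64_t ticks, uint32_t mul, int shift)
{
        if (shift < 0)
                ticks >>= -shift;
        else
                ticks <<= shift;
        return (uint64_t)(((unsigned __int128)ticks * mul) >> 32);
}

/* Derive mul/shift so that scale(ticks, mul, shift) ~= ticks * to_hz / from_hz. */
static void calc_scale(uint64_t from_hz, uint64_t to_hz, uint32_t *mul, int *shift)
{
        unsigned __int128 ratio = ((unsigned __int128)to_hz << 32) / from_hz;
        int s = 0;

        while (ratio >= (1ULL << 32)) {  /* must fit in a 32-bit multiplier */
                ratio >>= 1;
                s++;
        }
        while (ratio < (1ULL << 31)) {   /* keep as many significant bits as possible */
                ratio <<= 1;
                s--;
        }
        *mul = (uint32_t)ratio;
        *shift = s;
}

int main(void)
{
        const uint64_t host_hz = 2999999000ULL;   /* made-up host TSC frequency */
        const uint64_t guest_hz = 2400000000ULL;  /* made-up (scaled) guest TSC frequency */
        const uint64_t ns_hz = 1000000000ULL;
        uint32_t h2g_mul, h2ns_mul, g2ns_mul;
        int h2g_shift, h2ns_shift, g2ns_shift;

        calc_scale(host_hz, guest_hz, &h2g_mul, &h2g_shift);  /* host TSC -> guest TSC */
        calc_scale(host_hz, ns_hz, &h2ns_mul, &h2ns_shift);   /* host TSC -> ns, one step */
        calc_scale(guest_hz, ns_hz, &g2ns_mul, &g2ns_shift);  /* guest TSC -> ns, as pvclock does */

        uint64_t host_tsc = host_hz * 86400ULL;   /* roughly one day's worth of host ticks */

        uint64_t direct_ns = scale(host_tsc, h2ns_mul, h2ns_shift);
        uint64_t guest_tsc = scale(host_tsc, h2g_mul, h2g_shift);
        uint64_t guest_ns = scale(guest_tsc, g2ns_mul, g2ns_shift);

        printf("one-step host->ns : %llu ns\n", (unsigned long long)direct_ns);
        printf("host->guest->ns   : %llu ns\n", (unsigned long long)guest_ns);
        printf("difference        : %lld ns\n", (long long)(direct_ns - guest_ns));
        return 0;
}

Because each conversion truncates, and the two paths truncate differently,
the disagreement grows roughly in proportion to the elapsed TSC; that is
the cumulative drift which tsdrift.c measures against the real kvmclock.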
It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
that. The actual guest hv_clock is per-CPU, and *theoretically* each vCPU
could be running at a *different* frequency. But this patch is needed
anyway because...

The other issue with Xen timers was that the code would snapshot the host
CLOCK_MONOTONIC at some point in time and then, after a few interrupts
may have occurred and perhaps some preemption, would also read the
guest's kvmclock. Then it would proceed under the false assumption that
those two happened at the *same* time. Any time which *actually* elapsed
between reading the two clocks was introduced as inaccuracies in the time
at which the timer fired.

Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
host TSC just *once*, then use the returned TSC value to calculate the
kvmclock (making sure to do that the way the guest would, instead of
making the same mistake get_kvmclock_ns() does). Sadly, hrtimers based on
CLOCK_MONOTONIC_RAW are not supported, so Xen timers still have to use
CLOCK_MONOTONIC. In practice the difference between the two won't matter
over the timescales involved, as the *absolute* values don't matter; only
the delta does.

This does mean a new variant of kvm_get_time_and_clockread() is needed;
it is called kvm_get_monotonic_and_clockread() because that's what it
does.

Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode")
Signed-off-by: David Woodhouse
Reviewed-by: Paul Durrant
---
v3:
 • Rebase and repost.

v2:
 • Fall back to get_kvmclock_ns() if vcpu->arch.hv_clock isn't set up
   yet, with a big comment explaining why that's actually OK.
 • Fix do_monotonic() *not* to add the boot time offset.
 • Rename do_monotonic_raw() → do_kvmclock_base() and add a comment to
   make it clear that it *does* add the boot time offset. That was just
   left as a bear trap for the unwary developer, wasn't it?

 arch/x86/kvm/x86.c |  61 +++++++++++++++++++++---
 arch/x86/kvm/x86.h |   1 +
 arch/x86/kvm/xen.c | 121 ++++++++++++++++++++++++++++++++++-----------
 3 files changed, 149 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6beb6ceb28c1..8faf1bf9e8e3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2852,7 +2852,11 @@ static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
        return v * clock->mult;
 }
 
-static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
+/*
+ * As with get_kvmclock_base_ns(), this counts from boot time, at the
+ * frequency of CLOCK_MONOTONIC_RAW (hence adding gtod->offs_boot).
+ */
+static int do_kvmclock_base(s64 *t, u64 *tsc_timestamp)
 {
        struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
        unsigned long seq;
@@ -2871,6 +2875,29 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
        return mode;
 }
 
+/*
+ * This calculates CLOCK_MONOTONIC at the time of the TSC snapshot, with
+ * no boot time offset.
+ */
+static int do_monotonic(s64 *t, u64 *tsc_timestamp)
+{
+       struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+       unsigned long seq;
+       int mode;
+       u64 ns;
+
+       do {
+               seq = read_seqcount_begin(&gtod->seq);
+               ns = gtod->clock.base_cycles;
+               ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
+               ns >>= gtod->clock.shift;
+               ns += ktime_to_ns(gtod->clock.offset);
+       } while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
+       *t = ns;
+
+       return mode;
+}
+
 static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
 {
        struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
@@ -2892,18 +2919,42 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
        return mode;
 }
 
-/* returns true if host is using TSC based clocksource */
+/*
+ * Calculates the kvmclock_base_ns (CLOCK_MONOTONIC_RAW + boot time) and
+ * reports the TSC value from which it did so. Returns true if host is
+ * using TSC based clocksource.
+ */
 static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
 {
        /* checked again under seqlock below */
        if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
                return false;
 
-       return gtod_is_based_on_tsc(do_monotonic_raw(kernel_ns,
-                                                     tsc_timestamp));
+       return gtod_is_based_on_tsc(do_kvmclock_base(kernel_ns,
+                                                    tsc_timestamp));
 }
 
-/* returns true if host is using TSC based clocksource */
+/*
+ * Calculates CLOCK_MONOTONIC and reports the TSC value from which it did
+ * so. Returns true if host is using TSC based clocksource.
+ */
+bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
+{
+       /* checked again under seqlock below */
+       if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
+               return false;
+
+       return gtod_is_based_on_tsc(do_monotonic(kernel_ns,
+                                                tsc_timestamp));
+}
+
+/*
+ * Calculates CLOCK_REALTIME and reports the TSC value from which it did
+ * so. Returns true if host is using TSC based clocksource.
+ *
+ * DO NOT USE this for anything related to migration. You want CLOCK_TAI
+ * for that.
+ */
 static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
                                            u64 *tsc_timestamp)
 {
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 5184fde1dc54..0d6af2a57af7 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -294,6 +294,7 @@ void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
 
 u64 get_kvmclock_ns(struct kvm *kvm);
 uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm);
+bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp);
 
 int kvm_read_guest_virt(struct kvm_vcpu *vcpu,
        gva_t addr, void *val, unsigned int bytes,
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 6667f01170f9..a28f60aa91fb 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -24,6 +24,7 @@
 #include <xen/interface/sched.h>
 
 #include <asm/xen/cpuid.h>
+#include <asm/pvclock.h>
 
 #include "cpuid.h"
 #include "trace.h"
@@ -149,8 +150,93 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
        return HRTIMER_NORESTART;
 }
 
-static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
+static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,
+                               bool linux_wa)
 {
+       uint64_t guest_now;
+       int64_t kernel_now, delta;
+
+       /*
+        * The guest provides the requested timeout in absolute nanoseconds
+        * of the KVM clock — as *it* sees it, based on the scaled TSC and
+        * the pvclock information provided by KVM.
+        *
+        * The kernel doesn't support hrtimers based on CLOCK_MONOTONIC_RAW,
+        * so use CLOCK_MONOTONIC. In the timescales covered by timers, the
+        * difference won't matter much as there is no cumulative effect.
+        *
+        * Calculate the time for some arbitrary point in time around "now"
+        * in terms of both kvmclock and CLOCK_MONOTONIC. Calculate the
+        * delta between the kvmclock "now" value and the guest's requested
+        * timeout, apply the "Linux workaround" described below, and add
+        * the resulting delta to the CLOCK_MONOTONIC "now" value, to get
+        * the absolute CLOCK_MONOTONIC time at which the timer should
+        * fire.
+        */
+       if (vcpu->arch.hv_clock.version && vcpu->kvm->arch.use_master_clock &&
+           static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+               uint64_t host_tsc, guest_tsc;
+
+               if (!IS_ENABLED(CONFIG_64BIT) ||
+                   !kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) {
+                       /*
+                        * Don't fall back to get_kvmclock_ns() because it's
+                        * broken; it has a systemic error in its results
+                        * because it scales directly from host TSC to
+                        * nanoseconds, and doesn't scale first to guest TSC
+                        * and *then* to nanoseconds as the guest does.
+                        *
+                        * There is a small error introduced here because time
+                        * continues to elapse between the ktime_get() and the
+                        * subsequent rdtsc(). But not the systemic drift due
+                        * to get_kvmclock_ns().
+                        */
+                       kernel_now = ktime_get(); /* This is CLOCK_MONOTONIC */
+                       host_tsc = rdtsc();
+               }
+
+               /* Calculate the guest kvmclock as the guest would do it. */
+               guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+               guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock,
+                                                 guest_tsc);
+       } else {
+               /*
+                * Without CONSTANT_TSC, get_kvmclock_ns() is the only option.
+                *
+                * Also if the guest PV clock hasn't been set up yet, as is
+                * likely to be the case during migration when the vCPU has
+                * not been run yet. It would be possible to calculate the
+                * scaling factors properly in that case but there's not much
+                * point in doing so. The get_kvmclock_ns() drift accumulates
+                * over time, so it's OK to use it at startup. Besides, on
+                * migration there's going to be a little bit of skew in the
+                * precise moment at which timers fire anyway. Often they'll
+                * be in the "past" by the time the VM is running again after
+                * migration.
+                */
+               guest_now = get_kvmclock_ns(vcpu->kvm);
+               kernel_now = ktime_get();
+       }
+
+       delta = guest_abs - guest_now;
+
+       /* Xen has a 'Linux workaround' in do_set_timer_op() which
+        * checks for negative absolute timeout values (caused by
+        * integer overflow), and for values about 13 days in the
+        * future (2^50ns) which would be caused by jiffies
+        * overflow. For those cases, it sets the timeout 100ms in
+        * the future (not *too* soon, since if a guest really did
+        * set a long timeout on purpose we don't want to keep
+        * churning CPU time by waking it up).
+        */
+       if (linux_wa) {
+               if (unlikely((int64_t)guest_abs < 0 ||
+                            (delta > 0 && (uint32_t) (delta >> 50) != 0))) {
+                       delta = 100 * NSEC_PER_MSEC;
+                       guest_abs = guest_now + delta;
+               }
+       }
+
        /*
         * Avoid races with the old timer firing. Checking timer_expires
         * to avoid calling hrtimer_cancel() will only have false positives
@@ -162,12 +248,11 @@ static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_
        atomic_set(&vcpu->arch.xen.timer_pending, 0);
        vcpu->arch.xen.timer_expires = guest_abs;
 
-       if (delta_ns <= 0) {
+       if (delta <= 0) {
                xen_timer_callback(&vcpu->arch.xen.timer);
        } else {
-               ktime_t ktime_now = ktime_get();
                hrtimer_start(&vcpu->arch.xen.timer,
-                             ktime_add_ns(ktime_now, delta_ns),
+                             ktime_add_ns(kernel_now, delta),
                              HRTIMER_MODE_ABS_HARD);
        }
 }
@@ -991,8 +1076,7 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
                /* Start the timer if the new value has a valid vector+expiry. */
                if (data->u.timer.port && data->u.timer.expires_ns)
                        kvm_xen_start_timer(vcpu, data->u.timer.expires_ns,
-                                           data->u.timer.expires_ns -
-                                           get_kvmclock_ns(vcpu->kvm));
+                                           false);
 
                r = 0;
                break;
@@ -1448,7 +1532,6 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
 {
        struct vcpu_set_singleshot_timer oneshot;
        struct x86_exception e;
-       s64 delta;
 
        if (!kvm_xen_timer_enabled(vcpu))
                return false;
@@ -1482,9 +1565,7 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
                        return true;
                }
 
-               /* A delta <= 0 results in an immediate callback, which is what we want */
-               delta = oneshot.timeout_abs_ns - get_kvmclock_ns(vcpu->kvm);
-               kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, delta);
+               kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, false);
                *r = 0;
                return true;
 
@@ -1508,25 +1589,7 @@ static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout,
                return false;
 
        if (timeout) {
-               uint64_t guest_now = get_kvmclock_ns(vcpu->kvm);
-               int64_t delta = timeout - guest_now;
-
-               /* Xen has a 'Linux workaround' in do_set_timer_op() which
-                * checks for negative absolute timeout values (caused by
-                * integer overflow), and for values about 13 days in the
-                * future (2^50ns) which would be caused by jiffies
-                * overflow. For those cases, it sets the timeout 100ms in
-                * the future (not *too* soon, since if a guest really did
-                * set a long timeout on purpose we don't want to keep
-                * churning CPU time by waking it up).
-                */
-               if (unlikely((int64_t)timeout < 0 ||
-                            (delta > 0 && (uint32_t) (delta >> 50) != 0))) {
-                       delta = 100 * NSEC_PER_MSEC;
-                       timeout = guest_now + delta;
-               }
-
-               kvm_xen_start_timer(vcpu, timeout, delta);
+               kvm_xen_start_timer(vcpu, timeout, true);
        } else {
                kvm_xen_stop_timer(vcpu);
        }