diff mbox series

KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init()

Message ID 7e0040f70c629d365e80d13b339a95e0affa6d61.camel@infradead.org (mailing list archive)
State New, archived
Headers show
Series KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init() | expand

Commit Message

David Woodhouse April 7, 2024, 1:15 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

The KVM clock is an interesting thing. It is defined as "nanoseconds
since the guest was created", but in practice it runs at two *different*
rates — or three different rates, if you count implementation bugs.

Definition A is that it runs synchronously with the CLOCK_MONOTONIC_RAW
of the host, with a delta of kvm->arch.kvmclock_offset.

But that version doesn't actually get used in the common case, where the
host has a reliable TSC and the guest TSCs are all running at the same
rate and in sync with each other, and kvm->arch.use_master_clock is set.

In that common case, definition B is used: There is a reference point in
time at kvm->arch.master_kernel_ns (again a CLOCK_MONOTONIC_RAW time),
and a corresponding host TSC value kvm->arch.master_cycle_now. This
fixed point in time is converted to guest units (the time offset by
kvmclock_offset and the TSC Value scaled and offset to be a guest TSC
value) and advertised to the guest in the pvclock structure. While in
this 'use_master_clock' mode, the fixed point in time never needs to be
changed, and the clock runs precisely in time with the guest TSC, at the
rate advertised in the pvclock structure.

The third definition C is implemented in kvm_get_wall_clock_epoch() and
__get_kvmclock(), using the master_cycle_now and master_kernel_ns fields
but converting the *host* TSC cycles directly to a value in nanoseconds
instead of scaling via the guest TSC.

One might naïvely think that all three definitions are identical, since
CLOCK_MONOTONIC_RAW is not skewed by NTP frequency corrections; all
three are just the result of counting the host TSC at a known frequency,
or the scaled guest TSC at a known precise fraction of the host's
frequency. The problem is with arithmetic precision, and the way that
frequency scaling is done in a division-free way by multiplying by a
scale factor, then shifting right. In practice, all three ways of
calculating the KVM clock will suffer a systemic drift from each other.

Definition C should simply be eliminated. Commit 451a707813ae ("KVM:
x86/xen: improve accuracy of Xen timers") worked around it for the
specific case of Xen timers, which are defined in terms of the KVM clock
and suffered from a continually increasing error in timer expiry times.

Definitions A and B do need to coexist, the former to handle the case
where the host or guest TSC is suboptimally configured. But KVM should
be more careful about switching between them, and the discontinuity in
guest time which could result.

In particular, KVM_REQ_MASTERCLOCK_UPDATE will take a new snapshot of
time as the reference in master_kernel_ns and master_cycle_now, yanking
the guest's clock back to match definition A at that moment.

There is no need to do such an update when a Xen guest populates the
shared_info page. This seems to have been a hangover from the very first
implementation of shared_info which automatically populated the
vcpu_info structures at their default locations, but even then it should
just have raised KVM_REQ_CLOCK_UPDATE on each vCPU instead of using
KVM_REQ_MASTERCLOCK_UPDATE. And now that userspace is expected to
explicitly set the vcpu_info even in its default locations, there's not
even any need for that either.

Fixes: 629b5348841a1 ("KVM: x86/xen: update wallclock region")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/xen.c | 2 --
 1 file changed, 2 deletions(-)

Comments

Paul Durrant April 8, 2024, 11:12 a.m. UTC | #1
On 07/04/2024 14:15, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The KVM clock is an interesting thing. It is defined as "nanoseconds
> since the guest was created", but in practice it runs at two *different*
> rates — or three different rates, if you count implementation bugs.
> 
> Definition A is that it runs synchronously with the CLOCK_MONOTONIC_RAW
> of the host, with a delta of kvm->arch.kvmclock_offset.
> 
> But that version doesn't actually get used in the common case, where the
> host has a reliable TSC and the guest TSCs are all running at the same
> rate and in sync with each other, and kvm->arch.use_master_clock is set.
> 
> In that common case, definition B is used: There is a reference point in
> time at kvm->arch.master_kernel_ns (again a CLOCK_MONOTONIC_RAW time),
> and a corresponding host TSC value kvm->arch.master_cycle_now. This
> fixed point in time is converted to guest units (the time offset by
> kvmclock_offset and the TSC Value scaled and offset to be a guest TSC
> value) and advertised to the guest in the pvclock structure. While in
> this 'use_master_clock' mode, the fixed point in time never needs to be
> changed, and the clock runs precisely in time with the guest TSC, at the
> rate advertised in the pvclock structure.
> 
> The third definition C is implemented in kvm_get_wall_clock_epoch() and
> __get_kvmclock(), using the master_cycle_now and master_kernel_ns fields
> but converting the *host* TSC cycles directly to a value in nanoseconds
> instead of scaling via the guest TSC.
> 
> One might naïvely think that all three definitions are identical, since
> CLOCK_MONOTONIC_RAW is not skewed by NTP frequency corrections; all
> three are just the result of counting the host TSC at a known frequency,
> or the scaled guest TSC at a known precise fraction of the host's
> frequency. The problem is with arithmetic precision, and the way that
> frequency scaling is done in a division-free way by multiplying by a
> scale factor, then shifting right. In practice, all three ways of
> calculating the KVM clock will suffer a systemic drift from each other.
> 
> Definition C should simply be eliminated. Commit 451a707813ae ("KVM:
> x86/xen: improve accuracy of Xen timers") worked around it for the
> specific case of Xen timers, which are defined in terms of the KVM clock
> and suffered from a continually increasing error in timer expiry times.
> 
> Definitions A and B do need to coexist, the former to handle the case
> where the host or guest TSC is suboptimally configured. But KVM should
> be more careful about switching between them, and the discontinuity in
> guest time which could result.
> 
> In particular, KVM_REQ_MASTERCLOCK_UPDATE will take a new snapshot of
> time as the reference in master_kernel_ns and master_cycle_now, yanking
> the guest's clock back to match definition A at that moment.
> 
> There is no need to do such an update when a Xen guest populates the
> shared_info page. This seems to have been a hangover from the very first
> implementation of shared_info which automatically populated the
> vcpu_info structures at their default locations, but even then it should
> just have raised KVM_REQ_CLOCK_UPDATE on each vCPU instead of using
> KVM_REQ_MASTERCLOCK_UPDATE. And now that userspace is expected to
> explicitly set the vcpu_info even in its default locations, there's not
> even any need for that either.
> 
> Fixes: 629b5348841a1 ("KVM: x86/xen: update wallclock region")
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/kvm/xen.c | 2 --
>   1 file changed, 2 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>
Sean Christopherson April 10, 2024, 12:33 a.m. UTC | #2
On Sun, Apr 07, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The KVM clock is an interesting thing. It is defined as "nanoseconds
> since the guest was created", but in practice it runs at two *different*
> rates — or three different rates, if you count implementation bugs.

LOL, nice.

> Definition C should simply be eliminated. Commit 451a707813ae ("KVM:
> x86/xen: improve accuracy of Xen timers") worked around it for the
> specific case of Xen timers, which are defined in terms of the KVM clock
> and suffered from a continually increasing error in timer expiry times.

IIUC, there should probably be a "But that's a problem for a different day" line
after this.  I.e. describing 'C' is purely for context, and removing the
KVM_REQ_MASTERCLOCK_UPDATE request doesn't move the needle on eliminating this
flaw, correct?
David Woodhouse April 10, 2024, 9:17 a.m. UTC | #3
On Tue, 2024-04-09 at 17:33 -0700, Sean Christopherson wrote:
> On Sun, Apr 07, 2024, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > The KVM clock is an interesting thing. It is defined as "nanoseconds
> > since the guest was created", but in practice it runs at two *different*
> > rates — or three different rates, if you count implementation bugs.
> 
> LOL, nice.
> 
> > Definition C should simply be eliminated. Commit 451a707813ae ("KVM:
> > x86/xen: improve accuracy of Xen timers") worked around it for the
> > specific case of Xen timers, which are defined in terms of the KVM clock
> > and suffered from a continually increasing error in timer expiry times.
> 
> IIUC, there should probably be a "But that's a problem for a different day" line
> after this.  I.e. describing 'C' is purely for context, and removing the
> KVM_REQ_MASTERCLOCK_UPDATE request doesn't move the needle on eliminating this
> flaw, correct?

Correct. I have spent a while over the last few months where I've had
spare cycles (no pun intended), trying to get a full handle on just how
hosed this all is. A few times I've started on a fix for one part of it
and then hit something *else* that I needed to fix first, which renders
my first work invalid.

So now I've attempted to just write it all down and understand it, and
then come up with a TODO list, which Jack is helping me with (at least
as far as the immediate customer pain on LU/LM is concerned).

I transposed the TODO part into text form in
https://lore.kernel.org/kvm/c12959cf6ca372569b4df10b2f9e272db1114ad1.camel@infradead.org/
but here's the rest of my notes. They're a bit raw, but at least I
think I understand most of it now. Apologies, it was in an internal
wiki-style thing so the best option for export is PDF.

Once we've *fixed* things to be saner, I'll attempt to do a cleaner
documentation of what's left, to live in the kernel tree.
diff mbox series

Patch

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index f65b35a05d91..5a83a8154b79 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -98,8 +98,6 @@  static int kvm_xen_shared_info_init(struct kvm *kvm)
 	wc->version = wc_version + 1;
 	read_unlock_irq(&gpc->lock);
 
-	kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
-
 out:
 	srcu_read_unlock(&kvm->srcu, idx);
 	return ret;