Message ID | 7e0040f70c629d365e80d13b339a95e0affa6d61.camel@infradead.org (mailing list archive) |
---|---|
State | New, archived |
Series | KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init() |
On 07/04/2024 14:15, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> The KVM clock is an interesting thing. It is defined as "nanoseconds
> since the guest was created", but in practice it runs at two *different*
> rates -- or three different rates, if you count implementation bugs.
>
> Definition A is that it runs synchronously with the CLOCK_MONOTONIC_RAW
> of the host, with a delta of kvm->arch.kvmclock_offset.
>
> But that version doesn't actually get used in the common case, where the
> host has a reliable TSC and the guest TSCs are all running at the same
> rate and in sync with each other, and kvm->arch.use_master_clock is set.
>
> In that common case, definition B is used: There is a reference point in
> time at kvm->arch.master_kernel_ns (again a CLOCK_MONOTONIC_RAW time),
> and a corresponding host TSC value kvm->arch.master_cycle_now. This
> fixed point in time is converted to guest units (the time offset by
> kvmclock_offset and the TSC Value scaled and offset to be a guest TSC
> value) and advertised to the guest in the pvclock structure. While in
> this 'use_master_clock' mode, the fixed point in time never needs to be
> changed, and the clock runs precisely in time with the guest TSC, at the
> rate advertised in the pvclock structure.
>
> The third definition C is implemented in kvm_get_wall_clock_epoch() and
> __get_kvmclock(), using the master_cycle_now and master_kernel_ns fields
> but converting the *host* TSC cycles directly to a value in nanoseconds
> instead of scaling via the guest TSC.
>
> One might naïvely think that all three definitions are identical, since
> CLOCK_MONOTONIC_RAW is not skewed by NTP frequency corrections; all
> three are just the result of counting the host TSC at a known frequency,
> or the scaled guest TSC at a known precise fraction of the host's
> frequency. The problem is with arithmetic precision, and the way that
> frequency scaling is done in a division-free way by multiplying by a
> scale factor, then shifting right. In practice, all three ways of
> calculating the KVM clock will suffer a systemic drift from each other.
>
> Definition C should simply be eliminated. Commit 451a707813ae ("KVM:
> x86/xen: improve accuracy of Xen timers") worked around it for the
> specific case of Xen timers, which are defined in terms of the KVM clock
> and suffered from a continually increasing error in timer expiry times.
>
> Definitions A and B do need to coexist, the former to handle the case
> where the host or guest TSC is suboptimally configured. But KVM should
> be more careful about switching between them, and the discontinuity in
> guest time which could result.
>
> In particular, KVM_REQ_MASTERCLOCK_UPDATE will take a new snapshot of
> time as the reference in master_kernel_ns and master_cycle_now, yanking
> the guest's clock back to match definition A at that moment.
>
> There is no need to do such an update when a Xen guest populates the
> shared_info page. This seems to have been a hangover from the very first
> implementation of shared_info which automatically populated the
> vcpu_info structures at their default locations, but even then it should
> just have raised KVM_REQ_CLOCK_UPDATE on each vCPU instead of using
> KVM_REQ_MASTERCLOCK_UPDATE. And now that userspace is expected to
> explicitly set the vcpu_info even in its default locations, there's not
> even any need for that either.
>
> Fixes: 629b5348841a1 ("KVM: x86/xen: update wallclock region")
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/kvm/xen.c | 2 --
>  1 file changed, 2 deletions(-)
>

Reviewed-by: Paul Durrant <paul@xen.org>
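
To make the precision argument in the commit message concrete, here is a minimal userspace sketch of how division-free scaling can make definitions B and C disagree. It is not the kernel code: `scale_delta()` is only modeled on the pvclock-style "shift, multiply by a 32-bit fraction, drop the low 32 bits" calculation, and `scale_tsc()`, `kvmclock_def_b()`, `kvmclock_def_c()`, the 2 GHz host / 3 GHz guest frequencies and the 32 fractional bits in the TSC ratio are all assumptions chosen to keep the arithmetic easy to check. It relies on the `unsigned __int128` extension available in GCC and Clang.

```c
#include <stdint.h>

/*
 * Division-free scaling: apply the shift, multiply by a 32-bit fixed-point
 * fraction, then shift right by 32. The right shift truncates.
 */
static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
}

/* Host TSC cycles -> guest TSC cycles; ratio has 32 fractional bits (assumed). */
static uint64_t scale_tsc(uint64_t host_cycles, uint64_t ratio)
{
	return (uint64_t)(((unsigned __int128)host_cycles * ratio) >> 32);
}

/* Hypothetical setup: 2 GHz host TSC, guest TSC scaled up to 3 GHz. */
#define HOST_MUL_FRAC	0x80000000u	/* 0.5 ns per host cycle, exact */
#define GUEST_RATIO	0x180000000ull	/* guest/host = 1.5, exact */
#define GUEST_MUL_FRAC	0x55555555u	/* 1/3 ns per guest cycle, truncated */

/* Definition B: scale host cycles to guest cycles, then guest cycles to ns. */
static uint64_t kvmclock_def_b(uint64_t host_cycles)
{
	return scale_delta(scale_tsc(host_cycles, GUEST_RATIO), GUEST_MUL_FRAC, 0);
}

/* Definition C: convert host cycles directly to ns. */
static uint64_t kvmclock_def_c(uint64_t host_cycles)
{
	return scale_delta(host_cycles, HOST_MUL_FRAC, 0);
}
```

With these assumed numbers, 1000 seconds of host cycles (2000000000000) yields 1000000000000 ns by the direct path and 999999999767 ns via the guest TSC. The 233 ns gap comes purely from the truncated 1/3 multiplier and grows linearly with elapsed time.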
On Sun, Apr 07, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> The KVM clock is an interesting thing. It is defined as "nanoseconds
> since the guest was created", but in practice it runs at two *different*
> rates -- or three different rates, if you count implementation bugs.

LOL, nice.

> Definition C should simply be eliminated. Commit 451a707813ae ("KVM:
> x86/xen: improve accuracy of Xen timers") worked around it for the
> specific case of Xen timers, which are defined in terms of the KVM clock
> and suffered from a continually increasing error in timer expiry times.

IIUC, there should probably be a "But that's a problem for a different day" line
after this. I.e. describing 'C' is purely for context, and removing the
KVM_REQ_MASTERCLOCK_UPDATE request doesn't move the needle on eliminating this
flaw, correct?
On Tue, 2024-04-09 at 17:33 -0700, Sean Christopherson wrote:
> On Sun, Apr 07, 2024, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> >
> > The KVM clock is an interesting thing. It is defined as "nanoseconds
> > since the guest was created", but in practice it runs at two *different*
> > rates -- or three different rates, if you count implementation bugs.
>
> LOL, nice.
>
> > Definition C should simply be eliminated. Commit 451a707813ae ("KVM:
> > x86/xen: improve accuracy of Xen timers") worked around it for the
> > specific case of Xen timers, which are defined in terms of the KVM clock
> > and suffered from a continually increasing error in timer expiry times.
>
> IIUC, there should probably be a "But that's a problem for a different day" line
> after this. I.e. describing 'C' is purely for context, and removing the
> KVM_REQ_MASTERCLOCK_UPDATE request doesn't move the needle on eliminating this
> flaw, correct?

Correct. I have spent a while over the last few months where I've had
spare cycles (no pun intended), trying to get a full handle on just how
hosed this all is. A few times I've started on a fix for one part of it
and then hit something *else* that I needed to fix first, which renders
my first work invalid.

So now I've attempted to just write it all down and understand it, and
then come up with a TODO list, which Jack is helping me with (at least
as far as the immediate customer pain on LU/LM is concerned).

I transposed the TODO part into text form in
https://lore.kernel.org/kvm/c12959cf6ca372569b4df10b2f9e272db1114ad1.camel@infradead.org/
but here's the rest of my notes. They're a bit raw, but at least I think
I understand most of it now.

Apologies, it was in an internal wiki-style thing so the best option
for export is PDF. Once we've *fixed* things to be saner, I'll attempt
to do a cleaner documentation of what's left, to live in the kernel
tree.
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index f65b35a05d91..5a83a8154b79 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -98,8 +98,6 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
 	wc->version = wc_version + 1;
 	read_unlock_irq(&gpc->lock);
 
-	kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
-
 out:
 	srcu_read_unlock(&kvm->srcu, idx);
 	return ret;
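
As a rough illustration of why the removed KVM_REQ_MASTERCLOCK_UPDATE request matters, the sketch below models the step a guest could observe when the reference point is re-snapshotted. It is a toy model under stated assumptions, not KVM's implementation: `struct master_clock`, `kvmclock_read()`, `masterclock_update()` and `jump_on_update()` are hypothetical names, the TSC is assumed to run at 3 GHz, kvmclock_offset is taken as zero, and the host's CLOCK_MONOTONIC_RAW is treated as exact. It uses the `unsigned __int128` extension from GCC and Clang.

```c
#include <stdint.h>

/* Truncating multiply-and-shift: (delta * mul_frac) >> 32. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac)
{
	return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
}

/* Hypothetical stand-in for the reference point kept in kvm->arch. */
struct master_clock {
	uint64_t master_kernel_ns;	/* CLOCK_MONOTONIC_RAW at the snapshot */
	uint64_t master_cycle_now;	/* host TSC at the snapshot */
};

/* Assumed 3 GHz TSC: 1/3 ns per cycle, truncated to 32 fractional bits. */
#define MUL_FRAC_3GHZ	0x55555555u

/* Definition B style read: reference time plus the scaled TSC delta. */
static uint64_t kvmclock_read(const struct master_clock *mc, uint64_t tsc)
{
	return mc->master_kernel_ns +
	       scale_delta(tsc - mc->master_cycle_now, MUL_FRAC_3GHZ);
}

/* In this model, a master clock update simply takes a fresh snapshot. */
static void masterclock_update(struct master_clock *mc,
			       uint64_t now_ns, uint64_t now_tsc)
{
	mc->master_kernel_ns = now_ns;
	mc->master_cycle_now = now_tsc;
}

/* Step in guest-visible time caused by re-snapshotting the reference. */
static int64_t jump_on_update(uint64_t elapsed_ns, uint64_t elapsed_cycles)
{
	struct master_clock mc = { 0, 0 };
	uint64_t before = kvmclock_read(&mc, elapsed_cycles);

	masterclock_update(&mc, elapsed_ns, elapsed_cycles);
	return (int64_t)(kvmclock_read(&mc, elapsed_cycles) - before);
}
```

In this model, `jump_on_update(1000000000000ULL, 3000000000000ULL)` returns 233: after 1000 seconds the TSC-derived clock has fallen 233 ns behind the host clock, and a fresh snapshot makes the guest clock step by that amount at once. Leaving the reference point alone, as the patch does for shared_info setup, avoids introducing that step.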