Message ID | 20170512141322.GC2173@potion (mailing list archive) |
---|---|
State | New, archived |
On 12/05/2017 16:13, Radim Krčmář wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 464da936c53d..8db1d09e59d7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4175,9 +4175,10 @@ long kvm_arch_vm_ioctl(struct file *filp,
> 			goto out;
> 
> 		r = 0;
> +		kvm_gen_update_masterclock(kvm);
> 		now_ns = get_kvmclock_ns(kvm);
> 		kvm->arch.kvmclock_offset += user_ns.clock - now_ns;
> -		kvm_gen_update_masterclock(kvm);
> +		kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
> 		break;
> 	}
> 	case KVM_GET_CLOCK: {

And then I think the first kvm_gen_update_masterclock should be modified
to skip sending the KVM_REQ_CLOCK_UPDATE request.

Paolo
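A minimal sketch of the follow-up Paolo suggests, assuming kvm_gen_update_masterclock() both refreshes the masterclock snapshot and queues KVM_REQ_CLOCK_UPDATE on every vCPU; the added boolean parameter and the simplified locking are illustrative only, not the actual upstream change:

```c
/*
 * Hypothetical variant: let the KVM_SET_CLOCK path refresh the
 * masterclock snapshot without immediately queueing per-vCPU clock
 * updates; the ioctl then adjusts kvmclock_offset and sends
 * KVM_REQ_CLOCK_UPDATE itself.  The real function also blocks guest
 * entries while the update is in progress; that is omitted here.
 */
static void kvm_gen_update_masterclock(struct kvm *kvm, bool request_update)
{
#ifdef CONFIG_X86_64
	int i;
	struct kvm_vcpu *vcpu;
	struct kvm_arch *ka = &kvm->arch;

	spin_lock(&ka->pvclock_gtod_sync_lock);
	/* Re-sample (master_kernel_ns, master_cycle_now) from the host. */
	pvclock_update_vm_gtod_copy(kvm);

	if (request_update)
		kvm_for_each_vcpu(i, vcpu, kvm)
			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

	spin_unlock(&ka->pvclock_gtod_sync_lock);
#endif
}
```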
On Fri, May 12, 2017 at 04:13:23PM +0200, Radim Krčmář wrote:
> 2017-05-11 12:39-0300, Marcelo Tosatti:
> > On Wed, May 10, 2017 at 08:04:31PM +0200, Radim Krčmář wrote:
> > > 2017-05-03 10:43-0300, Marcelo Tosatti:
> > > and the important fix for kvm master clock is the move of
> > > kvm_gen_update_masterclock() before we read the time.
> > > The rest is just a minor optimization that also ignores time since
> > > master_kernel_ns() and therefore pins user_ns.clock to a slightly
> > > earlier time.
> > >
> > > But all attention was given to the "minor optimization" -- have I missed
> > > something about the direct use of ka->master_kernel_ns?
> >
> > I haven't attempted to optimize anything. Not sure what you mean.
>
> I mean, why doesn't the patch look like this?
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 464da936c53d..8db1d09e59d7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4175,9 +4175,10 @@ long kvm_arch_vm_ioctl(struct file *filp,
> 			goto out;
> 
> 		r = 0;
> +		kvm_gen_update_masterclock(kvm);
> 		now_ns = get_kvmclock_ns(kvm);
> 		kvm->arch.kvmclock_offset += user_ns.clock - now_ns;
> -		kvm_gen_update_masterclock(kvm);
> +		kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
> 		break;

now_ns = ka->master_kernel_ns + kvmclock_offset_prev + (grdtsc() - ka->master_cycle_now)
kvmclock_offset += user_ns.clock - (ka->master_kernel_ns + kvmclock_offset_prev + (grdtsc() - ka->master_cycle_now))
kvmclock_offset = kvmclock_offset_prev + user_ns.clock - (ka->master_kernel_ns + kvmclock_offset_prev + (grdtsc() - ka->master_cycle_now))

In case of VM was just initialized before migration, kvmclock_offset_prev is -ktime_get_boot_ns()

kvmclock_offset = -ktime_get_boot_ns() + user_ns.clock - (ka->master_kernel_ns -ktime_get_boot_ns() + grdtsc() - ka->master_cycle_now)

But master_kernel_ns = ktime_get_boot_ns() + delta-between-vm-init-and-KVM_SET_CLOCK (AKA delta)
(the same one from VM init)

kvmclock_offset = -ktime_get_boot_ns() + user_ns.clock - (ktime_get_boot_ns() + delta + -ktime_get_boot_ns() + grdtsc() - ka->master_cycle_now)

kvmclock_offset = -ktime_get_boot_ns() + user_ns.clock - delta - grdtsc() + ka->master_cycle_now

But we don't want grdtsc() - ka->master_cycle_now in there.

Note: grdtsc() == guest read tsc.

Now with

+	kvm->arch.kvmclock_offset = user_ns.clock -
+				    ka->master_kernel_ns;

What happens is that guest clock starts counting, via kernel timekeeper,
at the moment kvm_get_time_and_clockread() runs. If you add grdtsc() -
ka->master_cycle_now in there, you are mindfully counting clock twice
(first: kernel timekeeper, second: the TSC between the (grdtsc() -
ka->master_cycle_now) in question.

+	kvm->arch.kvmclock_offset = -ktime_get_boot_ns() +user_ns.clock -delta

Note that (grdtsc() - ka->master_cycle_now) is susceptible to scheduling
etc.

Makes sense?
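To make the algebra above easy to check, here is a small self-contained simulation of the two offset computations being compared (not kernel code: all values are invented and grdtsc() - ka->master_cycle_now is modeled directly as elapsed nanoseconds):

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Hypothetical snapshot from kvm_get_time_and_clockread(), in ns. */
	const int64_t master_kernel_ns = 1000000000; /* host boot time at the sample */
	const int64_t tsc_since_sample = 3000;       /* grdtsc() - ka->master_cycle_now */
	const int64_t user_clock       = 5000000000; /* user_ns.clock from KVM_SET_CLOCK */
	const int64_t offset_prev      = -900000000; /* kvmclock_offset before the ioctl */

	/* Variant A (Radim's hunk): offset += user_ns.clock - get_kvmclock_ns(),
	 * where get_kvmclock_ns() = master_kernel_ns + offset_prev + tsc_since_sample. */
	const int64_t now_ns   = master_kernel_ns + offset_prev + tsc_since_sample;
	const int64_t offset_a = offset_prev + (user_clock - now_ns);

	/* Variant B (Marcelo's patch): offset = user_ns.clock - master_kernel_ns,
	 * with no TSC-delta term at all. */
	const int64_t offset_b = user_clock - master_kernel_ns;

	printf("offset_a = %lld ns\n", (long long)offset_a);
	printf("offset_b = %lld ns\n", (long long)offset_b);
	printf("offset_b - offset_a = %lld ns (exactly the grdtsc() - master_cycle_now term)\n",
	       (long long)(offset_b - offset_a));
	return 0;
}
```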
2017-05-12 12:31-0300, Marcelo Tosatti:
> On Fri, May 12, 2017 at 04:13:23PM +0200, Radim Krčmář wrote:
> > 2017-05-11 12:39-0300, Marcelo Tosatti:
> > > On Wed, May 10, 2017 at 08:04:31PM +0200, Radim Krčmář wrote:
> > > > 2017-05-03 10:43-0300, Marcelo Tosatti:
> > > > and the important fix for kvm master clock is the move of
> > > > kvm_gen_update_masterclock() before we read the time.
> > > > The rest is just a minor optimization that also ignores time since
> > > > master_kernel_ns() and therefore pins user_ns.clock to a slightly
> > > > earlier time.
> > > >
> > > > But all attention was given to the "minor optimization" -- have I missed
> > > > something about the direct use of ka->master_kernel_ns?
> > >
> > > I haven't attempted to optimize anything. Not sure what you mean.
> >
> > I mean, why doesn't the patch look like this?
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 464da936c53d..8db1d09e59d7 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -4175,9 +4175,10 @@ long kvm_arch_vm_ioctl(struct file *filp,
> > 			goto out;
> > 
> > 		r = 0;
> > +		kvm_gen_update_masterclock(kvm);
> > 		now_ns = get_kvmclock_ns(kvm);
> > 		kvm->arch.kvmclock_offset += user_ns.clock - now_ns;
> > -		kvm_gen_update_masterclock(kvm);
> > +		kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
> > 		break;
>
> now_ns = ka->master_kernel_ns + kvmclock_offset_prev + (grdtsc() - ka->master_cycle_now)
> kvmclock_offset += user_ns.clock - (ka->master_kernel_ns + kvmclock_offset_prev + (grdtsc() - ka->master_cycle_now))
> kvmclock_offset = kvmclock_offset_prev + user_ns.clock - (ka->master_kernel_ns + kvmclock_offset_prev + (grdtsc() - ka->master_cycle_now))
>
> In case of VM was just initialized before migration, kvmclock_offset_prev is -ktime_get_boot_ns()
>
> kvmclock_offset = -ktime_get_boot_ns() + user_ns.clock - (ka->master_kernel_ns -ktime_get_boot_ns() + grdtsc() - ka->master_cycle_now)
>
> But master_kernel_ns = ktime_get_boot_ns() + delta-between-vm-init-and-KVM_SET_CLOCK (AKA delta)
> (the same one from VM init)
>
> kvmclock_offset = -ktime_get_boot_ns() + user_ns.clock - (ktime_get_boot_ns() + delta + -ktime_get_boot_ns() + grdtsc() - ka->master_cycle_now)
>
> kvmclock_offset = -ktime_get_boot_ns() + user_ns.clock - delta - grdtsc() + ka->master_cycle_now
>
> But we don't want grdtsc() - ka->master_cycle_now in there.
>
> Note: grdtsc() == guest read tsc.
>
> Now with
>
> +	kvm->arch.kvmclock_offset = user_ns.clock -
> +				    ka->master_kernel_ns;
>
> What happens is that guest clock starts counting, via kernel timekeeper,
> at the moment kvm_get_time_and_clockread() runs. If you add grdtsc() -
> ka->master_cycle_now in there, you are mindfully counting clock twice
> (first: kernel timekeeper, second: the TSC between the (grdtsc() -
> ka->master_cycle_now) in question.
>
> +	kvm->arch.kvmclock_offset = -ktime_get_boot_ns() +user_ns.clock -delta
>
> Note that (grdtsc() - ka->master_cycle_now) is susceptible to scheduling
> etc.
>
> Makes sense?

Yes.  The simpler code starts the kvmclock a bit later, but both are
correct -- anything within KVM_SET_CLOCK runtime is.

If we care about accuracy, then we should let userspace provide a
(kernel timestamp, kvm timestamp) pair, so the value of kvmclock can
really be controlled.

Adding ugly optimizations to work around shortcomings of the API is
going the wrong way ...
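For concreteness, the (kernel timestamp, kvm timestamp) pair mentioned above could look something like the sketch below; the struct, its field names, and the offset formula are purely hypothetical and not an existing KVM ABI:

```c
#include <linux/types.h>

/*
 * Hypothetical KVM_SET_CLOCK payload: userspace passes the desired
 * kvmclock value together with the host timestamp at which it computed
 * that value, so the kernel can compensate for however long the ioctl
 * took to be handled.
 */
struct kvm_clock_pair {
	__u64 kvmclock_ns;   /* desired guest clock value ...              */
	__u64 host_boot_ns;  /* ... as of this host CLOCK_BOOTTIME reading */
};

/*
 * The handler would then conceptually set
 *
 *	kvmclock_offset = kvmclock_ns - host_boot_ns
 *
 * so that a guest read at host boot time T returns
 *
 *	T + kvmclock_offset = kvmclock_ns + (T - host_boot_ns),
 *
 * i.e. the requested value plus exactly the host time elapsed since
 * userspace sampled it, independent of scheduling delays in between.
 */
```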
On Fri, May 12, 2017 at 07:37:12PM +0200, Radim Krčmář wrote:
> 2017-05-12 12:31-0300, Marcelo Tosatti:
> > On Fri, May 12, 2017 at 04:13:23PM +0200, Radim Krčmář wrote:
> > > 2017-05-11 12:39-0300, Marcelo Tosatti:
> > > > On Wed, May 10, 2017 at 08:04:31PM +0200, Radim Krčmář wrote:
> > > > > 2017-05-03 10:43-0300, Marcelo Tosatti:
> > > > > and the important fix for kvm master clock is the move of
> > > > > kvm_gen_update_masterclock() before we read the time.
> > > > > The rest is just a minor optimization that also ignores time since
> > > > > master_kernel_ns() and therefore pins user_ns.clock to a slightly
> > > > > earlier time.
> > > > >
> > > > > But all attention was given to the "minor optimization" -- have I missed
> > > > > something about the direct use of ka->master_kernel_ns?
> > > >
> > > > I haven't attempted to optimize anything. Not sure what you mean.
> > >
> > > I mean, why doesn't the patch look like this?
> > >
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 464da936c53d..8db1d09e59d7 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -4175,9 +4175,10 @@ long kvm_arch_vm_ioctl(struct file *filp,
> > > 			goto out;
> > > 
> > > 		r = 0;
> > > +		kvm_gen_update_masterclock(kvm);
> > > 		now_ns = get_kvmclock_ns(kvm);
> > > 		kvm->arch.kvmclock_offset += user_ns.clock - now_ns;
> > > -		kvm_gen_update_masterclock(kvm);
> > > +		kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
> > > 		break;
> >
> > now_ns = ka->master_kernel_ns + kvmclock_offset_prev + (grdtsc() - ka->master_cycle_now)
> > kvmclock_offset += user_ns.clock - (ka->master_kernel_ns + kvmclock_offset_prev + (grdtsc() - ka->master_cycle_now))
> > kvmclock_offset = kvmclock_offset_prev + user_ns.clock - (ka->master_kernel_ns + kvmclock_offset_prev + (grdtsc() - ka->master_cycle_now))
> >
> > In case of VM was just initialized before migration, kvmclock_offset_prev is -ktime_get_boot_ns()
> >
> > kvmclock_offset = -ktime_get_boot_ns() + user_ns.clock - (ka->master_kernel_ns -ktime_get_boot_ns() + grdtsc() - ka->master_cycle_now)
> >
> > But master_kernel_ns = ktime_get_boot_ns() + delta-between-vm-init-and-KVM_SET_CLOCK (AKA delta)
> > (the same one from VM init)
> >
> > kvmclock_offset = -ktime_get_boot_ns() + user_ns.clock - (ktime_get_boot_ns() + delta + -ktime_get_boot_ns() + grdtsc() - ka->master_cycle_now)
> >
> > kvmclock_offset = -ktime_get_boot_ns() + user_ns.clock - delta - grdtsc() + ka->master_cycle_now
> >
> > But we don't want grdtsc() - ka->master_cycle_now in there.
> >
> > Note: grdtsc() == guest read tsc.
> >
> > Now with
> >
> > +	kvm->arch.kvmclock_offset = user_ns.clock -
> > +				    ka->master_kernel_ns;
> >
> > What happens is that guest clock starts counting, via kernel timekeeper,
> > at the moment kvm_get_time_and_clockread() runs. If you add grdtsc() -
> > ka->master_cycle_now in there, you are mindfully counting clock twice
> > (first: kernel timekeeper, second: the TSC between the (grdtsc() -
> > ka->master_cycle_now) in question.
> >
> > +	kvm->arch.kvmclock_offset = -ktime_get_boot_ns() +user_ns.clock -delta
> >
> > Note that (grdtsc() - ka->master_cycle_now) is susceptible to scheduling
> > etc.
> >
> > Makes sense?
>
> Yes.  The simpler code starts the kvmclock a bit later, but both are
> correct -- anything within KVM_SET_CLOCK runtime is.

No the simpler code is not correct. You count time with two clocks for a
small period of time.

KVM_SET_CLOCK means: set the guest clock to the passed value and start
counting it from there.

With the simple fix, KVM_SET_CLOCK does: set the guest clock to the
passed value, but also add the delta between kvm_get_time_and_clockread()
and get_kvmclock_ns().

Which is variable, due to scheduling.

So it is just wrong.

> If we care about accuracy, then we should let userspace provide a
> (kernel timestamp, kvm timestamp) pair, so the value of kvmclock can
> really be controlled.

I suppose something else has to be done: the control of the clock,
from whatever userspace is using to measure passage of time,
to TSC, has to be done in kernel.

But lets see if that is really necessary when the QEMU
PTP/CLOCK_MONOTONIC delta stuff is done (working on it).

In the meantime, do you have anything against this patch? I depend on
it for the work above.

> Adding ugly optimizations to work around shortcomings of the API is
> going the wrong way ...

What optimization you refer to?
2017-05-13 00:46-0300, Marcelo Tosatti:
> On Fri, May 12, 2017 at 07:37:12PM +0200, Radim Krčmář wrote:
> > 2017-05-12 12:31-0300, Marcelo Tosatti:
> > > Now with
> > >
> > > +	kvm->arch.kvmclock_offset = user_ns.clock -
> > > +				    ka->master_kernel_ns;
> > >
> > > What happens is that guest clock starts counting, via kernel timekeeper,
> > > at the moment kvm_get_time_and_clockread() runs. If you add grdtsc() -
> > > ka->master_cycle_now in there, you are mindfully counting clock twice
> > > (first: kernel timekeeper, second: the TSC between the (grdtsc() -
> > > ka->master_cycle_now) in question.
> > >
> > > +	kvm->arch.kvmclock_offset = -ktime_get_boot_ns() +user_ns.clock -delta
> > >
> > > Note that (grdtsc() - ka->master_cycle_now) is susceptible to scheduling
> > > etc.
> > >
> > > Makes sense?
> >
> > Yes.  The simpler code starts the kvmclock a bit later, but both are
> > correct -- anything within KVM_SET_CLOCK runtime is.
>
> No the simpler code is not correct. You count time with two clocks for a
> small period of time.

A clock that counts kernel-nanoseconds is instantly replaced by a clock
that counts masterclock-nanoseconds, not incorrect by itself.

The simpler code will get the same kvmclock_offset as your code where
kvm_get_time_and_clockread() is called a bit later.
The distribution of resulting kvm_offsets will differ, but they must
both be correct or both incorrect, because they are already off-mark.

> KVM_SET_CLOCK means: set the guest clock to the passed value and start
> counting it from there.

Which is exactly what both versions do.

> With the simple fix, KVM_SET_CLOCK does: set the guest clock to the
> passed value, but also add the delta between kvm_get_time_and_clockread()
> and get_kvmclock_ns().
>
> Which is variable, due to scheduling.

Yes.

> So it is just wrong.

It makes the matter slightly worse by adding some execution time, but
the whole interface is "just wrong" even without that: we already have
the variability of the time between userspace's decision on
user_ns.clock value and kvm_get_time_and_clockread().

>> If we care about accuracy, then we should let userspace provide a
>> (kernel timestamp, kvm timestamp) pair, so the value of kvmclock can
>> really be controlled.
>
> I suppose something else has to be done: the control of the clock,
> from whatever userspace is using to measure passage of time,
> to TSC, has to be done in kernel.

We agree, just worded it differently.

> But lets see if that is really necessary when the QEMU
> PTP/CLOCK_MONOTONIC delta stuff is done (working on it).

Right.

> In the meantime, do you have anything against this patch? I depend on
> it for the work above.

The reasoning provided with the patch does not explain why

* kvmclock_offset must be adjusted so that
* user_ns.clock = master_kernel_ns + kvmclock_offset

Please explain why it "must".  I assert that it does not have to be.

If we agree that it is not necessary, then it is an optimization and I'd
like numbers to show that we are getting something that balances the
obfuscation; and why do we want it if we don't care about the better
solution (discussed above).

> I depend on
> it for the work above.

Describing how other code couldn't work without this is great reason,
but again, please be specific -- what difference it make?

>> Adding ugly optimizations to work around shortcomings of the API is
>> going the wrong way ...
>
> What optimization you refer to?

I refer to everything on top of the second hunk I posted.

Thanks.
On Mon, May 15, 2017 at 06:19:57PM +0200, Radim Krčmář wrote:
> 2017-05-13 00:46-0300, Marcelo Tosatti:
> > On Fri, May 12, 2017 at 07:37:12PM +0200, Radim Krčmář wrote:
> > > 2017-05-12 12:31-0300, Marcelo Tosatti:
> > > > Now with
> > > >
> > > > +	kvm->arch.kvmclock_offset = user_ns.clock -
> > > > +				    ka->master_kernel_ns;
> > > >
> > > > What happens is that guest clock starts counting, via kernel timekeeper,
> > > > at the moment kvm_get_time_and_clockread() runs. If you add grdtsc() -
> > > > ka->master_cycle_now in there, you are mindfully counting clock twice
> > > > (first: kernel timekeeper, second: the TSC between the (grdtsc() -
> > > > ka->master_cycle_now) in question.
> > > >
> > > > +	kvm->arch.kvmclock_offset = -ktime_get_boot_ns() +user_ns.clock -delta
> > > >
> > > > Note that (grdtsc() - ka->master_cycle_now) is susceptible to scheduling
> > > > etc.
> > > >
> > > > Makes sense?
> > >
> > > Yes.  The simpler code starts the kvmclock a bit later, but both are
> > > correct -- anything within KVM_SET_CLOCK runtime is.
> >
> > No the simpler code is not correct. You count time with two clocks for a
> > small period of time.
>
> A clock that counts kernel-nanoseconds is instantly replaced by a clock
> that counts masterclock-nanoseconds, not incorrect by itself.
>
> The simpler code will get the same kvmclock_offset as your code where
> kvm_get_time_and_clockread() is called a bit later.
> The distribution of resulting kvm_offsets will differ, but they must
> both be correct or both incorrect, because they are already off-mark.
>
> > KVM_SET_CLOCK means: set the guest clock to the passed value and start
> > counting it from there.
>
> Which is exactly what both versions do.
>
> > With the simple fix, KVM_SET_CLOCK does: set the guest clock to the
> > passed value, but also add the delta between kvm_get_time_and_clockread()
> > and get_kvmclock_ns().
> >
> > Which is variable, due to scheduling.
>
> Yes.
>
> > So it is just wrong.
>
> It makes the matter slightly worse by adding some execution time, but
> the whole interface is "just wrong" even without that: we already have
> the variability of the time between userspace's decision on
> user_ns.clock value and kvm_get_time_and_clockread().
>
> >> If we care about accuracy, then we should let userspace provide a
> >> (kernel timestamp, kvm timestamp) pair, so the value of kvmclock can
> >> really be controlled.
> >
> > I suppose something else has to be done: the control of the clock,
> > from whatever userspace is using to measure passage of time,
> > to TSC, has to be done in kernel.
>
> We agree, just worded it differently.
>
> > But lets see if that is really necessary when the QEMU
> > PTP/CLOCK_MONOTONIC delta stuff is done (working on it).
>
> Right.
>
> > In the meantime, do you have anything against this patch? I depend on
> > it for the work above.
>
> The reasoning provided with the patch does not explain why
>
> * kvmclock_offset must be adjusted so that
> * user_ns.clock = master_kernel_ns + kvmclock_offset
>
> Please explain why it "must".  I assert that it does not have to be.
>
> If we agree that it is not necessary, then it is an optimization and I'd
> like numbers to show that we are getting something that balances the
> obfuscation; and why do we want it if we don't care about the better
> solution (discussed above).
>
> > I depend on
> > it for the work above.
>
> Describing how other code couldn't work without this is great reason,
> but again, please be specific -- what difference it make?
>
> >> Adding ugly optimizations to work around shortcomings of the API is
> >> going the wrong way ...
> >
> > What optimization you refer to?
>
> I refer to everything on top of the second hunk I posted.
>
> Thanks.

Actually you are right, your patch is fine because the length of time
between kvm_get_time_and_clockread() and get_kvmclock_ns(kvm)
is compensated by

	- grdtsc() + ka->master_cycle_now = - (+grdtsc() - ka->master_cycle_now)

Which is the length of time between kvm_get_time_and_clockread() and
get_kvmclock_ns(kvm).

Its much cleaner indeed. Can you please apply it?

Thanks.
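The cancellation described in the last message can be checked with another tiny stand-alone example (hypothetical nanosecond values; the TSC delta is again written directly in ns):

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const int64_t master_kernel_ns = 1000000000; /* from kvm_gen_update_masterclock() */
	const int64_t user_clock       = 5000000000; /* user_ns.clock */
	const int64_t offset_prev      = -900000000;
	const int64_t delay_ns         = 3000;       /* grdtsc() - master_cycle_now when
	                                                get_kvmclock_ns() runs */

	/* Radim's ordering: now_ns includes the delay ... */
	const int64_t now_ns     = master_kernel_ns + offset_prev + delay_ns;
	/* ... and the "+=" removes both offset_prev and that delay again: */
	const int64_t offset_new = offset_prev + user_clock - now_ns;

	/* A guest read 'later_ns' after the masterclock sample then sees: */
	const int64_t later_ns    = 10000;
	const int64_t guest_clock = master_kernel_ns + later_ns + offset_new;

	/* Prints 7000 = later_ns - delay_ns: the guest clock equals user_ns.clock
	 * plus the time elapsed since get_kvmclock_ns() was called. */
	printf("guest_clock - user_ns.clock = %lld ns\n",
	       (long long)(guest_clock - user_clock));
	return 0;
}
```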
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 464da936c53d..8db1d09e59d7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4175,9 +4175,10 @@ long kvm_arch_vm_ioctl(struct file *filp,
 			goto out;
 
 		r = 0;
+		kvm_gen_update_masterclock(kvm);
 		now_ns = get_kvmclock_ns(kvm);
 		kvm->arch.kvmclock_offset += user_ns.clock - now_ns;
-		kvm_gen_update_masterclock(kvm);
+		kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
 		break;
 	}
 	case KVM_GET_CLOCK: {
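For context, this is roughly how the KVM_SET_CLOCK case of kvm_arch_vm_ioctl() reads with the hunk applied; the lines outside the hunk are reconstructed from the surrounding 4.11-era code and may differ in detail:

```c
	case KVM_SET_CLOCK: {
		struct kvm_clock_data user_ns;
		u64 now_ns;

		r = -EFAULT;
		if (copy_from_user(&user_ns, argp, sizeof(user_ns)))
			goto out;

		r = -EINVAL;
		if (user_ns.flags)
			goto out;

		r = 0;
		/*
		 * Refresh the masterclock snapshot first, so the offset below
		 * is computed against the same (master_kernel_ns,
		 * master_cycle_now) pair the guest will use, then ask every
		 * vCPU to reload its pvclock area.
		 */
		kvm_gen_update_masterclock(kvm);
		now_ns = get_kvmclock_ns(kvm);
		kvm->arch.kvmclock_offset += user_ns.clock - now_ns;
		kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
		break;
	}
```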