Message ID | 20230202165950.483430-1-sveith@amazon.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute | expand |
Please don't send vN+1 In-Reply-To vN, the threading messes up lore and other tooling, and becomes really problematic for humans when N gets large. On Thu, Feb 02, 2023, Simon Veith wrote: > In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to > preserve the TSC value and apply a known offset would require > duplicating the TSC scaling computations in userspace to account for > frequency differences between source and destination TSCs. > > Hence, if userspace wants to set the TSC to some known value without > having to deal with TSC scaling, and while also being resilient against > scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_VALUE are > suitable options. Requiring userspace to handle certain aspects of TSC scaling doesn't seem particularly onerous, at least not relative to all the other time insanity. In other words, why should KVM take on more complexity and a mostly-redundant uAPI?
On Wed, 2023-03-15 at 12:57 -0700, Sean Christopherson wrote: > > On Thu, Feb 02, 2023, Simon Veith wrote: > > In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to > > preserve the TSC value and apply a known offset would require > > duplicating the TSC scaling computations in userspace to account for > > frequency differences between source and destination TSCs. > > > > Hence, if userspace wants to set the TSC to some known value without > > having to deal with TSC scaling, and while also being resilient against > > scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_VALUE are > > suitable options. > > Requiring userspace to handle certain aspects of TSC scaling doesn't seem > particularly onerous, at least not relative to all the other time insanity. In > other words, why should KVM take on more complexity and a mostly-redundant uAPI? Because it offers userspace a simple and easy to use API. You get the time information, you put the time information. Now, if you want to *document* the actual calculations that userspace would be forced to do just to preserve the existing relationship between the guest TSC and the time of day / host TSC (for LM / LU respectively).... well, you can try, but it exceeded my "if it needs documenting this much, fix it first" threshold. Does userspace even *have* what it needs to do this *right*?
On 3/15/23 20:57, Sean Christopherson wrote: >> In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to >> preserve the TSC value and apply a known offset would require >> duplicating the TSC scaling computations in userspace to account for >> frequency differences between source and destination TSCs. >> >> Hence, if userspace wants to set the TSC to some known value without >> having to deal with TSC scaling, and while also being resilient against >> scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_VALUE are >> suitable options. > > Requiring userspace to handle certain aspects of TSC scaling doesn't seem > particularly onerous, at least not relative to all the other time insanity. In > other words, why should KVM take on more complexity and a mostly-redundant uAPI? Yeah, it seems like the problem is that KVM_GET_CLOCK return host unscaled TSC units (which was done because the guest TSC frequency is at least in theory per-CPU, and KVM_GET_CLOCK is a vm ioctl)? Perhaps it's more important (uAPI-wise) for KVM to return the precise guest/host TSC ratio via a vcpu device attribute? Paolo
On Fri, 2023-03-24 at 12:22 +0100, Paolo Bonzini wrote: > On 3/15/23 20:57, Sean Christopherson wrote: > > > In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to > > > preserve the TSC value and apply a known offset would require > > > duplicating the TSC scaling computations in userspace to account for > > > frequency differences between source and destination TSCs. > > > > > > Hence, if userspace wants to set the TSC to some known value without > > > having to deal with TSC scaling, and while also being resilient against > > > scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_VALUE are > > > suitable options. > > > > Requiring userspace to handle certain aspects of TSC scaling doesn't seem > > particularly onerous, at least not relative to all the other time insanity. In > > other words, why should KVM take on more complexity and a mostly-redundant uAPI? > > Yeah, it seems like the problem is that KVM_GET_CLOCK return host > unscaled TSC units (which was done because the guest TSC frequency is at > least in theory per-CPU, and KVM_GET_CLOCK is a vm ioctl)? > > Perhaps it's more important (uAPI-wise) for KVM to return the precise > guest/host TSC ratio via a vcpu device attribute? > My criteria for this are that in the case of a live update (serialize guest, kexec, resume guest on precisely the same hardware a few milliseconds of steal time later), the guest clocks (KVM_CLOCK, TSC, etc) shall be *precisely* the same as before with no slop. The same offset, the same scaling. And ideally precisely the same values advertised in the pvclock data to the guest. In the case of live migration, we can't be cycle-accurate because of course we're limited to the accuracy of the NTP sync between hosts. But that is the *only* inaccuracy we shall incur, because we can express clocks in terms of each other, e.g. <KVM clock was X at time of day Y> and then e.g. <TSC was W at KVM clock X> is unchanged from before. It's OK to expect userspace to do some calculation, as long as we never expect userspace to magically perform calculations based on some concept of "now", and to call kernel APIs, without any actual time elapsing while it does so. I said it's OK to expect userspace to do *some* calculation. But that should be clearly documented, *and* when we document it, that documentation shouldn't codify too much of the kernel's internal relationships between clocks, and shouldn't make us ashamed to be kernel engineers. We tried documenting it, in https://lore.kernel.org/all/20220316045308.2313184-1-oupton@google.com/ I don't quite know how to summarise that thread, other than "it's too broken; let's fix it first and *then* document it". But if it can be done, I'm happy for someone to fix the documentation in a way which describes how to meet the above criteria using the existing kernel APIs. And then perhaps we can make a decision about just how ashamed of ourselves we should be, and whether we want to provide a better, easier API for userspace to use.
On Fri, 2023-03-24 at 12:22 +0100, Paolo Bonzini wrote: > On 3/15/23 20:57, Sean Christopherson wrote: > > > In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to > > > preserve the TSC value and apply a known offset would require > > > duplicating the TSC scaling computations in userspace to account for > > > frequency differences between source and destination TSCs. > > > > > > Hence, if userspace wants to set the TSC to some known value without > > > having to deal with TSC scaling, and while also being resilient against > > > scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_VALUE are > > > suitable options. > > > > Requiring userspace to handle certain aspects of TSC scaling doesn't seem > > particularly onerous, at least not relative to all the other time insanity. In > > other words, why should KVM take on more complexity and a mostly-redundant uAPI? > > Yeah, it seems like the problem is that KVM_GET_CLOCK return host > unscaled TSC units (which was done because the guest TSC frequency is at > least in theory per-CPU, and KVM_GET_CLOCK is a vm ioctl)? > > Perhaps it's more important (uAPI-wise) for KVM to return the precise > guest/host TSC ratio via a vcpu device attribute? Perhaps. Although that really does suck as a userspace experience. See the patch I just sent.
diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst index 31f14ec4a65b..240a3646947c 100644 --- a/Documentation/virt/kvm/devices/vcpu.rst +++ b/Documentation/virt/kvm/devices/vcpu.rst @@ -265,3 +265,25 @@ From the destination VMM process: 7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the respective value derived in the previous step. + +4.2 ATTRIBUTE: KVM_VCPU_TSC_VALUE + +:Parameters: kvm_device_attr.addr points to a struct kvm_vcpu_tsc_value + +Returns: + + ======= ====================================== + -EFAULT Error reading/writing the provided + parameter address. + -ENXIO Attribute not supported + ======= ====================================== + +Gets or sets a matched pair of guest TSC value and KVM clock time point. + +When setting the TSC value through this attribute, a corresponding KVM clock +reference time point (as retrieved by KVM_GET_CLOCK in the clock field) must be +provided. + +The actual TSC value written will be adjusted based on the time that has +elapsed since the provided reference time point, taking TSC scaling into +account. diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index e48deab8901d..f99bdb959b54 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -528,5 +528,12 @@ struct kvm_pmu_event_filter { /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */ #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */ #define KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */ +#define KVM_VCPU_TSC_VALUE 1 /* attribute for the TSC value */ + +/* for KVM_VCPU_TSC_VALUE */ +struct kvm_vcpu_tsc_value { + __u64 tsc_val; + __u64 kvm_ns; +}; #endif /* _ASM_X86_KVM_H */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index da4bbd043a7b..b174200c909b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2655,7 +2655,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc, kvm_track_tsc_matching(vcpu); } -static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data) +static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data, u64 *kvm_ns) { struct kvm *kvm = vcpu->kvm; u64 offset, ns, elapsed; @@ -2664,12 +2664,24 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data) bool synchronizing = false; raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags); - offset = kvm_compute_l1_tsc_offset(vcpu, data); ns = get_kvmclock_base_ns(); + + if (kvm_ns) { + /* + * kvm_ns is the KVM clock reference time point at which this + * TSC value was correct. Use this time point to compensate for + * any delays that have been incurred since that TSC value was + * valid. + */ + s64 delta_ns = ns + vcpu->kvm->arch.kvmclock_offset - *kvm_ns; + data += nsec_to_cycles(vcpu, (u64)delta_ns); + } + + offset = kvm_compute_l1_tsc_offset(vcpu, data); elapsed = ns - kvm->arch.last_tsc_nsec; if (vcpu->arch.virtual_tsc_khz) { - if (data == 0) { + if (data == 0 && !kvm_ns) { /* * detection of vcpu initialization -- need to sync * with other vCPUs. This particularly helps to keep @@ -3672,7 +3684,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) break; case MSR_IA32_TSC: if (msr_info->host_initiated) { - kvm_synchronize_tsc(vcpu, data); + kvm_synchronize_tsc(vcpu, data, NULL); } else { u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset; adjust_tsc_offset_guest(vcpu, adj); @@ -5375,6 +5387,7 @@ static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu, switch (attr->attr) { case KVM_VCPU_TSC_OFFSET: + case KVM_VCPU_TSC_VALUE: r = 0; break; default: @@ -5400,6 +5413,32 @@ static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu, break; r = 0; break; + case KVM_VCPU_TSC_VALUE: { + struct kvm_vcpu_tsc_value __user *tsc_value_arg; + struct kvm_vcpu_tsc_value tsc_value; + struct kvm_clock_data kvm_clock; + u64 host_tsc, guest_tsc, ratio, offset; + + get_kvmclock(vcpu->kvm, &kvm_clock); + if (kvm_clock.flags & KVM_CLOCK_HOST_TSC) + host_tsc = kvm_clock.host_tsc; + else + host_tsc = rdtsc(); + + ratio = vcpu->arch.l1_tsc_scaling_ratio; + offset = vcpu->arch.l1_tsc_offset; + guest_tsc = kvm_scale_tsc(host_tsc, ratio) + offset; + + tsc_value.kvm_ns = kvm_clock.clock; + tsc_value.tsc_val = guest_tsc; + + tsc_value_arg = (struct kvm_vcpu_tsc_value __user *)uaddr; + r = -EFAULT; + if (copy_to_user(tsc_value_arg, &tsc_value, sizeof(tsc_value))) + break; + r = 0; + break; + } default: r = -ENXIO; } @@ -5442,6 +5481,20 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu, r = 0; break; } + case KVM_VCPU_TSC_VALUE: { + struct kvm_vcpu_tsc_value __user *tsc_value_arg; + struct kvm_vcpu_tsc_value tsc_value; + + tsc_value_arg = (struct kvm_vcpu_tsc_value __user *)uaddr; + r = -EFAULT; + if (copy_from_user(&tsc_value, tsc_value_arg, sizeof(tsc_value))) + break; + + kvm_synchronize_tsc(vcpu, tsc_value.tsc_val, &tsc_value.kvm_ns); + + r = 0; + break; + } default: r = -ENXIO; } @@ -11668,7 +11721,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) if (mutex_lock_killable(&vcpu->mutex)) return; vcpu_load(vcpu); - kvm_synchronize_tsc(vcpu, 0); + kvm_synchronize_tsc(vcpu, 0, NULL); vcpu_put(vcpu); /* poll control enabled by default */ diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h index e48deab8901d..f99bdb959b54 100644 --- a/tools/arch/x86/include/uapi/asm/kvm.h +++ b/tools/arch/x86/include/uapi/asm/kvm.h @@ -528,5 +528,12 @@ struct kvm_pmu_event_filter { /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */ #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */ #define KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */ +#define KVM_VCPU_TSC_VALUE 1 /* attribute for the TSC value */ + +/* for KVM_VCPU_TSC_VALUE */ +struct kvm_vcpu_tsc_value { + __u64 tsc_val; + __u64 kvm_ns; +}; #endif /* _ASM_X86_KVM_H */