Message ID | 20201130133559.233242-2-mlevitsk@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | RFC: Precise TSC migration |
On 30/11/20 14:35, Maxim Levitsky wrote:
> +		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> +			tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
> +			tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
> +		}

This is mostly useful for userspace that doesn't disable the quirk, right?

> +		kvm_get_walltime(&wall_nsec, &host_tsc);
> +		diff = wall_nsec - tsc_state.nsec;
> +
> +		if (diff < 0 || tsc_state.nsec == 0)
> +			diff = 0;
> +

diff < 0 should be okay.  Also why the nsec == 0 special case?  What about
using a flag instead?

Paolo
On Mon, 2020-11-30 at 15:33 +0100, Paolo Bonzini wrote:
> On 30/11/20 14:35, Maxim Levitsky wrote:
> > +		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> > +			tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
> > +			tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
> > +		}
>
> This is mostly useful for userspace that doesn't disable the quirk, right?

Isn't this the opposite? If I understand the original proposal correctly,
the reason we include TSC_ADJUST in the new ioctl is that we would like to
disable the special KVM behavior (that is, disable the quirk), which would
mean that the TSC will jump on a regular host-initiated TSC_ADJUST write.

To avoid this, userspace would set TSC_ADJUST through this new interface.

Note that I haven't yet disabled the quirk in the patches I posted to QEMU,
because we need some infrastructure to manage which quirks we want to
disable in QEMU (that is, KVM_ENABLE_CAP is, as I understand it, write-only,
so I can't just disable KVM_X86_QUIRK_TSC_HOST_ACCESS in the code that
enables x-precise-tsc in QEMU).

> > +		kvm_get_walltime(&wall_nsec, &host_tsc);
> > +		diff = wall_nsec - tsc_state.nsec;
> > +
> > +		if (diff < 0 || tsc_state.nsec == 0)
> > +			diff = 0;
> > +
>
> diff < 0 should be okay.  Also why the nsec == 0 special case?  What about
> using a flag instead?

In theory diff < 0 should indeed be okay (though this would mean that the
target has an unsynchronized clock, or that time travel happened). However,
nsec_to_cycles takes an unsigned number, pvclock_scale_delta also takes an
unsigned number, and so on, so I was thinking: why bother with this case?

There is still a (mostly?) theoretical issue if on some vCPUs 'diff' is
positive and on some it is negative (this can happen if the migration was
really fast and the target has a clock that is only slightly ahead of the
source).

Do you think that this is an issue? If so I can make the code work with
signed numbers.
About nsec == 0: this is to allow using this API for VM initialization
(that is, to call KVM_SET_TSC_PRECISE prior to doing KVM_GET_TSC_PRECISE).
This simplifies the QEMU code, and I don't think it makes the API much worse.

Best regards,
	Maxim Levitsky

>
> Paolo
>
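The unsigned-arithmetic concern Maxim raises can be seen in a small standalone sketch. This is not KVM code; `clamp_elapsed_ns` is a hypothetical helper that mirrors the clamping in the patch, shown here only to illustrate why a negative wall-clock delta is dangerous with unsigned types:

```c
#include <stdint.h>

/* Hypothetical helper mirroring the patch's clamping: if the destination's
 * CLOCK_REALTIME is behind the saved timestamp, treat the elapsed time as
 * zero instead of letting an unsigned subtraction wrap to a huge value,
 * and treat a zero saved timestamp as "restore the TSC exactly". */
static int64_t clamp_elapsed_ns(uint64_t wall_nsec, uint64_t saved_nsec)
{
	int64_t diff = (int64_t)(wall_nsec - saved_nsec);

	if (diff < 0 || saved_nsec == 0)
		diff = 0;
	return diff;
}
```

Without the signed intermediate, `wall_nsec - saved_nsec` for a destination clock that is 5 ns behind the source would produce 2^64 - 5 cycles' worth of "elapsed time" rather than a small negative number.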
On 30/11/20 16:58, Maxim Levitsky wrote:
> > This is mostly useful for userspace that doesn't disable the quirk, right?
>
> Isn't this the opposite? If I understand the original proposal correctly,
> the reason that we include the TSC_ADJUST in the new ioctl, is that
> we would like to disable the special kvm behavior (that is disable the quirk),
> which would mean that tsc will jump on regular host initiated TSC_ADJUST write.
>
> To avoid this, userspace would set TSC_ADJUST through this new interface.

Yeah, that makes sense.  It removes the need to think "I have to set TSC
adjust before TSC".

> Do you think that this is an issue? If so I can make the code work with
> signed numbers.

Not sure if it's an issue, but I prefer to make the API "less surprising"
for userspace.  Who knows how it will be used.

> About nsec == 0, this is to allow to use this API for VM initialization.
> (That is to call KVM_SET_TSC_PRECISE prior to doing KVM_GET_TSC_PRECISE)

I prefer using flags for that purpose.

Paolo
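Paolo's suggestion of a flag rather than overloading nsec == 0 could look like the sketch below. The `KVM_TSC_STATE_TIMESTAMP_VALID` flag is purely hypothetical (it is not in the posted patch); the point is that an explicit flag cannot be confused with a legitimate zero timestamp:

```c
#include <stdint.h>

#define KVM_TSC_STATE_TSC_ADJUST_VALID 1
/* Hypothetical flag sketching Paolo's suggestion, not part of the patch. */
#define KVM_TSC_STATE_TIMESTAMP_VALID  2

struct kvm_tsc_state {
	uint32_t flags;
	uint64_t nsec;
	uint64_t tsc;
	uint64_t tsc_adjust;
};

/* With an explicit flag, "restore the TSC exactly" no longer piggybacks on
 * nsec == 0, so userspace intent is unambiguous even if nsec happens to be
 * zero for some other reason. */
static int timestamp_is_valid(const struct kvm_tsc_state *s)
{
	return (s->flags & KVM_TSC_STATE_TIMESTAMP_VALID) != 0;
}
```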
On Mon, Nov 30 2020 at 15:35, Maxim Levitsky wrote:
> +	struct kvm_tsc_info {
> +		__u32 flags;
> +		__u64 nsec;
> +		__u64 tsc;
> +		__u64 tsc_adjust;
> +	};
> +
> +flags values for ``struct kvm_tsc_info``:
> +
> +``KVM_TSC_INFO_TSC_ADJUST_VALID``
> +
> +  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value

Why exposing TSC_ADJUST at all? Just because?

Thanks,

        tglx
On Tue, 2020-12-01 at 20:43 +0100, Thomas Gleixner wrote:
> On Mon, Nov 30 2020 at 15:35, Maxim Levitsky wrote:
> > +	struct kvm_tsc_info {
> > +		__u32 flags;
> > +		__u64 nsec;
> > +		__u64 tsc;
> > +		__u64 tsc_adjust;
> > +	};
> > +
> > +flags values for ``struct kvm_tsc_info``:
> > +
> > +``KVM_TSC_INFO_TSC_ADJUST_VALID``
> > +
> > +  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
>
> Why exposing TSC_ADJUST at all? Just because?

It's because we want to reduce the number of cases where KVM's MSR
read/write behavior differs between guest and host (e.g. QEMU) writes.

TSC and TSC_ADJUST are tied at the architectural level, such that changing
one changes the other. However, for migration to work, we must be able to
set each one separately.

Currently, KVM does this by turning a host write to TSC_ADJUST into a
special case that bypasses the actual TSC adjustment and just sets this MSR.

The next patch in this series will allow disabling this special behavior,
making a host TSC_ADJUST write work the same way as in the guest.
Therefore, to still allow setting TSC_ADJUST and TSC independently after
migration, this ioctl will be used instead.

Best regards,
	Maxim Levitsky

> Thanks,
>
> tglx
>
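The architectural tie Maxim describes can be modeled in a few lines. This is a simplified illustration of the IA32_TSC / IA32_TSC_ADJUST relationship (per the Intel SDM), not KVM code: the TSC software reads is the free-running counter plus TSC_ADJUST, so writing TSC_ADJUST moves the observed TSC, and writing the TSC is folded back into TSC_ADJUST:

```c
#include <stdint.h>

/* Simplified model: observed TSC = free-running counter + TSC_ADJUST. */
struct tsc_model {
	uint64_t hw_counter;   /* free-running, never written directly */
	int64_t  tsc_adjust;
};

static uint64_t read_tsc(const struct tsc_model *m)
{
	return m->hw_counter + (uint64_t)m->tsc_adjust;
}

/* Writing TSC_ADJUST shifts the observed TSC by (val - old adjust). */
static void write_tsc_adjust(struct tsc_model *m, int64_t val)
{
	m->tsc_adjust = val;
}

/* Writing the TSC is turned into an equivalent change of TSC_ADJUST. */
static void write_tsc(struct tsc_model *m, uint64_t val)
{
	m->tsc_adjust += (int64_t)(val - read_tsc(m));
}
```

This is exactly why a migration interface needs to set both values independently: restoring one through a normal MSR write would disturb the other.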
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 70254eaa5229f..2f04aa8ecf119 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4826,6 +4826,62 @@ If a vCPU is in running state while this ioctl is invoked, the vCPU may
 experience inconsistent filtering behavior on MSR accesses.
 
+4.127 KVM_GET_TSC_STATE
+----------------------------
+
+:Capability: KVM_CAP_PRECISE_TSC
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_tsc_info
+:Returns: 0 on success, < 0 on error
+
+::
+
+  #define KVM_TSC_INFO_TSC_ADJUST_VALID 1
+  struct kvm_tsc_info {
+	__u32 flags;
+	__u64 nsec;
+	__u64 tsc;
+	__u64 tsc_adjust;
+  };
+
+flags values for ``struct kvm_tsc_info``:
+
+``KVM_TSC_INFO_TSC_ADJUST_VALID``
+
+  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
+
+This ioctl allows user space to read guest's IA32_TSC, IA32_TSC_ADJUST,
+and the current value of host CLOCK_REALTIME clock in nanoseconds since unix
+epoch.
+
+
+4.128 KVM_SET_TSC_STATE
+----------------------------
+
+:Capability: KVM_CAP_PRECISE_TSC
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_tsc_info
+:Returns: 0 on success, < 0 on error
+
+::
+
+This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
+from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
+
+KVM will adjust the guest TSC value by the time that passed between
+CLOCK_REALTIME timestamp saved in the struct and current value of
+CLOCK_REALTIME, and set guest's TSC to the new value.
+
+TSC_ADJUST is restored as is if KVM_TSC_INFO_TSC_ADJUST_VALID is set.
+
+It is assumed that either both ioctls will be run on the same machine,
+or that source and destination machines have synchronized clocks.
+
+As a special case, it is allowed to leave the timestamp in the struct to zero,
+in which case it will be ignored and the TSC will be restored exactly.
+
 5. The kvm_run structure
 ========================
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a3fdc16cfd6f3..4f0ae9cb14b8a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2438,6 +2438,21 @@ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
 	return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
 }
 
+
+static void kvm_get_walltime(u64 *walltime_ns, u64 *host_tsc)
+{
+	struct timespec64 ts;
+
+	if (kvm_get_walltime_and_clockread(&ts, host_tsc)) {
+		*walltime_ns = timespec64_to_ns(&ts);
+		return;
+	}
+
+	*host_tsc = rdtsc();
+	*walltime_ns = ktime_get_real_ns();
+}
+
 #endif
 
 /*
@@ -3757,6 +3772,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_X86_USER_SPACE_MSR:
 	case KVM_CAP_X86_MSR_FILTER:
 	case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
+#ifdef CONFIG_X86_64
+	case KVM_CAP_PRECISE_TSC:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_SYNC_REGS:
@@ -4999,6 +5017,57 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 	case KVM_GET_SUPPORTED_HV_CPUID:
 		r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
 		break;
+#ifdef CONFIG_X86_64
+	case KVM_GET_TSC_STATE: {
+		struct kvm_tsc_state __user *user_tsc_state = argp;
+		struct kvm_tsc_state tsc_state;
+		u64 host_tsc;
+
+		memset(&tsc_state, 0, sizeof(tsc_state));
+
+		kvm_get_walltime(&tsc_state.nsec, &host_tsc);
+		tsc_state.tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+
+		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
+			tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
+			tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
+		}
+
+		r = -EFAULT;
+		if (copy_to_user(user_tsc_state, &tsc_state, sizeof(tsc_state)))
+			goto out;
+		r = 0;
+		break;
+	}
+	case KVM_SET_TSC_STATE: {
+		struct kvm_tsc_state __user *user_tsc_state = argp;
+		struct kvm_tsc_state tsc_state;
+
+		u64 host_tsc, wall_nsec;
+		s64 diff;
+		u64 new_guest_tsc, new_guest_tsc_offset;
+
+		r = -EFAULT;
+		if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
+			goto out;
+
+		kvm_get_walltime(&wall_nsec, &host_tsc);
+
+		diff = wall_nsec - tsc_state.nsec;
+
+		if (diff < 0 || tsc_state.nsec == 0)
+			diff = 0;
+
+		new_guest_tsc = tsc_state.tsc + nsec_to_cycles(vcpu, diff);
+		new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
+		kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
+
+		if (tsc_state.flags & KVM_TSC_STATE_TSC_ADJUST_VALID)
+			if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST))
+				vcpu->arch.ia32_tsc_adjust_msr = tsc_state.tsc_adjust;
+		r = 0;
+		break;
+	}
+#endif
 	default:
 		r = -EINVAL;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 886802b8ffba3..ee1bd5e7da964 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1056,6 +1056,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
 #define KVM_CAP_SYS_HYPERV_CPUID 191
 #define KVM_CAP_DIRTY_LOG_RING 192
+#define KVM_CAP_PRECISE_TSC 193
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1169,6 +1170,15 @@ struct kvm_clock_data {
 	__u32 pad[9];
 };
 
+
+#define KVM_TSC_STATE_TSC_ADJUST_VALID 1
+struct kvm_tsc_state {
+	__u32 flags;
+	__u64 nsec;
+	__u64 tsc;
+	__u64 tsc_adjust;
+};
+
 /* For KVM_CAP_SW_TLB */
 #define KVM_MMU_FSL_BOOKE_NOHV	0
 
@@ -1563,6 +1573,10 @@ struct kvm_pv_cmd {
 /* Available with KVM_CAP_DIRTY_LOG_RING */
 #define KVM_RESET_DIRTY_RINGS		_IO(KVMIO, 0xc7)
 
+/* Available with KVM_CAP_PRECISE_TSC*/
+#define KVM_SET_TSC_STATE	_IOW(KVMIO, 0xc8, struct kvm_tsc_state)
+#define KVM_GET_TSC_STATE	_IOR(KVMIO, 0xc9, struct kvm_tsc_state)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
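The arithmetic in the KVM_SET_TSC_STATE handler can be followed with plain numbers. This standalone sketch mirrors the restore path with a single fixed TSC frequency in place of the per-vCPU scaling the real code does via `nsec_to_cycles()` and `kvm_scale_tsc()`; the helper names here are illustrative, not KVM's:

```c
#include <stdint.h>

/* Cycles elapsed in `nsec` nanoseconds at a TSC frequency of `tsc_khz`:
 * cycles = nsec * (tsc_khz * 1000) / 1e9 = nsec * tsc_khz / 1e6 */
static uint64_t nsec_to_cycles_khz(uint64_t tsc_khz, uint64_t nsec)
{
	return nsec * tsc_khz / 1000000ULL;
}

/* Mirror of the restore path: advance the saved guest TSC by the wall-clock
 * time that passed since the save, then derive the TSC offset against the
 * current host TSC (the guest observes host_tsc + offset, modulo 2^64). */
static uint64_t restore_tsc_offset(uint64_t saved_tsc, uint64_t saved_nsec,
				   uint64_t wall_nsec, uint64_t host_tsc,
				   uint64_t tsc_khz)
{
	int64_t diff = (int64_t)(wall_nsec - saved_nsec);
	uint64_t new_guest_tsc;

	if (diff < 0 || saved_nsec == 0)
		diff = 0;

	new_guest_tsc = saved_tsc + nsec_to_cycles_khz(tsc_khz, (uint64_t)diff);
	return new_guest_tsc - host_tsc;
}
```

For example, at 1 GHz (tsc_khz = 1000000), one second of migration downtime advances the restored guest TSC by exactly 10^9 cycles, while a zero saved timestamp restores it unchanged.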
These two new ioctls allow to more precisely capture and restore the
guest's TSC state. Both ioctls are meant to be used to accurately migrate
the guest TSC even when there is significant downtime during the migration.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 Documentation/virt/kvm/api.rst | 56 +++++++++++++++++++++++++++
 arch/x86/kvm/x86.c             | 69 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h       | 14 +++++++
 3 files changed, 139 insertions(+)
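From userspace, the save/restore flow for the proposed interface might look like the sketch below. The struct and ioctl definitions are copied from the patch itself (this is an RFC, so they are not in distribution headers); `vcpu_fd` is assumed to be an open KVM vCPU file descriptor, and the wrapper functions are illustrative, not part of any real API:

```c
#include <stdint.h>
#include <sys/ioctl.h>

/* Definitions copied from the proposed patch; not yet in <linux/kvm.h>. */
#define KVMIO 0xAE
#define KVM_TSC_STATE_TSC_ADJUST_VALID 1

struct kvm_tsc_state {
	uint32_t flags;
	uint64_t nsec;
	uint64_t tsc;
	uint64_t tsc_adjust;
};

#define KVM_SET_TSC_STATE _IOW(KVMIO, 0xc8, struct kvm_tsc_state)
#define KVM_GET_TSC_STATE _IOR(KVMIO, 0xc9, struct kvm_tsc_state)

/* On the source: capture per-vCPU TSC state while the vCPU is paused,
 * then ship the struct to the destination with the rest of the VM state. */
static int save_tsc_state(int vcpu_fd, struct kvm_tsc_state *state)
{
	return ioctl(vcpu_fd, KVM_GET_TSC_STATE, state);
}

/* On the destination: KVM advances the saved TSC by the CLOCK_REALTIME
 * delta, so downtime is accounted for without userspace doing the math. */
static int load_tsc_state(int vcpu_fd, const struct kvm_tsc_state *state)
{
	return ioctl(vcpu_fd, KVM_SET_TSC_STATE, state);
}
```

Note that, as discussed in the thread, the same KVM_SET_TSC_STATE call can also be issued at VM creation (with a zero timestamp in this version of the patch) to set the initial TSC and TSC_ADJUST values.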