Message ID: 20230913072113.78885-1-likexu@tencent.com (mailing list archive)
State: New, archived
Series: [v5] KVM: x86/tsc: Don't sync TSC on the first write in state restoration
On Wed, 2023-09-13 at 15:21 +0800, Like Xu wrote:
> From: Like Xu <likexu@tencent.com>
>
> Add kvm->arch.user_set_tsc to avoid synchronization on the first write
> from userspace so as not to misconstrue state restoration after live
> migration as an attempt from userspace to synchronize. More precisely,
> the problem is that the sync code doesn't differentiate between userspace
> initializing the TSC and userspace attempting to synchronize the TSC.

That commit message definitely needs work. Oliver gave you some
verbiage which made a lot more sense to me. Let me try again...

================

[PATCH] KVM: x86/tsc: Don't sync user-written TSC against startup values

The legacy API for setting the TSC is fundamentally broken, and only
allows userspace to set a TSC "now", without any way to account for
time lost to preemption between the calculation of the value, and the
kernel eventually handling the ioctl.

To work around this we have had a hack which, if a TSC is set with a
value which is within a second's worth of a previous vCPU, assumes that
userspace actually intended them to be in sync and adjusts the newly-
written TSC value accordingly.

Thus, when a VMM restores a guest after suspend or migration using the
legacy API, the TSCs aren't necessarily *right*, but at least they're
in sync.

This trick falls down when restoring a guest which genuinely has been
running for less time than the 1 second of imprecision which we allow
for in the legacy API. On *creation* the first vCPU starts its TSC
counting from zero, and the subsequent vCPUs synchronize to that. But
then when the VMM tries to set the intended TSC value, because that's
within a second of what the last TSC synced to, it just adjusts it to
match that.

The correct answer is for the VMM not to use the legacy API of course.

But we can pile further hacks onto our existing hackish ABI, and
declare that the *first* value written by userspace (on any vCPU)
should not be subject to this 'correction' to make it sync up with
values that only come from the kernel's default vCPU creation.

To that end: Add a flag in kvm->arch.user_set_tsc, protected by
kvm->arch.tsc_write_lock, to record that a TSC for at least one vCPU in
this KVM *has* been set by userspace. Make the 1-second slop hack only
trigger if that flag is already set.

===================

I think you also need to set kvm->arch.user_set_tsc() in
kvm_arch_tsc_set_attr(), don't you?
> Reported-by: Yong He <alexyonghe@tencent.com> > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217423 > Suggested-by: Sean Christopherson <seanjc@google.com> > Suggested-by: Oliver Upton <oliver.upton@linux.dev> > Original-by: Sean Christopherson <seanjc@google.com> > Tested-by: Like Xu <likexu@tencent.com> > Signed-off-by: Like Xu <likexu@tencent.com> > --- > V4 -> V5 Changelog: > - Making kvm_synchronize_tsc(@data) a pointer and passing NULL; (Sean) > - Refine commit message in a more accurate way; (Sean) > V4: https://lore.kernel.org/kvm/20230801034524.64007-1-likexu@tencent.com/ > > arch/x86/include/asm/kvm_host.h | 1 + > arch/x86/kvm/x86.c | 25 ++++++++++++++++--------- > 2 files changed, 17 insertions(+), 9 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 1a4def36d5bb..9a7dfef9d32d 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -1324,6 +1324,7 @@ struct kvm_arch { > int nr_vcpus_matched_tsc; > > u32 default_tsc_khz; > + bool user_set_tsc; > > seqcount_raw_spinlock_t pvclock_sc; > bool use_master_clock; > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 6c9c81e82e65..0fef6ed69cbb 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -2714,8 +2714,9 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc, > kvm_track_tsc_matching(vcpu); > } > > -static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data) > +static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value) > { > + u64 data = user_value ? *user_value : 0; > struct kvm *kvm = vcpu->kvm; > u64 offset, ns, elapsed; > unsigned long flags; > @@ -2728,14 +2729,17 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data) > elapsed = ns - kvm->arch.last_tsc_nsec; > > if (vcpu->arch.virtual_tsc_khz) { > + /* > + * Force synchronization when creating or hotplugging a vCPU, > + * i.e. when the TSC value is '0', to help keep clocks stable. > + * If this is NOT a hotplug/creation case, skip synchronization > + * on the first write from userspace so as not to misconstrue > + * state restoration after live migration as an attempt from > + * userspace to synchronize. This comment isn't quite right; it wants to use some excerpt of the commit message I've suggested above. > + */ > if (data == 0) { > - /* > - * detection of vcpu initialization -- need to sync > - * with other vCPUs. This particularly helps to keep > - * kvm_clock stable after CPU hotplug > - */ > synchronizing = true; > - } else { > + } else if (kvm->arch.user_set_tsc) { > u64 tsc_exp = kvm->arch.last_tsc_write + > nsec_to_cycles(vcpu, elapsed); > u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL; > @@ -2749,6 +2753,9 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data) > } > } > > + if (user_value) > + kvm->arch.user_set_tsc = true; > + > /* > * For a reliable TSC, we can match TSC offsets, and for an unstable > * TSC, we add elapsed time in this computation. We could let the > @@ -3777,7 +3784,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > break; > case MSR_IA32_TSC: > if (msr_info->host_initiated) { > - kvm_synchronize_tsc(vcpu, data); > + kvm_synchronize_tsc(vcpu, &data); Userspace used to be able to write zero to force a sync. You've removed that ability from the ABI, and haven't even mentioned it. Must we? 
> } else { > u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset; > adjust_tsc_offset_guest(vcpu, adj); > @@ -11959,7 +11966,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) > if (mutex_lock_killable(&vcpu->mutex)) > return; > vcpu_load(vcpu); > - kvm_synchronize_tsc(vcpu, 0); > + kvm_synchronize_tsc(vcpu, NULL); > vcpu_put(vcpu); > > /* poll control enabled by default */ > > base-commit: 0bb80ecc33a8fb5a682236443c1e740d5c917d1d
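For readers following the "legacy API" discussion above, a rough, hypothetical userspace sketch of the two ways a VMM can restore a guest's TSC. The function names and error handling are illustrative only, not from this thread; the offset path assumes the vCPU attribute interface (KVM_CAP_VCPU_ATTRIBUTES with the KVM_VCPU_TSC_CTRL / KVM_VCPU_TSC_OFFSET attribute) is available.

#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

#define MSR_IA32_TSC	0x10

/*
 * Legacy path: write an absolute TSC value "now".  Any delay between
 * computing the value and KVM handling the ioctl is lost, which is why
 * the in-kernel 1-second "slop" heuristic exists at all.
 */
static int restore_tsc_legacy(int vcpu_fd, uint64_t tsc_value)
{
	struct {
		struct kvm_msrs hdr;
		struct kvm_msr_entry entry;
	} msrs;

	memset(&msrs, 0, sizeof(msrs));
	msrs.hdr.nmsrs = 1;
	msrs.entry.index = MSR_IA32_TSC;
	msrs.entry.data = tsc_value;

	return ioctl(vcpu_fd, KVM_SET_MSRS, &msrs);
}

/*
 * Non-legacy path: set the guest TSC offset directly via the
 * KVM_VCPU_TSC_CTRL attribute group, which is not subject to the
 * preemption race by construction.
 */
static int restore_tsc_offset(int vcpu_fd, int64_t offset)
{
	struct kvm_device_attr attr = {
		.group = KVM_VCPU_TSC_CTRL,
		.attr  = KVM_VCPU_TSC_OFFSET,
		.addr  = (uint64_t)(uintptr_t)&offset,
	};

	return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}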
On 13/9/2023 4:37 pm, David Woodhouse wrote: > On Wed, 2023-09-13 at 15:21 +0800, Like Xu wrote: >> From: Like Xu <likexu@tencent.com> >> >> Add kvm->arch.user_set_tsc to avoid synchronization on the first write >> from userspace so as not to misconstrue state restoration after live >> migration as an attempt from userspace to synchronize. More precisely, >> the problem is that the sync code doesn't differentiate between userspace >> initializing the TSC and userspace attempting to synchronize the TSC. > > > That commit message definitely needs work. Oliver gave you some > verbiage which made a lot more sense to me. Let me try again... Thank you for reviewing it. > > ================ > > [PATCH] KVM: x86/tsc: Don't sync user-written TSC against startup values > > The legacy API for setting the TSC is fundamentally broken, and only > allows userspace to set a TSC "now", without any way to account for > time lost to preemption between the calculation of the value, and the > kernel eventually handling the ioctl. > > To work around this we have had a hack which, if a TSC is set with a > value which is within a second's worth of a previous vCPU, assumes that > userspace actually intended them to be in sync and adjusts the newly- > written TSC value accordingly. > > Thus, when a VMM restores a guest after suspend or migration using the > legacy API, the TSCs aren't necessarily *right*, but at least they're > in sync. > > This trick falls down when restoring a guest which genuinely has been > running for less time than the 1 second of imprecision which we allow > for in the legacy API. On *creation* the first vCPU starts its TSC > counting from zero, and the subsequent vCPUs synchronize to that. But > then when the VMM tries to set the intended TSC value, because that's > within a second of what the last TSC synced to, it just adjusts it to > match that. > > The correct answer is for the VMM not to use the legacy API of course. > > But we can pile further hacks onto our existing hackish ABI, and > declare that the *first* value written by userspace (on any vCPU) > should not be subject to this 'correction' to make it sync up with > values that only from from the kernel's default vCPU creation. > > To that end: Add a flag in kvm->arch.user_set_tsc, protected by > kvm->arch.tsc_write_lock, to record that a TSC for at least one vCPU in > this KVM *has* been set by userspace. Make the 1-second slop hack only > trigger if that flag is already set. > > =================== > > I think you also need to set kvm->arch.user_set_tsc() in > kvm_arch_tsc_set_attr(), don't you? 
How about: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c55cc60769db..374965f66137 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5545,6 +5545,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu, tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset; ns = get_kvmclock_base_ns(); + kvm->arch.user_set_tsc = true; __kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched); raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags); > > >> Reported-by: Yong He <alexyonghe@tencent.com> >> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217423 >> Suggested-by: Sean Christopherson <seanjc@google.com> >> Suggested-by: Oliver Upton <oliver.upton@linux.dev> >> Original-by: Sean Christopherson <seanjc@google.com> >> Tested-by: Like Xu <likexu@tencent.com> >> Signed-off-by: Like Xu <likexu@tencent.com> >> --- >> V4 -> V5 Changelog: >> - Making kvm_synchronize_tsc(@data) a pointer and passing NULL; (Sean) >> - Refine commit message in a more accurate way; (Sean) >> V4: https://lore.kernel.org/kvm/20230801034524.64007-1-likexu@tencent.com/ >> >> arch/x86/include/asm/kvm_host.h | 1 + >> arch/x86/kvm/x86.c | 25 ++++++++++++++++--------- >> 2 files changed, 17 insertions(+), 9 deletions(-) >> >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h >> index 1a4def36d5bb..9a7dfef9d32d 100644 >> --- a/arch/x86/include/asm/kvm_host.h >> +++ b/arch/x86/include/asm/kvm_host.h >> @@ -1324,6 +1324,7 @@ struct kvm_arch { >> int nr_vcpus_matched_tsc; >> >> u32 default_tsc_khz; >> + bool user_set_tsc; >> >> seqcount_raw_spinlock_t pvclock_sc; >> bool use_master_clock; >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >> index 6c9c81e82e65..0fef6ed69cbb 100644 >> --- a/arch/x86/kvm/x86.c >> +++ b/arch/x86/kvm/x86.c >> @@ -2714,8 +2714,9 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc, >> kvm_track_tsc_matching(vcpu); >> } >> >> -static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data) >> +static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value) >> { >> + u64 data = user_value ? *user_value : 0; >> struct kvm *kvm = vcpu->kvm; >> u64 offset, ns, elapsed; >> unsigned long flags; >> @@ -2728,14 +2729,17 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data) >> elapsed = ns - kvm->arch.last_tsc_nsec; >> >> if (vcpu->arch.virtual_tsc_khz) { >> + /* >> + * Force synchronization when creating or hotplugging a vCPU, >> + * i.e. when the TSC value is '0', to help keep clocks stable. >> + * If this is NOT a hotplug/creation case, skip synchronization >> + * on the first write from userspace so as not to misconstrue >> + * state restoration after live migration as an attempt from >> + * userspace to synchronize. > > This comment isn't quite right; it wants to use some excerpt of the > commit message I've suggested above. How about: @@ -2735,20 +2735,34 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data) * kvm_clock stable after CPU hotplug */ synchronizing = true; - } else { + } else if (kvm->arch.user_set_tsc) { u64 tsc_exp = kvm->arch.last_tsc_write + nsec_to_cycles(vcpu, elapsed); u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL; /* - * Special case: TSC write with a small delta (1 second) - * of virtual cycle time against real time is - * interpreted as an attempt to synchronize the CPU. 
+ * Here lies UAPI baggage: user-initiated TSC write with + * a small delta (1 second) of virtual cycle time + * against real time is interpreted as an attempt to + * synchronize the CPU. + * + * This trick falls down when restoring a guest which genuinely + * has been running for less time than the 1 second of imprecision + * which we allow for in the legacy API. In this case, the first + * value written by userspace (on any vCPU) should not be subject + * to this 'correction' to make it sync up with values that only + * from from the kernel's default vCPU creation. Make the 1-second + * slop hack only trigger if flag is already set. + * + * The correct answer is for the VMM not to use the legacy API. */ synchronizing = data < tsc_exp + tsc_hz && data + tsc_hz > tsc_exp; } } > >> + */ >> if (data == 0) { >> - /* >> - * detection of vcpu initialization -- need to sync >> - * with other vCPUs. This particularly helps to keep >> - * kvm_clock stable after CPU hotplug >> - */ >> synchronizing = true; >> - } else { >> + } else if (kvm->arch.user_set_tsc) { >> u64 tsc_exp = kvm->arch.last_tsc_write + >> nsec_to_cycles(vcpu, elapsed); >> u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL; >> @@ -2749,6 +2753,9 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data) >> } >> } >> >> + if (user_value) >> + kvm->arch.user_set_tsc = true; >> + >> /* >> * For a reliable TSC, we can match TSC offsets, and for an unstable >> * TSC, we add elapsed time in this computation. We could let the >> @@ -3777,7 +3784,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) >> break; >> case MSR_IA32_TSC: >> if (msr_info->host_initiated) { >> - kvm_synchronize_tsc(vcpu, data); >> + kvm_synchronize_tsc(vcpu, &data); > > Userspace used to be able to write zero to force a sync. You've removed > that ability from the ABI, and haven't even mentioned it. Must we? Will continue to use "bool user_initiated" for lack of a better move. > >> } else { >> u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset; >> adjust_tsc_offset_guest(vcpu, adj); >> @@ -11959,7 +11966,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) >> if (mutex_lock_killable(&vcpu->mutex)) >> return; >> vcpu_load(vcpu); >> - kvm_synchronize_tsc(vcpu, 0); >> + kvm_synchronize_tsc(vcpu, NULL); >> vcpu_put(vcpu); >> >> /* poll control enabled by default */ >> >> base-commit: 0bb80ecc33a8fb5a682236443c1e740d5c917d1d >
On 13 September 2023 11:17:49 CEST, Like Xu <like.xu.linux@gmail.com> wrote: >> I think you also need to set kvm->arch.user_set_tsc() in >> kvm_arch_tsc_set_attr(), don't you? > >How about: > >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >index c55cc60769db..374965f66137 100644 >--- a/arch/x86/kvm/x86.c >+++ b/arch/x86/kvm/x86.c >@@ -5545,6 +5545,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu, > tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset; > ns = get_kvmclock_base_ns(); > >+ kvm->arch.user_set_tsc = true; > __kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched); > raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags); Yep, that looks good. >> This comment isn't quite right; it wants to use some excerpt of the >> commit message I've suggested above. > >How about: > >@@ -2735,20 +2735,34 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data) > * kvm_clock stable after CPU hotplug > */ > synchronizing = true; >- } else { >+ } else if (kvm->arch.user_set_tsc) { > u64 tsc_exp = kvm->arch.last_tsc_write + > nsec_to_cycles(vcpu, elapsed); > u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL; > /* >- * Special case: TSC write with a small delta (1 second) >- * of virtual cycle time against real time is >- * interpreted as an attempt to synchronize the CPU. >+ * Here lies UAPI baggage: user-initiated TSC write with >+ * a small delta (1 second) of virtual cycle time >+ * against real time is interpreted as an attempt to >+ * synchronize the CPU. Much better, thanks. But I don't much like "an attempt to synchronize the CPU". In my response to Sean I objected to that classification. Userspace is just *setting* the TSC. There is no dedicated intent to "synchronize" it. It just sets it, and the value just *might* happen to be in sync with another vCPU. It's just that our API is so fundamentally broken that it *can't* be in sync, so we "help" it a bit if it looks close. So maybe... Here lies UAPI baggage: when a user-initiated TSC write has a small delta (1 second) of virtual cycle time against the previously set vCPU, we assume that they were intended to be in sync and the delta was only due to the racy nature of the legacy API. >+ * This trick falls down when restoring a guest which genuinely >+ * has been running for less time than the 1 second of imprecision >+ * which we allow for in the legacy API. In this case, the first >+ * value written by userspace (on any vCPU) should not be subject >+ * to this 'correction' to make it sync up with values that only >+ * from from the kernel's default vCPU creation. Make the 1-second >+ * slop hack only trigger if flag is already set. >+ * >+ * The correct answer is for the VMM not to use the legacy API. >> Userspace used to be able to write zero to force a sync. You've removed >> that ability from the ABI, and haven't even mentioned it. Must we? > >Will continue to use "bool user_initiated" for lack of a better move. Why? Can't we treat an explicit zero write just the same as when the kernel does it?
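To make the "small delta (1 second) of virtual cycle time" concrete: for a guest with virtual_tsc_khz = 3,000,000 (a 3 GHz virtual TSC), tsc_hz = 3,000,000 * 1000 = 3,000,000,000 cycles, i.e. exactly one second's worth. A hypothetical, kernel-style helper (illustration only, not proposed code) mirroring the check in kvm_synchronize_tsc():

/*
 * tsc_exp is where the previously written TSC would have counted to by
 * now; tsc_hz is one second's worth of cycles at the virtual TSC rate.
 */
static inline bool tsc_within_one_second(u64 data, u64 tsc_exp, u64 tsc_hz)
{
	/*
	 * Equivalent to tsc_exp - tsc_hz < data < tsc_exp + tsc_hz, written
	 * this way to avoid underflow when tsc_exp is smaller than tsc_hz.
	 */
	return data < tsc_exp + tsc_hz && data + tsc_hz > tsc_exp;
}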
On 13/9/2023 5:30 pm, David Woodhouse wrote: > > > On 13 September 2023 11:17:49 CEST, Like Xu <like.xu.linux@gmail.com> wrote: >>> I think you also need to set kvm->arch.user_set_tsc() in >>> kvm_arch_tsc_set_attr(), don't you? >> >> How about: >> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >> index c55cc60769db..374965f66137 100644 >> --- a/arch/x86/kvm/x86.c >> +++ b/arch/x86/kvm/x86.c >> @@ -5545,6 +5545,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu, >> tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset; >> ns = get_kvmclock_base_ns(); >> >> + kvm->arch.user_set_tsc = true; >> __kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched); >> raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags); > > Yep, that looks good. > > >>> This comment isn't quite right; it wants to use some excerpt of the >>> commit message I've suggested above. >> >> How about: >> >> @@ -2735,20 +2735,34 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data) >> * kvm_clock stable after CPU hotplug >> */ >> synchronizing = true; >> - } else { >> + } else if (kvm->arch.user_set_tsc) { >> u64 tsc_exp = kvm->arch.last_tsc_write + >> nsec_to_cycles(vcpu, elapsed); >> u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL; >> /* >> - * Special case: TSC write with a small delta (1 second) >> - * of virtual cycle time against real time is >> - * interpreted as an attempt to synchronize the CPU. >> + * Here lies UAPI baggage: user-initiated TSC write with >> + * a small delta (1 second) of virtual cycle time >> + * against real time is interpreted as an attempt to >> + * synchronize the CPU. > > Much better, thanks. But I don't much like "an attempt to synchronize the CPU". > > In my response to Sean I objected to that classification. Userspace is just *setting* the TSC. There is no dedicated intent to "synchronize" it. It just sets it, and the value just *might* happen to be in sync with another vCPU. > > It's just that our API is so fundamentally broken that it *can't* be in sync, so we "help" it a bit if it looks close. > > So maybe... > > Here lies UAPI baggage: when a user-initiated TSC write has a small delta (1 second) of virtual cycle time against the previously set vCPU, we assume that they were intended to be in sync and the delta was only due to the racy nature of the legacy API. > >> + * This trick falls down when restoring a guest which genuinely >> + * has been running for less time than the 1 second of imprecision >> + * which we allow for in the legacy API. In this case, the first >> + * value written by userspace (on any vCPU) should not be subject >> + * to this 'correction' to make it sync up with values that only >> + * from from the kernel's default vCPU creation. Make the 1-second >> + * slop hack only trigger if flag is already set. >> + * >> + * The correct answer is for the VMM not to use the legacy API. > > > > >>> Userspace used to be able to write zero to force a sync. You've removed >>> that ability from the ABI, and haven't even mentioned it. Must we? >> >> Will continue to use "bool user_initiated" for lack of a better move. > > Why? Can't we treat an explicit zero write just the same as when the kernel does it? 
Not sure if it meets your simplified expectations:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c9c81e82e65..0f05cf90d636 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2735,20 +2735,35 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 			 * kvm_clock stable after CPU hotplug
 			 */
 			synchronizing = true;
-		} else {
+		} else if (!data || kvm->arch.user_set_tsc) {
 			u64 tsc_exp = kvm->arch.last_tsc_write +
 						nsec_to_cycles(vcpu, elapsed);
 			u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
 			/*
-			 * Special case: TSC write with a small delta (1 second)
-			 * of virtual cycle time against real time is
-			 * interpreted as an attempt to synchronize the CPU.
+			 * Here lies UAPI baggage: when a user-initiated TSC write has
+			 * a small delta (1 second) of virtual cycle time against the
+			 * previously set vCPU, we assume that they were intended to be
+			 * in sync and the delta was only due to the racy nature of the
+			 * legacy API.
+			 *
+			 * This trick falls down when restoring a guest which genuinely
+			 * has been running for less time than the 1 second of imprecision
+			 * which we allow for in the legacy API. In this case, the first
+			 * value written by userspace (on any vCPU) should not be subject
+			 * to this 'correction' to make it sync up with values that only
+			 * come from the kernel's default vCPU creation. Make the 1-second
+			 * slop hack only trigger if the flag is already set.
+			 *
+			 * The correct answer is for the VMM not to use the legacy API.
 			 */
 			synchronizing = data < tsc_exp + tsc_hz &&
 					data + tsc_hz > tsc_exp;
 		}
 	}
 
+	if (data)
+		kvm->arch.user_set_tsc = true;
+
 	/*
 	 * For a reliable TSC, we can match TSC offsets, and for an unstable
 	 * TSC, we add elapsed time in this computation. We could let the
@@ -5536,6 +5551,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
 	tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
 	ns = get_kvmclock_base_ns();
 
+	kvm->arch.user_set_tsc = true;
 	__kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
 	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
On 13 September 2023 11:43:56 CEST, Like Xu <like.xu.linux@gmail.com> wrote: >> Why? Can't we treat an explicit zero write just the same as when the kernel does it? > >Not sure if it meets your simplified expectations: Think that looks good, thanks. One minor nit... >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >index 6c9c81e82e65..0f05cf90d636 100644 >--- a/arch/x86/kvm/x86.c >+++ b/arch/x86/kvm/x86.c >@@ -2735,20 +2735,35 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data) > * kvm_clock stable after CPU hotplug > */ > synchronizing = true; >- } else { >+ } else if (!data || kvm->arch.user_set_tsc) { If data is zero here, won't the first if() case have been taken, and set synchronizing=true? So this is equivalent to "else if (kvm->arch.user_set_tsc)". (Which is fine and what what I intended). > u64 tsc_exp = kvm->arch.last_tsc_write + > nsec_to_cycles(vcpu, elapsed); > u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL; > /* >- * Special case: TSC write with a small delta (1 second) >- * of virtual cycle time against real time is >- * interpreted as an attempt to synchronize the CPU. >+ * Here lies UAPI baggage: when a user-initiated TSC write has >+ * a small delta (1 second) of virtual cycle time against the >+ * previously set vCPU, we assume that they were intended to be >+ * in sync and the delta was only due to the racy nature of the >+ * legacy API. >+ * >+ * This trick falls down when restoring a guest which genuinely >+ * has been running for less time than the 1 second of imprecision >+ * which we allow for in the legacy API. In this case, the first >+ * value written by userspace (on any vCPU) should not be subject >+ * to this 'correction' to make it sync up with values that only >+ * from from the kernel's default vCPU creation. Make the 1-second >+ * slop hack only trigger if flag is already set. >+ * >+ * The correct answer is for the VMM not to use the legacy API. > */ > synchronizing = data < tsc_exp + tsc_hz && > data + tsc_hz > tsc_exp; > } > } > >+ if (data) >+ kvm->arch.user_set_tsc = true; >+ > /* > * For a reliable TSC, we can match TSC offsets, and for an unstable > * TSC, we add elapsed time in this computation. We could let the >@@ -5536,6 +5551,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu, > tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset; > ns = get_kvmclock_base_ns(); > >+ kvm->arch.user_set_tsc = true; > __kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched); > raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags); > >
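Pulling the thread's conclusion together, a distilled, hypothetical sketch (illustration only, not the resulting patch; the helper name is invented) of the synchronization policy converged on above: an explicit zero write still forces a sync, and the 1-second slop only applies once userspace has already written a TSC. Setting kvm->arch.user_set_tsc itself happens in the callers, for non-zero user-initiated writes and unconditionally in kvm_arch_tsc_set_attr().

/*
 * 'data' is the TSC value being written; for in-kernel vCPU creation or
 * hotplug it is 0.  'elapsed' is nanoseconds since the last TSC write.
 */
static bool tsc_write_should_synchronize(struct kvm *kvm, struct kvm_vcpu *vcpu,
					 u64 data, u64 elapsed)
{
	u64 tsc_exp, tsc_hz;

	if (!vcpu->arch.virtual_tsc_khz)
		return false;

	/* vCPU creation/hotplug, or an explicit zero write: always sync. */
	if (data == 0)
		return true;

	/*
	 * The 1-second slop hack only applies once userspace has already
	 * set a TSC; the *first* user write (e.g. state restore after
	 * migration) must not be pulled toward the kernel's creation-time
	 * values.
	 */
	if (!kvm->arch.user_set_tsc)
		return false;

	tsc_exp = kvm->arch.last_tsc_write + nsec_to_cycles(vcpu, elapsed);
	tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;

	return data < tsc_exp + tsc_hz && data + tsc_hz > tsc_exp;
}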
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1a4def36d5bb..9a7dfef9d32d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1324,6 +1324,7 @@ struct kvm_arch {
 	int nr_vcpus_matched_tsc;
 
 	u32 default_tsc_khz;
+	bool user_set_tsc;
 
 	seqcount_raw_spinlock_t pvclock_sc;
 	bool use_master_clock;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c9c81e82e65..0fef6ed69cbb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2714,8 +2714,9 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm_track_tsc_matching(vcpu);
 }
 
-static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
 {
+	u64 data = user_value ? *user_value : 0;
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
 	unsigned long flags;
@@ -2728,14 +2729,17 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 	elapsed = ns - kvm->arch.last_tsc_nsec;
 
 	if (vcpu->arch.virtual_tsc_khz) {
+		/*
+		 * Force synchronization when creating or hotplugging a vCPU,
+		 * i.e. when the TSC value is '0', to help keep clocks stable.
+		 * If this is NOT a hotplug/creation case, skip synchronization
+		 * on the first write from userspace so as not to misconstrue
+		 * state restoration after live migration as an attempt from
+		 * userspace to synchronize.
+		 */
 		if (data == 0) {
-			/*
-			 * detection of vcpu initialization -- need to sync
-			 * with other vCPUs. This particularly helps to keep
-			 * kvm_clock stable after CPU hotplug
-			 */
 			synchronizing = true;
-		} else {
+		} else if (kvm->arch.user_set_tsc) {
 			u64 tsc_exp = kvm->arch.last_tsc_write +
 						nsec_to_cycles(vcpu, elapsed);
 			u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
@@ -2749,6 +2753,9 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 		}
 	}
 
+	if (user_value)
+		kvm->arch.user_set_tsc = true;
+
 	/*
 	 * For a reliable TSC, we can match TSC offsets, and for an unstable
 	 * TSC, we add elapsed time in this computation. We could let the
@@ -3777,7 +3784,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_TSC:
 		if (msr_info->host_initiated) {
-			kvm_synchronize_tsc(vcpu, data);
+			kvm_synchronize_tsc(vcpu, &data);
 		} else {
 			u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
 			adjust_tsc_offset_guest(vcpu, adj);
@@ -11959,7 +11966,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 	if (mutex_lock_killable(&vcpu->mutex))
 		return;
 	vcpu_load(vcpu);
-	kvm_synchronize_tsc(vcpu, 0);
+	kvm_synchronize_tsc(vcpu, NULL);
 	vcpu_put(vcpu);
 
 	/* poll control enabled by default */