Message ID | 20220414183127.4080873-1-romanton@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: x86: Use current rather than snapshotted TSC frequency if it is constant | expand |
+Vitaly On Thu, Apr 14, 2022, Anton Romanov wrote: > Don't snapshot tsc_khz into per-cpu cpu_tsc_khz if the > host TSC is constant, in which case the actual TSC frequency will never > change and thus capturing TSC during initialization is > unnecessary, KVM can simply use tsc_khz. > This value is snapshotted from > kvm_timer_init->kvmclock_cpu_online->tsc_khz_changed(NULL) Nit, please wrap changelogs at ~75 chars. It's not a hard rule, e.g. if running over or cutting early improves readability, then by all means. But wrapping somewhat randomly makes reading the changelog unnecessarily difficult. > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 547ba00ef64f..4ae9a03f549d 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -2907,6 +2907,19 @@ static void kvm_update_masterclock(struct kvm *kvm) > kvm_end_pvclock_update(kvm); > } > > +/* > + * If kvm is built into kernel it is possible that tsc_khz saved into > + * per-cpu cpu_tsc_khz was yet unrefined value. If CPU provides CONSTANT_TSC it > + * doesn't make sense to snapshot it anyway so just return tsc_khz > + */ > +static unsigned long get_cpu_tsc_khz(void) > +{ > + if (static_cpu_has(X86_FEATURE_CONSTANT_TSC)) > + return tsc_khz; > + else > + return __this_cpu_read(cpu_tsc_khz); > +} > + > /* Called within read_seqcount_begin/retry for kvm->pvclock_sc. */ > static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data) > { > @@ -2917,7 +2930,7 @@ static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data) > get_cpu(); > > data->flags = 0; > - if (ka->use_master_clock && __this_cpu_read(cpu_tsc_khz)) { > + if (ka->use_master_clock && get_cpu_tsc_khz()) { It might make sense to open code this to make it more obvious why the "else" path exists. That'd also eliminate a condition branch on CPUs with a constant TSC, though I don't know if we care that much about the performance here. if (ka->use_master_clock && (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) And/or add a comment about cpu_tsc_khz being zero when the CPU is being offlined? > #ifdef CONFIG_X86_64 > struct timespec64 ts; > ... > @@ -8646,9 +8659,12 @@ static void tsc_khz_changed(void *data) > struct cpufreq_freqs *freq = data; > unsigned long khz = 0; > > + if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) > + return; Vitaly, The Hyper-V guest code also sets cpu_tsc_khz, should we WARN if that notifier is invoked and Hyper-V told us there's a constant TSC? diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ab336f7c82e4..ca8e20f5ffc0 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8701,6 +8701,8 @@ static void kvm_hyperv_tsc_notifier(void) struct kvm *kvm; int cpu; + WARN_ON_ONCE(boot_cpu_has(X86_FEATURE_CONSTANT_TSC)); + mutex_lock(&kvm_lock); list_for_each_entry(kvm, &vm_list, vm_list) kvm_make_mclock_inprogress_request(kvm);
On Thu, Apr 14, 2022 at 12:33 PM Sean Christopherson <seanjc@google.com> wrote: > On Thu, Apr 14, 2022, Anton Romanov wrote: > > /* Called within read_seqcount_begin/retry for kvm->pvclock_sc. */ > > static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data) > > { > > @@ -2917,7 +2930,7 @@ static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data) > > get_cpu(); > > > > data->flags = 0; > > - if (ka->use_master_clock && __this_cpu_read(cpu_tsc_khz)) { > > + if (ka->use_master_clock && get_cpu_tsc_khz()) { > > It might make sense to open code this to make it more obvious why the "else" path > exists. That'd also eliminate a condition branch on CPUs with a constant TSC, > though I don't know if we care that much about the performance here. > > if (ka->use_master_clock && > (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) > > And/or add a comment about cpu_tsc_khz being zero when the CPU is being offlined? It looks like cpu_tsc_khz being zero is used as an indicator of CPU being unplugged here. I don't think your proposed change is right in this case either. How about we still keep tsc_khz_changed untouched as well as this line? Potentially adding here a comment that on this line we only read it to see if CPU is not being unplugged yet
On Thu, Apr 14, 2022, Anton Romanov wrote: > On Thu, Apr 14, 2022 at 12:33 PM Sean Christopherson <seanjc@google.com> wrote: > > On Thu, Apr 14, 2022, Anton Romanov wrote: > > > /* Called within read_seqcount_begin/retry for kvm->pvclock_sc. */ > > > static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data) > > > { > > > @@ -2917,7 +2930,7 @@ static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data) > > > get_cpu(); > > > > > > data->flags = 0; > > > - if (ka->use_master_clock && __this_cpu_read(cpu_tsc_khz)) { > > > + if (ka->use_master_clock && get_cpu_tsc_khz()) { > > > > It might make sense to open code this to make it more obvious why the "else" path > > exists. That'd also eliminate a condition branch on CPUs with a constant TSC, > > though I don't know if we care that much about the performance here. > > > > if (ka->use_master_clock && > > (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) > > > > And/or add a comment about cpu_tsc_khz being zero when the CPU is being offlined? > > It looks like cpu_tsc_khz being zero is used as an indicator of CPU > being unplugged here. That's ok, the unplug issue was that kvm_get_time_scale() got stuck in an infinite loop due to cpu_tsc_khz being zero. Using tsc_khz is ok even though the CPU is about to go offline, the CPU going offline doesn't make the calculation wrong. > I don't think your proposed change is right in this case either. > How about we still keep tsc_khz_changed untouched as well as this line? > Potentially adding here a comment that on this line we only read it to > see if CPU is not being unplugged yet
Sean Christopherson <seanjc@google.com> writes: > +Vitaly > > On Thu, Apr 14, 2022, Anton Romanov wrote: ... >> @@ -8646,9 +8659,12 @@ static void tsc_khz_changed(void *data) >> struct cpufreq_freqs *freq = data; >> unsigned long khz = 0; >> >> + if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) >> + return; > > Vitaly, > > The Hyper-V guest code also sets cpu_tsc_khz, should we WARN if that notifier is > invoked and Hyper-V told us there's a constant TSC? > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index ab336f7c82e4..ca8e20f5ffc0 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -8701,6 +8701,8 @@ static void kvm_hyperv_tsc_notifier(void) > struct kvm *kvm; > int cpu; > > + WARN_ON_ONCE(boot_cpu_has(X86_FEATURE_CONSTANT_TSC)); > + > mutex_lock(&kvm_lock); > list_for_each_entry(kvm, &vm_list, vm_list) > kvm_make_mclock_inprogress_request(kvm); > (apologies for the delayed reply) No, I think Hyper-V's "Reenlightenment" feature overrides (re-defines?) X86_FEATURE_CONSTANT_TSC. E.g. I've checked a VM on E5-2667 v4 (Broadwell) CPU with no TSC scaling. This VM has 'constant_tsc' and will certainly get reenlightenment irq on migration. Note, Hyper-V has its own 'Invariant TSC control', see commit dce7cd62754b5 ("x86/hyperv: Allow guests to enable InvariantTSC"). When enabled, X86_FEATURE_TSC_RELIABLE is forced. I *think* (haven't checked as I don't have two suitable hosts to test migration handy) this will suppress reenlightenment so the check should be WARN_ON_ONCE(boot_cpu_has(X86_FEATURE_TSC_RELIABLE)); instead. There is a chance that reenlightenment notifications still arrive but the reported 'new' TSC frequency remains unchanged (silly, but possible), I'll need to check.
On Tue, Apr 19, 2022, Vitaly Kuznetsov wrote: > Sean Christopherson <seanjc@google.com> writes: > > > +Vitaly > > > > On Thu, Apr 14, 2022, Anton Romanov wrote: > > ... > > >> @@ -8646,9 +8659,12 @@ static void tsc_khz_changed(void *data) > >> struct cpufreq_freqs *freq = data; > >> unsigned long khz = 0; > >> > >> + if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) > >> + return; > > > > Vitaly, > > > > The Hyper-V guest code also sets cpu_tsc_khz, should we WARN if that notifier is > > invoked and Hyper-V told us there's a constant TSC? > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index ab336f7c82e4..ca8e20f5ffc0 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -8701,6 +8701,8 @@ static void kvm_hyperv_tsc_notifier(void) > > struct kvm *kvm; > > int cpu; > > > > + WARN_ON_ONCE(boot_cpu_has(X86_FEATURE_CONSTANT_TSC)); > > + > > mutex_lock(&kvm_lock); > > list_for_each_entry(kvm, &vm_list, vm_list) > > kvm_make_mclock_inprogress_request(kvm); > > > > (apologies for the delayed reply) > > No, I think Hyper-V's "Reenlightenment" feature overrides (re-defines?) > X86_FEATURE_CONSTANT_TSC. E.g. I've checked a VM on E5-2667 v4 > (Broadwell) CPU with no TSC scaling. This VM has 'constant_tsc' and will > certainly get reenlightenment irq on migration. Ooh, so that a VM with a constant TSC be live migrated to another system with a constant, but different, TSC. Does the below look correct as fixup for this patch? diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ab336f7c82e4..a944e4ba5532 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8708,10 +8708,18 @@ static void kvm_hyperv_tsc_notifier(void) /* no guest entries from this point */ hyperv_stop_tsc_emulation(); - /* TSC frequency always matches when on Hyper-V */ - for_each_present_cpu(cpu) - per_cpu(cpu_tsc_khz, cpu) = tsc_khz; - kvm_max_guest_tsc_khz = tsc_khz; + /* + * TSC frequency always matches when on Hyper-V. Skip the updates if + * the TSC is "officially" constant, in which case KVM doesn't use the + * per-CPU and max variables. Note, the notifier can still fire with + * a constant TSC, e.g. if this VM (KVM is a Hyper-V guest) is migrated + * to a system with a different TSC frequency. + */ + if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) { + for_each_present_cpu(cpu) + per_cpu(cpu_tsc_khz, cpu) = tsc_khz; + kvm_max_guest_tsc_khz = tsc_khz; + } list_for_each_entry(kvm, &vm_list, vm_list) { __kvm_start_pvclock_update(kvm);
Sean Christopherson <seanjc@google.com> writes: > On Tue, Apr 19, 2022, Vitaly Kuznetsov wrote: >> Sean Christopherson <seanjc@google.com> writes: >> >> > +Vitaly >> > >> > On Thu, Apr 14, 2022, Anton Romanov wrote: >> >> ... >> >> >> @@ -8646,9 +8659,12 @@ static void tsc_khz_changed(void *data) >> >> struct cpufreq_freqs *freq = data; >> >> unsigned long khz = 0; >> >> >> >> + if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) >> >> + return; >> > >> > Vitaly, >> > >> > The Hyper-V guest code also sets cpu_tsc_khz, should we WARN if that notifier is >> > invoked and Hyper-V told us there's a constant TSC? >> > >> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >> > index ab336f7c82e4..ca8e20f5ffc0 100644 >> > --- a/arch/x86/kvm/x86.c >> > +++ b/arch/x86/kvm/x86.c >> > @@ -8701,6 +8701,8 @@ static void kvm_hyperv_tsc_notifier(void) >> > struct kvm *kvm; >> > int cpu; >> > >> > + WARN_ON_ONCE(boot_cpu_has(X86_FEATURE_CONSTANT_TSC)); >> > + >> > mutex_lock(&kvm_lock); >> > list_for_each_entry(kvm, &vm_list, vm_list) >> > kvm_make_mclock_inprogress_request(kvm); >> > >> >> (apologies for the delayed reply) >> >> No, I think Hyper-V's "Reenlightenment" feature overrides (re-defines?) >> X86_FEATURE_CONSTANT_TSC. E.g. I've checked a VM on E5-2667 v4 >> (Broadwell) CPU with no TSC scaling. This VM has 'constant_tsc' and will >> certainly get reenlightenment irq on migration. > > Ooh, so that a VM with a constant TSC be live migrated to another system with a > constant, but different, TSC. Does the below look correct as fixup for this patch? > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index ab336f7c82e4..a944e4ba5532 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -8708,10 +8708,18 @@ static void kvm_hyperv_tsc_notifier(void) > /* no guest entries from this point */ > hyperv_stop_tsc_emulation(); > > - /* TSC frequency always matches when on Hyper-V */ > - for_each_present_cpu(cpu) > - per_cpu(cpu_tsc_khz, cpu) = tsc_khz; > - kvm_max_guest_tsc_khz = tsc_khz; > + /* > + * TSC frequency always matches when on Hyper-V. Skip the updates if > + * the TSC is "officially" constant, in which case KVM doesn't use the > + * per-CPU and max variables. Note, the notifier can still fire with > + * a constant TSC, e.g. if this VM (KVM is a Hyper-V guest) is migrated > + * to a system with a different TSC frequency. > + */ > + if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) { > + for_each_present_cpu(cpu) > + per_cpu(cpu_tsc_khz, cpu) = tsc_khz; > + kvm_max_guest_tsc_khz = tsc_khz; > + } Looks good for cpu_tsc_khz but I'm not particularly sure about kvm_max_guest_tsc_khz. AFAIU, kvm_max_guest_tsc_khz is normally only set when TSC scaling is available (kvm_has_tsc_control) and Hyper-V wasn't exposing it as the field was missing in Enlightened VMCS. The most recent version (https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/datatypes/hv_vmx_enlightened_vmcs), however, has 'TscMultiplier' so I guess it's possible now. (Note: eVMCS version remains '1', just a few fields were added, interesting). Another thing is why do we set kvm_max_guest_tsc_khz to 'tsc_khz' as normally, this is the maximum possible guest-VM TSC frequency (see kvm_arch_hardware_setup()). With reenlightenment, the VM can be migrated to a host with different TSC frequency, this means the maximum possible guest-VM TSC frequency may change. What do we do with L2 VM with 'unsupported' configurations after that? TL;DR: I *think* we can drop kvm_max_guest_tsc_khz setting from kvm_hyperv_tsc_notifier() for now as it a) doesn't seem to be needed for non-tscscaling case and b) correct for the tsc-scaling case. I'll need to investigate how recent Hyper-V versions work when CPU offers TSC-scaling feature. > > list_for_each_entry(kvm, &vm_list, vm_list) { > __kvm_start_pvclock_update(kvm); >
On Tue, Apr 19, 2022, Vitaly Kuznetsov wrote: > Sean Christopherson <seanjc@google.com> writes: > > > On Tue, Apr 19, 2022, Vitaly Kuznetsov wrote: > >> Sean Christopherson <seanjc@google.com> writes: > >> > The Hyper-V guest code also sets cpu_tsc_khz, should we WARN if that notifier is > >> > invoked and Hyper-V told us there's a constant TSC? ... > >> (apologies for the delayed reply) > >> > >> No, I think Hyper-V's "Reenlightenment" feature overrides (re-defines?) > >> X86_FEATURE_CONSTANT_TSC. E.g. I've checked a VM on E5-2667 v4 > >> (Broadwell) CPU with no TSC scaling. This VM has 'constant_tsc' and will > >> certainly get reenlightenment irq on migration. > > > > Ooh, so that a VM with a constant TSC be live migrated to another system with a > > constant, but different, TSC. Does the below look correct as fixup for this patch? > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index ab336f7c82e4..a944e4ba5532 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -8708,10 +8708,18 @@ static void kvm_hyperv_tsc_notifier(void) > > /* no guest entries from this point */ > > hyperv_stop_tsc_emulation(); > > > > - /* TSC frequency always matches when on Hyper-V */ > > - for_each_present_cpu(cpu) > > - per_cpu(cpu_tsc_khz, cpu) = tsc_khz; > > - kvm_max_guest_tsc_khz = tsc_khz; > > + /* > > + * TSC frequency always matches when on Hyper-V. Skip the updates if > > + * the TSC is "officially" constant, in which case KVM doesn't use the > > + * per-CPU and max variables. Note, the notifier can still fire with > > + * a constant TSC, e.g. if this VM (KVM is a Hyper-V guest) is migrated > > + * to a system with a different TSC frequency. > > + */ > > + if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) { > > + for_each_present_cpu(cpu) > > + per_cpu(cpu_tsc_khz, cpu) = tsc_khz; > > + kvm_max_guest_tsc_khz = tsc_khz; > > + } > > Looks good for cpu_tsc_khz but I'm not particularly sure about > kvm_max_guest_tsc_khz. Doh, ignore that, I got kvm_max_guest_tsc_khz confused with max_tsc_khz.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 547ba00ef64f..4ae9a03f549d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2907,6 +2907,19 @@ static void kvm_update_masterclock(struct kvm *kvm) kvm_end_pvclock_update(kvm); } +/* + * If kvm is built into kernel it is possible that tsc_khz saved into + * per-cpu cpu_tsc_khz was yet unrefined value. If CPU provides CONSTANT_TSC it + * doesn't make sense to snapshot it anyway so just return tsc_khz + */ +static unsigned long get_cpu_tsc_khz(void) +{ + if (static_cpu_has(X86_FEATURE_CONSTANT_TSC)) + return tsc_khz; + else + return __this_cpu_read(cpu_tsc_khz); +} + /* Called within read_seqcount_begin/retry for kvm->pvclock_sc. */ static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data) { @@ -2917,7 +2930,7 @@ static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data) get_cpu(); data->flags = 0; - if (ka->use_master_clock && __this_cpu_read(cpu_tsc_khz)) { + if (ka->use_master_clock && get_cpu_tsc_khz()) { #ifdef CONFIG_X86_64 struct timespec64 ts; @@ -2931,7 +2944,7 @@ static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data) data->flags |= KVM_CLOCK_TSC_STABLE; hv_clock.tsc_timestamp = ka->master_cycle_now; hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset; - kvm_get_time_scale(NSEC_PER_SEC, __this_cpu_read(cpu_tsc_khz) * 1000LL, + kvm_get_time_scale(NSEC_PER_SEC, get_cpu_tsc_khz() * 1000LL, &hv_clock.tsc_shift, &hv_clock.tsc_to_system_mul); data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc); @@ -3049,7 +3062,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v) /* Keep irq disabled to prevent changes to the clock */ local_irq_save(flags); - tgt_tsc_khz = __this_cpu_read(cpu_tsc_khz); + tgt_tsc_khz = get_cpu_tsc_khz(); if (unlikely(tgt_tsc_khz == 0)) { local_irq_restore(flags); kvm_make_request(KVM_REQ_CLOCK_UPDATE, v); @@ -8646,9 +8659,12 @@ static void tsc_khz_changed(void *data) struct cpufreq_freqs *freq = data; unsigned long khz = 0; + if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) + return; + if (data) khz = freq->new; - else if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) + else khz = cpufreq_quick_get(raw_smp_processor_id()); if (!khz) khz = tsc_khz;
Don't snapshot tsc_khz into per-cpu cpu_tsc_khz if the host TSC is constant, in which case the actual TSC frequency will never change and thus capturing TSC during initialization is unnecessary, KVM can simply use tsc_khz. This value is snapshotted from kvm_timer_init->kvmclock_cpu_online->tsc_khz_changed(NULL) On CPUs with constant TSC, but not a hardware-specified TSC frequency, snapshotting cpu_tsc_khz and using that to set a VM's target TSC frequency can lead to VM to think its TSC frequency is not what it actually is if refining the TSC completes after KVM snapshots tsc_khz. The actual frequency never changes, only the kernel's calculation of what that frequency is changes. Ideally, KVM would not be able to race with TSC refinement, or would have a hook into tsc_refine_calibration_work() to get an alert when refinement is complete. Avoiding the race altogether isn't practical as refinement takes a relative eternity; it's deliberately put on a work queue outside of the normal boot sequence to avoid unnecessarily delaying boot. Adding a hook is doable, but somewhat gross due to KVM's ability to be built as a module. And if the TSC is constant, which is likely the case for every VMX/SVM-capable CPU produced in the last decade, the race can be hit if and only if userspace is able to create a VM before TSC refinement completes; refinement is slow, but not that slow. For now, punt on a proper fix, as not taking a snapshot can help some uses cases and not taking a snapshot is arguably correct irrespective of the race with refinement. Signed-off-by: Anton Romanov <romanton@google.com> --- arch/x86/kvm/x86.c | 24 ++++++++++++++++++++---- 1 file changed, 20 insertions(+), 4 deletions(-)