Message ID | 20230322093117.48335-1-likexu@tencent.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v2] KVM: x86/pmu: Fix emulation on Intel counters' bit width |
On Wed, Mar 22, 2023 at 10:31 AM Like Xu <like.xu.linux@gmail.com> wrote:
>
> From: Like Xu <likexu@tencent.com>
>
> Per Intel SDM, the bit width of a PMU counter is specified via CPUID
> only if the vCPU has FW_WRITE[bit 13] on IA32_PERF_CAPABILITIES.
> When the FW_WRITE bit is not set, only EAX is valid and out-of-bounds
> bits accesses do not generate #GP. Conversely when this bit is set, #GP
> for out-of-bounds bits accesses will also appear on the fixed counters.
> vPMU currently does not support emulation of bit widths lower than 32
> bits or higher than its host capability.

Can you please point out the date and paragraph of the SDM?

Paolo
On 27/3/2023 10:30 pm, Paolo Bonzini wrote:
> On Wed, Mar 22, 2023 at 10:31 AM Like Xu <like.xu.linux@gmail.com> wrote:
>>
>> From: Like Xu <likexu@tencent.com>
>>
>> Per Intel SDM, the bit width of a PMU counter is specified via CPUID
>> only if the vCPU has FW_WRITE[bit 13] on IA32_PERF_CAPABILITIES.
>> When the FW_WRITE bit is not set, only EAX is valid and out-of-bounds
>> bits accesses do not generate #GP. Conversely when this bit is set, #GP
>> for out-of-bounds bits accesses will also appear on the fixed counters.
>> vPMU currently does not support emulation of bit widths lower than 32
>> bits or higher than its host capability.
>
> Can you please point out the date and paragraph of the SDM?
>
> Paolo
>

25462-078US, December 2022

20.2.6 Full-Width Writes to Performance Counter Registers

The general-purpose performance counter registers IA32_PMCx are writable via
WRMSR instruction. However, the value written into IA32_PMCx by WRMSR is the
signed extended 64-bit value of the EAX[31:0] input of WRMSR.

A processor that supports full-width writes to the general-purpose performance
counters enumerated by CPUID.0AH:EAX[15:8] will set IA32_PERF_CAPABILITIES[13]
to enumerate its full-width-write capability See Figure 20-65.

If IA32_PERF_CAPABILITIES.FW_WRITE[bit 13] =1, each IA32_PMCi is accompanied by a
corresponding alias address starting at 4C1H for IA32_A_PMC0.

The bit width of the performance monitoring counters is specified in
CPUID.0AH:EAX[23:16].
If IA32_A_PMCi is present, the 64-bit input value (EDX:EAX) of WRMSR to
IA32_A_PMCi will cause IA32_PMCi to be updated by:

    COUNTERWIDTH =
        CPUID.0AH:EAX[23:16] bit width of the performance monitoring counter
    IA32_PMCi[COUNTERWIDTH-1:32] := EDX[COUNTERWIDTH-33:0]);
    IA32_PMCi[31:0] := EAX[31:0];
    EDX[63:COUNTERWIDTH] are reserved

---

Some might argue that this is all talking about GP counters, not
fixed counters. In fact, the full-width write hw behaviour is
presumed to do the same thing for all counters.

Commercial hardware does not use counters narrower than 32 bits, nor an
unusual width such as 46 bits. KVM user space (such as selftests) may
nevertheless set a strange bit width, for example 33 bits, and with the
current code, writing the reserved bits of the fixed counters does not
cause a #GP.

Also, when the guest does not have the Full-Width feature, CPUID can make
the fixed counters more than 32 bits wide while the GP counters are only
32 bits wide, which is also monstrous.

The current KVM is also unable to emulate counter overflow when KVM user
space sets a bit width of less than 32 bits w/ FW_WRITE.

This SDM-undefined behaviour led to this fix, which may lift some of the fog.
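
To make the two write paths quoted from the SDM concrete, here is a minimal sketch in plain C. The function names are hypothetical (they are neither KVM nor SDM identifiers), and truncating the sign-extended value to the counter width is an assumption of this sketch rather than something the quoted text spells out.

```c
#include <stdint.h>

/* Hypothetical helper: mask covering the low `width` bits (0 <= width <= 64). */
uint64_t pmc_width_mask(unsigned int width)
{
	return width >= 64 ? UINT64_MAX : (UINT64_C(1) << width) - 1;
}

/*
 * Legacy IA32_PMCi write: only EAX[31:0] is consumed and it is
 * sign-extended to 64 bits. Keeping only the low `width` bits is an
 * assumption of this sketch.
 */
uint64_t legacy_pmc_write(uint32_t eax, unsigned int width)
{
	uint64_t value = (uint64_t)(int64_t)(int32_t)eax;

	return value & pmc_width_mask(width);
}

/*
 * Full-width IA32_A_PMCi write: the counter takes EDX:EAX, with
 * IA32_PMCi[31:0] := EAX and the upper counter bits coming from EDX.
 * The caller is assumed to have already rejected reserved bits.
 */
uint64_t alias_pmc_write(uint32_t edx, uint32_t eax, unsigned int width)
{
	uint64_t value = ((uint64_t)edx << 32) | eax;

	return value & pmc_width_mask(width);
}
```

With a 48-bit counter, a legacy write of EAX = 0x80000000 ends up with all of bits 47:31 set, whereas the alias path lets software choose bits 47:32 freely through EDX.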
On 3/28/23 11:16, Like Xu wrote:
>
> If IA32_PERF_CAPABILITIES.FW_WRITE[bit 13] =1, each IA32_PMCi is
> accompanied by a corresponding alias address starting at 4C1H for
> IA32_A_PMC0.
>
> The bit width of the performance monitoring counters is specified in
> CPUID.0AH:EAX[23:16].
> If IA32_A_PMCi is present, the 64-bit input value (EDX:EAX) of WRMSR to
> IA32_A_PMCi will cause IA32_PMCi to be updated by:
>
>     COUNTERWIDTH =
>         CPUID.0AH:EAX[23:16] bit width of the performance monitoring counter
>     IA32_PMCi[COUNTERWIDTH-1:32] := EDX[COUNTERWIDTH-33:0]);
>     IA32_PMCi[31:0] := EAX[31:0];
>     EDX[63:COUNTERWIDTH] are reserved
>
> ---
>
> Some might argue that this is all talking about GP counters, not
> fixed counters. In fact, the full-width write hw behaviour is
> presumed to do the same thing for all counters.

But the above behavior, and the #GP, is only true for IA32_A_PMCi (the
full-width MSR). Did I understand correctly that the behavior for fixed
counters is changed without introducing an alias MSR?

Paolo
On 28/3/2023 5:20 pm, Paolo Bonzini wrote:
> On 3/28/23 11:16, Like Xu wrote:
>>
>> If IA32_PERF_CAPABILITIES.FW_WRITE[bit 13] =1, each IA32_PMCi is
>> accompanied by a corresponding alias address starting at 4C1H for
>> IA32_A_PMC0.
>>
>> The bit width of the performance monitoring counters is specified in
>> CPUID.0AH:EAX[23:16].
>> If IA32_A_PMCi is present, the 64-bit input value (EDX:EAX) of WRMSR to
>> IA32_A_PMCi will cause IA32_PMCi to be updated by:
>>
>>     COUNTERWIDTH =
>>         CPUID.0AH:EAX[23:16] bit width of the performance monitoring counter
>>     IA32_PMCi[COUNTERWIDTH-1:32] := EDX[COUNTERWIDTH-33:0]);
>>     IA32_PMCi[31:0] := EAX[31:0];
>>     EDX[63:COUNTERWIDTH] are reserved
>>
>> ---
>>
>> Some might argue that this is all talking about GP counters, not
>> fixed counters. In fact, the full-width write hw behaviour is
>> presumed to do the same thing for all counters.
>
> But the above behavior, and the #GP, is only true for IA32_A_PMCi (the
> full-width MSR). Did I understand correctly that the behavior for fixed
> counters is changed without introducing an alias MSR?
>
> Paolo
>

If true, why introduce those alias MSRs?

My archaeological findings are:

a platform w/o full-width writes, like Westmere (which already has 3 fixed
counters), is declared to have a counter width of (R:48, W:32), and its
successor Sandy Bridge has (R:48, W:32/48).

Thus I think the behaviour of the fixed counters has changed from there, and
the alias GP MSRs were introduced to keep support for 32-bit writes to the GP
counters (via the original addresses).

[*] Intel® 64 and IA-32 Architectures Software Developer’s Manual
Documentation Changes (252046-030, January 2011), Table 30-18 Core PMU
Comparison.
On Tue, Mar 28, 2023, Like Xu wrote:
> On 28/3/2023 5:20 pm, Paolo Bonzini wrote:
> > On 3/28/23 11:16, Like Xu wrote:
> > >
> > > If IA32_PERF_CAPABILITIES.FW_WRITE[bit 13] =1, each IA32_PMCi is
> > > accompanied by a corresponding alias address starting at 4C1H for
> > > IA32_A_PMC0.
> > >
> > > The bit width of the performance monitoring counters is specified in
> > > CPUID.0AH:EAX[23:16].
> > > If IA32_A_PMCi is present, the 64-bit input value (EDX:EAX) of WRMSR
> > > to IA32_A_PMCi will cause IA32_PMCi to be updated by:
> > >
> > >     COUNTERWIDTH =
> > >         CPUID.0AH:EAX[23:16] bit width of the performance monitoring counter
> > >     IA32_PMCi[COUNTERWIDTH-1:32] := EDX[COUNTERWIDTH-33:0]);
> > >     IA32_PMCi[31:0] := EAX[31:0];
> > >     EDX[63:COUNTERWIDTH] are reserved
> > >
> > > ---
> > >
> > > Some might argue that this is all talking about GP counters, not
> > > fixed counters. In fact, the full-width write hw behaviour is
> > > presumed to do the same thing for all counters.
> >
> > But the above behavior, and the #GP, is only true for IA32_A_PMCi (the
> > full-width MSR). Did I understand correctly that the behavior for fixed
> > counters is changed without introducing an alias MSR?
> >
> > Paolo
> >
>
> If true, why introduce those alias MSRs?

My guess is there is/was software in the field that wrote -1 to the GP counters,
i.e. would have been broken by the new #GP behavior.

> My archaeological findings are:
>
> a platform w/o full-width writes, like Westmere (which already has 3 fixed
> counters), is declared to have a counter width of (R:48, W:32), and its
> successor Sandy Bridge has (R:48, W:32/48).
>
> Thus I think the behaviour of the fixed counters has changed from there, and
> the alias GP MSRs were introduced to keep support for 32-bit writes to the GP
> counters (via the original addresses).

FWIW, I see the #GP behavior for fixed counters on Haswell, so this does seem to
be the case.

That said, I would like to get confirmation from Intel that this is architectural and/or working as intended. Like, can you follow up with Intel to get clarification/confirmation? And ideally an SDM update...
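
As a small illustration of the guess above about software writing -1: under a reserved-bit check, an all-ones value has bits set above any counter width below 64, so it would be rejected. The helper name below is hypothetical, and the check mirrors only the mask test discussed in this thread, not confirmed hardware behaviour.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if a full-width write of `data`
 * touches bits above the counter width, i.e. the case that is expected
 * to raise #GP in this discussion.
 */
bool full_width_write_faults(uint64_t data, unsigned int width)
{
	uint64_t mask = width >= 64 ? UINT64_MAX
				    : (UINT64_C(1) << width) - 1;

	return (data & ~mask) != 0;
}
```

For a 48-bit counter, full_width_write_faults((uint64_t)-1, 48) is true, while the largest in-range value 0xFFFFFFFFFFFF passes the check.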
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index e8a3be0b9df9..d38b820d6b9e 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -470,6 +470,12 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			pmc_update_sample_period(pmc);
 			return 0;
 		} else if ((pmc = get_fixed_pmc(pmu, msr))) {
+			if (fw_writes_is_enabled(vcpu)) {
+				if (data & ~pmu->counter_bitmask[KVM_PMC_FIXED])
+					return 1;
+			} else if (!msr_info->host_initiated) {
+				data = (s64)(s32)data;
+			}
 			pmc->counter += data - pmc_read_counter(pmc);
 			pmc_update_sample_period(pmc);
 			return 0;
@@ -516,6 +522,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 	union cpuid10_edx edx;
 	u64 perf_capabilities;
 	u64 counter_mask;
+	bool fw_wr = fw_writes_is_enabled(vcpu);
 	int i;
 
 	pmu->nr_arch_gp_counters = 0;
@@ -543,6 +550,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 
 	pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
 					 kvm_pmu_cap.num_counters_gp);
+	eax.split.bit_width = fw_wr ? max_t(int, 32, eax.split.bit_width) : 32;
 	eax.split.bit_width = min_t(int, eax.split.bit_width,
 				    kvm_pmu_cap.bit_width_gp);
 	pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << eax.split.bit_width) - 1;
@@ -558,6 +566,8 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 			min3(ARRAY_SIZE(fixed_pmc_events),
 			     (size_t) edx.split.num_counters_fixed,
 			     (size_t)kvm_pmu_cap.num_counters_fixed);
+		edx.split.bit_width_fixed = fw_wr ?
+			max_t(int, 32, edx.split.bit_width_fixed) : 32;
 		edx.split.bit_width_fixed = min_t(int, edx.split.bit_width_fixed,
 						  kvm_pmu_cap.bit_width_fixed);
 		pmu->counter_bitmask[KVM_PMC_FIXED] =
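
For readers who want to check the width clamping in intel_pmu_refresh() outside the kernel, the following standalone sketch reproduces the same arithmetic. The function names are hypothetical; the logic follows the hunks above (raise to at least 32 bits when full-width writes are enabled, force 32 bits otherwise, then cap at the host width), but it is only an approximation of the patch, not the kernel code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the patch: pick the emulated counter
 * width from the guest CPUID width, the host capability, and whether
 * the guest has FW_WRITE.
 */
unsigned int effective_bit_width(unsigned int cpuid_width,
				 unsigned int host_width, bool fw_writes)
{
	unsigned int width = 32;	/* without FW_WRITE, always 32 bits */

	if (fw_writes && cpuid_width > 32)
		width = cpuid_width;	/* max(32, cpuid_width) */

	return width < host_width ? width : host_width;
}

/* Same shape as ((u64)1 << bit_width) - 1, guarded against width 64. */
uint64_t width_to_bitmask(unsigned int width)
{
	return width >= 64 ? UINT64_MAX : (UINT64_C(1) << width) - 1;
}
```

Under these assumptions, a user-space-provided width of 33 without FW_WRITE collapses to 32, a width of 16 with FW_WRITE is raised to 32, and a width of 56 on a 48-bit host is capped at 48.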