Message ID | 20250121142510.358996-1-andrew.cooper3@citrix.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [for-4.20] x86/intel: Fix PERF_GLOBAL fixup when virtualised | expand |
On 21.01.2025 15:25, Andrew Cooper wrote: > Logic using performance counters needs to look at > MSR_MISC_ENABLE.PERF_AVAILABLE before touching any other resources. > > When virtualised under ESX, Xen dies with a #GP fault trying to read > MSR_CORE_PERF_GLOBAL_CTRL. > > Factor this logic out into a separate function (it's already too squashed to > the RHS), and insert a check of MSR_MISC_ENABLE.PERF_AVAILABLE. > > This also limits setting X86_FEATURE_ARCH_PERFMON, although oprofile (the only > consumer of this flag) cross-checks too. > > Reported-by: Jonathan Katz <jonathan.katz@aptar.com> > Link: https://xcp-ng.org/forum/topic/10286/nesting-xcp-ng-on-esx-8 > Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> > --- > CC: Jan Beulich <JBeulich@suse.com> > CC: Roger Pau Monné <roger.pau@citrix.com> > CC: Oleksii Kurochko <oleksii.kurochko@gmail.com> > > Untested, but this is the same pattern used by oprofile and watchdog setup. Wow, in the oprofile case with pretty bad open-coding. > I've intentionally stopped using Intel style. This file is already mixed (as > visible even in context), and it doesn't remotely resemble it's Linux origin > any more. I guess you mean s/Intel/Linux/ here? (Yes, I'm entirely fine with using Xen style there.) > --- a/xen/arch/x86/cpu/intel.c > +++ b/xen/arch/x86/cpu/intel.c > @@ -535,39 +535,49 @@ static void intel_log_freq(const struct cpuinfo_x86 *c) > printk("%u MHz\n", (factor * max_ratio + 50) / 100); > } > > +static void init_intel_perf(struct cpuinfo_x86 *c) > +{ > + uint64_t val; > + unsigned int eax, ver, nr_cnt; > + > + if ( c->cpuid_level <= 9 || > + rdmsr_safe(MSR_IA32_MISC_ENABLE, val) || In e.g. intel_unlock_cpuid_leaves() and early_init_intel() and in particular also in boot/head.S we access this MSR without recovery attached. Is there a reason rdmsr_safe() needs using here? > + !(val & MSR_IA32_MISC_ENABLE_PERF_AVAIL) ) > + return; > + > + eax = cpuid_eax(10); > + ver = eax & 0xff; > + nr_cnt = (eax >> 8) & 0xff; > + > + if ( ver && nr_cnt > 1 && nr_cnt <= 32 ) > + { > + unsigned int cnt_mask = (1UL << nr_cnt) - 1; > + > + /* > + * On (some?) Sapphire/Emerald Rapids platforms each package-BSP > + * starts with all the enable bits for the general-purpose PMCs > + * cleared. Adjust so counters can be enabled from EVNTSEL. > + */ > + rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, val); > + > + if ( (val & cnt_mask) != cnt_mask ) > + { > + printk("FIRMWARE BUG: CPU%u invalid PERF_GLOBAL_CTRL: %#"PRIx64" adjusting to %#"PRIx64"\n", > + smp_processor_id(), val, val | cnt_mask); > + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, val | cnt_mask); > + } > + } > + > + __set_bit(X86_FEATURE_ARCH_PERFMON, c->x86_capability); This moved, without the description suggesting the move is intentional. It did live at the end of the earlier scope before, ... > +} > + > static void cf_check init_intel(struct cpuinfo_x86 *c) > { > /* Detect the extended topology information if available */ > detect_extended_topology(c); > > init_intel_cacheinfo(c); > - if (c->cpuid_level > 9) { > - unsigned eax = cpuid_eax(10); > - unsigned int cnt = (eax >> 8) & 0xff; > - > - /* Check for version and the number of counters */ > - if ((eax & 0xff) && (cnt > 1) && (cnt <= 32)) { > - uint64_t global_ctrl; > - unsigned int cnt_mask = (1UL << cnt) - 1; > - > - /* > - * On (some?) Sapphire/Emerald Rapids platforms each > - * package-BSP starts with all the enable bits for the > - * general-purpose PMCs cleared. Adjust so counters > - * can be enabled from EVNTSEL. > - */ > - rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, global_ctrl); > - if ((global_ctrl & cnt_mask) != cnt_mask) { > - printk("CPU%u: invalid PERF_GLOBAL_CTRL: %#" > - PRIx64 " adjusting to %#" PRIx64 "\n", > - smp_processor_id(), global_ctrl, > - global_ctrl | cnt_mask); > - wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, > - global_ctrl | cnt_mask); > - } > - __set_bit(X86_FEATURE_ARCH_PERFMON, c->x86_capability); > - } > - } ... as can be seen here. Jan > + init_intel_perf(c); > > if ( !cpu_has(c, X86_FEATURE_XTOPOLOGY) ) > { > > base-commit: c3f5d1bb40b57d467cb4051eafa86f5933ec9003
On 21/01/2025 3:03 pm, Jan Beulich wrote: > On 21.01.2025 15:25, Andrew Cooper wrote: >> Logic using performance counters needs to look at >> MSR_MISC_ENABLE.PERF_AVAILABLE before touching any other resources. >> >> When virtualised under ESX, Xen dies with a #GP fault trying to read >> MSR_CORE_PERF_GLOBAL_CTRL. >> >> Factor this logic out into a separate function (it's already too squashed to >> the RHS), and insert a check of MSR_MISC_ENABLE.PERF_AVAILABLE. >> >> This also limits setting X86_FEATURE_ARCH_PERFMON, although oprofile (the only >> consumer of this flag) cross-checks too. Fixes: 6bdb965178bb ("x86/intel: ensure Global Performance Counter Control is setup correctly") (fixed up locally). >> Reported-by: Jonathan Katz <jonathan.katz@aptar.com> >> Link: https://xcp-ng.org/forum/topic/10286/nesting-xcp-ng-on-esx-8 >> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> >> --- >> CC: Jan Beulich <JBeulich@suse.com> >> CC: Roger Pau Monné <roger.pau@citrix.com> >> CC: Oleksii Kurochko <oleksii.kurochko@gmail.com> >> >> Untested, but this is the same pattern used by oprofile and watchdog setup. > Wow, in the oprofile case with pretty bad open-coding. > >> I've intentionally stopped using Intel style. This file is already mixed (as >> visible even in context), and it doesn't remotely resemble it's Linux origin >> any more. > I guess you mean s/Intel/Linux/ here? (Yes, I'm entirely fine with using Xen > style there.) Oops yes. > >> --- a/xen/arch/x86/cpu/intel.c >> +++ b/xen/arch/x86/cpu/intel.c >> @@ -535,39 +535,49 @@ static void intel_log_freq(const struct cpuinfo_x86 *c) >> printk("%u MHz\n", (factor * max_ratio + 50) / 100); >> } >> >> +static void init_intel_perf(struct cpuinfo_x86 *c) >> +{ >> + uint64_t val; >> + unsigned int eax, ver, nr_cnt; >> + >> + if ( c->cpuid_level <= 9 || >> + rdmsr_safe(MSR_IA32_MISC_ENABLE, val) || > In e.g. intel_unlock_cpuid_leaves() and early_init_intel() and in particular > also in boot/head.S we access this MSR without recovery attached. Is there a > reason rdmsr_safe() needs using here? Abundance of caution. cpufreq/hwp.c uses a safe accessor. Given the regular NMI path works, I doubt we need the _safe() here. As future work, it's accessed loads of times, so I'm highly tempted to have the BSP sanitise it once, and have the APs copy the "global" value. > >> + !(val & MSR_IA32_MISC_ENABLE_PERF_AVAIL) ) >> + return; >> + >> + eax = cpuid_eax(10); >> + ver = eax & 0xff; >> + nr_cnt = (eax >> 8) & 0xff; >> + >> + if ( ver && nr_cnt > 1 && nr_cnt <= 32 ) >> + { >> + unsigned int cnt_mask = (1UL << nr_cnt) - 1; >> + >> + /* >> + * On (some?) Sapphire/Emerald Rapids platforms each package-BSP >> + * starts with all the enable bits for the general-purpose PMCs >> + * cleared. Adjust so counters can be enabled from EVNTSEL. >> + */ >> + rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, val); >> + >> + if ( (val & cnt_mask) != cnt_mask ) >> + { >> + printk("FIRMWARE BUG: CPU%u invalid PERF_GLOBAL_CTRL: %#"PRIx64" adjusting to %#"PRIx64"\n", >> + smp_processor_id(), val, val | cnt_mask); >> + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, val | cnt_mask); >> + } >> + } >> + >> + __set_bit(X86_FEATURE_ARCH_PERFMON, c->x86_capability); > This moved, without the description suggesting the move is intentional. > It did live at the end of the earlier scope before, ... Final paragraph of the commit message? If PERF_AVAIL is clear, we don't have ARCH_PERFMON, whatever the CPUID leaves say. OTOH, this bit really doesn't serve much value. Given oprofile cross-checks everything anyway, I think it can be dropped. ~Andrew
On 21.01.2025 16:23, Andrew Cooper wrote: > On 21/01/2025 3:03 pm, Jan Beulich wrote: >> On 21.01.2025 15:25, Andrew Cooper wrote: >>> Logic using performance counters needs to look at >>> MSR_MISC_ENABLE.PERF_AVAILABLE before touching any other resources. >>> >>> When virtualised under ESX, Xen dies with a #GP fault trying to read >>> MSR_CORE_PERF_GLOBAL_CTRL. >>> >>> Factor this logic out into a separate function (it's already too squashed to >>> the RHS), and insert a check of MSR_MISC_ENABLE.PERF_AVAILABLE. >>> >>> This also limits setting X86_FEATURE_ARCH_PERFMON, although oprofile (the only >>> consumer of this flag) cross-checks too. > > Fixes: 6bdb965178bb ("x86/intel: ensure Global Performance Counter > Control is setup correctly") > > (fixed up locally). > >>> Reported-by: Jonathan Katz <jonathan.katz@aptar.com> >>> Link: https://xcp-ng.org/forum/topic/10286/nesting-xcp-ng-on-esx-8 >>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> >>> --- >>> CC: Jan Beulich <JBeulich@suse.com> >>> CC: Roger Pau Monné <roger.pau@citrix.com> >>> CC: Oleksii Kurochko <oleksii.kurochko@gmail.com> >>> >>> Untested, but this is the same pattern used by oprofile and watchdog setup. >> Wow, in the oprofile case with pretty bad open-coding. >> >>> I've intentionally stopped using Intel style. This file is already mixed (as >>> visible even in context), and it doesn't remotely resemble it's Linux origin >>> any more. >> I guess you mean s/Intel/Linux/ here? (Yes, I'm entirely fine with using Xen >> style there.) > > Oops yes. > >> >>> --- a/xen/arch/x86/cpu/intel.c >>> +++ b/xen/arch/x86/cpu/intel.c >>> @@ -535,39 +535,49 @@ static void intel_log_freq(const struct cpuinfo_x86 *c) >>> printk("%u MHz\n", (factor * max_ratio + 50) / 100); >>> } >>> >>> +static void init_intel_perf(struct cpuinfo_x86 *c) >>> +{ >>> + uint64_t val; >>> + unsigned int eax, ver, nr_cnt; >>> + >>> + if ( c->cpuid_level <= 9 || >>> + rdmsr_safe(MSR_IA32_MISC_ENABLE, val) || >> In e.g. intel_unlock_cpuid_leaves() and early_init_intel() and in particular >> also in boot/head.S we access this MSR without recovery attached. Is there a >> reason rdmsr_safe() needs using here? > > Abundance of caution. cpufreq/hwp.c uses a safe accessor. Perhaps it (and other places) shouldn't? > Given the regular NMI path works, I doubt we need the _safe() here. > > As future work, it's accessed loads of times, so I'm highly tempted to > have the BSP sanitise it once, and have the APs copy the "global" value. > >> >>> + !(val & MSR_IA32_MISC_ENABLE_PERF_AVAIL) ) >>> + return; >>> + >>> + eax = cpuid_eax(10); >>> + ver = eax & 0xff; >>> + nr_cnt = (eax >> 8) & 0xff; >>> + >>> + if ( ver && nr_cnt > 1 && nr_cnt <= 32 ) >>> + { >>> + unsigned int cnt_mask = (1UL << nr_cnt) - 1; >>> + >>> + /* >>> + * On (some?) Sapphire/Emerald Rapids platforms each package-BSP >>> + * starts with all the enable bits for the general-purpose PMCs >>> + * cleared. Adjust so counters can be enabled from EVNTSEL. >>> + */ >>> + rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, val); >>> + >>> + if ( (val & cnt_mask) != cnt_mask ) >>> + { >>> + printk("FIRMWARE BUG: CPU%u invalid PERF_GLOBAL_CTRL: %#"PRIx64" adjusting to %#"PRIx64"\n", >>> + smp_processor_id(), val, val | cnt_mask); >>> + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, val | cnt_mask); >>> + } >>> + } >>> + >>> + __set_bit(X86_FEATURE_ARCH_PERFMON, c->x86_capability); >> This moved, without the description suggesting the move is intentional. >> It did live at the end of the earlier scope before, ... > > Final paragraph of the commit message? > > If PERF_AVAIL is clear, we don't have ARCH_PERFMON, whatever the CPUID > leaves say. Hmm, the final paragraph in the posting that I got in my inbox was: "This also limits setting X86_FEATURE_ARCH_PERFMON, although oprofile (the only consumer of this flag) cross-checks too." Which says quite the opposite: You now set the bit in more cases, i.e. when nr_cnt is out of range or when ver is zero. > OTOH, this bit really doesn't serve much value. Given oprofile > cross-checks everything anyway, I think it can be dropped. Hmm, it's not entirely straightforward to see that all uses of cpu_has_arch_perfmon can really be done away with. Jan
On 21/01/2025 3:58 pm, Jan Beulich wrote: > On 21.01.2025 16:23, Andrew Cooper wrote: >> On 21/01/2025 3:03 pm, Jan Beulich wrote: >>> On 21.01.2025 15:25, Andrew Cooper wrote: >>>> + !(val & MSR_IA32_MISC_ENABLE_PERF_AVAIL) ) >>>> + return; >>>> + >>>> + eax = cpuid_eax(10); >>>> + ver = eax & 0xff; >>>> + nr_cnt = (eax >> 8) & 0xff; >>>> + >>>> + if ( ver && nr_cnt > 1 && nr_cnt <= 32 ) >>>> + { >>>> + unsigned int cnt_mask = (1UL << nr_cnt) - 1; >>>> + >>>> + /* >>>> + * On (some?) Sapphire/Emerald Rapids platforms each package-BSP >>>> + * starts with all the enable bits for the general-purpose PMCs >>>> + * cleared. Adjust so counters can be enabled from EVNTSEL. >>>> + */ >>>> + rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, val); >>>> + >>>> + if ( (val & cnt_mask) != cnt_mask ) >>>> + { >>>> + printk("FIRMWARE BUG: CPU%u invalid PERF_GLOBAL_CTRL: %#"PRIx64" adjusting to %#"PRIx64"\n", >>>> + smp_processor_id(), val, val | cnt_mask); >>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, val | cnt_mask); >>>> + } >>>> + } >>>> + >>>> + __set_bit(X86_FEATURE_ARCH_PERFMON, c->x86_capability); >>> This moved, without the description suggesting the move is intentional. >>> It did live at the end of the earlier scope before, ... >> Final paragraph of the commit message? >> >> If PERF_AVAIL is clear, we don't have ARCH_PERFMON, whatever the CPUID >> leaves say. > Hmm, the final paragraph in the posting that I got in my inbox was: > > "This also limits setting X86_FEATURE_ARCH_PERFMON, although oprofile (the only > consumer of this flag) cross-checks too." > > Which says quite the opposite: You now set the bit in more cases, i.e. > when nr_cnt is out of range or when ver is zero. Oh, right. Yeah, that was unintentional. I'll adjust. ~Andrew
On Tue, Jan 21, 2025 at 02:25:10PM +0000, Andrew Cooper wrote: > Logic using performance counters needs to look at > MSR_MISC_ENABLE.PERF_AVAILABLE before touching any other resources. > > When virtualised under ESX, Xen dies with a #GP fault trying to read > MSR_CORE_PERF_GLOBAL_CTRL. > > Factor this logic out into a separate function (it's already too squashed to > the RHS), and insert a check of MSR_MISC_ENABLE.PERF_AVAILABLE. > > This also limits setting X86_FEATURE_ARCH_PERFMON, although oprofile (the only > consumer of this flag) cross-checks too. > > Reported-by: Jonathan Katz <jonathan.katz@aptar.com> > Link: https://xcp-ng.org/forum/topic/10286/nesting-xcp-ng-on-esx-8 > Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> > --- > CC: Jan Beulich <JBeulich@suse.com> > CC: Roger Pau Monné <roger.pau@citrix.com> > CC: Oleksii Kurochko <oleksii.kurochko@gmail.com> > > Untested, but this is the same pattern used by oprofile and watchdog setup. > > I've intentionally stopped using Intel style. This file is already mixed (as > visible even in context), and it doesn't remotely resemble it's Linux origin > any more. > > For 4.20. This regressions has already been backported. > --- > xen/arch/x86/cpu/intel.c | 64 +++++++++++++++++++++++----------------- > 1 file changed, 37 insertions(+), 27 deletions(-) > > diff --git a/xen/arch/x86/cpu/intel.c b/xen/arch/x86/cpu/intel.c > index 6a7347968ba2..586ae84d806d 100644 > --- a/xen/arch/x86/cpu/intel.c > +++ b/xen/arch/x86/cpu/intel.c > @@ -535,39 +535,49 @@ static void intel_log_freq(const struct cpuinfo_x86 *c) > printk("%u MHz\n", (factor * max_ratio + 50) / 100); > } > > +static void init_intel_perf(struct cpuinfo_x86 *c) > +{ > + uint64_t val; > + unsigned int eax, ver, nr_cnt; > + > + if ( c->cpuid_level <= 9 || > + rdmsr_safe(MSR_IA32_MISC_ENABLE, val) || > + !(val & MSR_IA32_MISC_ENABLE_PERF_AVAIL) ) > + return; > + > + eax = cpuid_eax(10); > + ver = eax & 0xff; > + nr_cnt = (eax >> 8) & 0xff; > + > + if ( ver && nr_cnt > 1 && nr_cnt <= 32 ) > + { > + unsigned int cnt_mask = (1UL << nr_cnt) - 1; > + > + /* > + * On (some?) Sapphire/Emerald Rapids platforms each package-BSP > + * starts with all the enable bits for the general-purpose PMCs > + * cleared. Adjust so counters can be enabled from EVNTSEL. > + */ > + rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, val); > + > + if ( (val & cnt_mask) != cnt_mask ) > + { > + printk("FIRMWARE BUG: CPU%u invalid PERF_GLOBAL_CTRL: %#"PRIx64" adjusting to %#"PRIx64"\n", > + smp_processor_id(), val, val | cnt_mask); > + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, val | cnt_mask); > + } > + } > + > + __set_bit(X86_FEATURE_ARCH_PERFMON, c->x86_capability); With this chunk moved back inside the if scope, and the Fixes tag added: Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Thanks, Roger.
On 1/21/25 3:25 PM, Andrew Cooper wrote: > Logic using performance counters needs to look at > MSR_MISC_ENABLE.PERF_AVAILABLE before touching any other resources. > > When virtualised under ESX, Xen dies with a #GP fault trying to read > MSR_CORE_PERF_GLOBAL_CTRL. > > Factor this logic out into a separate function (it's already too squashed to > the RHS), and insert a check of MSR_MISC_ENABLE.PERF_AVAILABLE. > > This also limits setting X86_FEATURE_ARCH_PERFMON, although oprofile (the only > consumer of this flag) cross-checks too. > > Reported-by: Jonathan Katz <jonathan.katz@aptar.com> > Link: https://xcp-ng.org/forum/topic/10286/nesting-xcp-ng-on-esx-8 > Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> > --- > CC: Jan Beulich <JBeulich@suse.com> > CC: Roger Pau Monné <roger.pau@citrix.com> > CC: Oleksii Kurochko <oleksii.kurochko@gmail.com> > > Untested, but this is the same pattern used by oprofile and watchdog setup. Probably it will make sense to wait for a response on the forum (you mentioned in the Link:) that the current one patch works? ~ Oleksii > > I've intentionally stopped using Intel style. This file is already mixed (as > visible even in context), and it doesn't remotely resemble it's Linux origin > any more. > > For 4.20. This regressions has already been backported. > --- > xen/arch/x86/cpu/intel.c | 64 +++++++++++++++++++++++----------------- > 1 file changed, 37 insertions(+), 27 deletions(-) > > diff --git a/xen/arch/x86/cpu/intel.c b/xen/arch/x86/cpu/intel.c > index 6a7347968ba2..586ae84d806d 100644 > --- a/xen/arch/x86/cpu/intel.c > +++ b/xen/arch/x86/cpu/intel.c > @@ -535,39 +535,49 @@ static void intel_log_freq(const struct cpuinfo_x86 *c) > printk("%u MHz\n", (factor * max_ratio + 50) / 100); > } > > +static void init_intel_perf(struct cpuinfo_x86 *c) > +{ > + uint64_t val; > + unsigned int eax, ver, nr_cnt; > + > + if ( c->cpuid_level <= 9 || > + rdmsr_safe(MSR_IA32_MISC_ENABLE, val) || > + !(val & MSR_IA32_MISC_ENABLE_PERF_AVAIL) ) > + return; > + > + eax = cpuid_eax(10); > + ver = eax & 0xff; > + nr_cnt = (eax >> 8) & 0xff; > + > + if ( ver && nr_cnt > 1 && nr_cnt <= 32 ) > + { > + unsigned int cnt_mask = (1UL << nr_cnt) - 1; > + > + /* > + * On (some?) Sapphire/Emerald Rapids platforms each package-BSP > + * starts with all the enable bits for the general-purpose PMCs > + * cleared. Adjust so counters can be enabled from EVNTSEL. > + */ > + rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, val); > + > + if ( (val & cnt_mask) != cnt_mask ) > + { > + printk("FIRMWARE BUG: CPU%u invalid PERF_GLOBAL_CTRL: %#"PRIx64" adjusting to %#"PRIx64"\n", > + smp_processor_id(), val, val | cnt_mask); > + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, val | cnt_mask); > + } > + } > + > + __set_bit(X86_FEATURE_ARCH_PERFMON, c->x86_capability); > +} > + > static void cf_check init_intel(struct cpuinfo_x86 *c) > { > /* Detect the extended topology information if available */ > detect_extended_topology(c); > > init_intel_cacheinfo(c); > - if (c->cpuid_level > 9) { > - unsigned eax = cpuid_eax(10); > - unsigned int cnt = (eax >> 8) & 0xff; > - > - /* Check for version and the number of counters */ > - if ((eax & 0xff) && (cnt > 1) && (cnt <= 32)) { > - uint64_t global_ctrl; > - unsigned int cnt_mask = (1UL << cnt) - 1; > - > - /* > - * On (some?) Sapphire/Emerald Rapids platforms each > - * package-BSP starts with all the enable bits for the > - * general-purpose PMCs cleared. Adjust so counters > - * can be enabled from EVNTSEL. > - */ > - rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, global_ctrl); > - if ((global_ctrl & cnt_mask) != cnt_mask) { > - printk("CPU%u: invalid PERF_GLOBAL_CTRL: %#" > - PRIx64 " adjusting to %#" PRIx64 "\n", > - smp_processor_id(), global_ctrl, > - global_ctrl | cnt_mask); > - wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, > - global_ctrl | cnt_mask); > - } > - __set_bit(X86_FEATURE_ARCH_PERFMON, c->x86_capability); > - } > - } > + init_intel_perf(c); > > if ( !cpu_has(c, X86_FEATURE_XTOPOLOGY) ) > { > > base-commit: c3f5d1bb40b57d467cb4051eafa86f5933ec9003
diff --git a/xen/arch/x86/cpu/intel.c b/xen/arch/x86/cpu/intel.c index 6a7347968ba2..586ae84d806d 100644 --- a/xen/arch/x86/cpu/intel.c +++ b/xen/arch/x86/cpu/intel.c @@ -535,39 +535,49 @@ static void intel_log_freq(const struct cpuinfo_x86 *c) printk("%u MHz\n", (factor * max_ratio + 50) / 100); } +static void init_intel_perf(struct cpuinfo_x86 *c) +{ + uint64_t val; + unsigned int eax, ver, nr_cnt; + + if ( c->cpuid_level <= 9 || + rdmsr_safe(MSR_IA32_MISC_ENABLE, val) || + !(val & MSR_IA32_MISC_ENABLE_PERF_AVAIL) ) + return; + + eax = cpuid_eax(10); + ver = eax & 0xff; + nr_cnt = (eax >> 8) & 0xff; + + if ( ver && nr_cnt > 1 && nr_cnt <= 32 ) + { + unsigned int cnt_mask = (1UL << nr_cnt) - 1; + + /* + * On (some?) Sapphire/Emerald Rapids platforms each package-BSP + * starts with all the enable bits for the general-purpose PMCs + * cleared. Adjust so counters can be enabled from EVNTSEL. + */ + rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, val); + + if ( (val & cnt_mask) != cnt_mask ) + { + printk("FIRMWARE BUG: CPU%u invalid PERF_GLOBAL_CTRL: %#"PRIx64" adjusting to %#"PRIx64"\n", + smp_processor_id(), val, val | cnt_mask); + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, val | cnt_mask); + } + } + + __set_bit(X86_FEATURE_ARCH_PERFMON, c->x86_capability); +} + static void cf_check init_intel(struct cpuinfo_x86 *c) { /* Detect the extended topology information if available */ detect_extended_topology(c); init_intel_cacheinfo(c); - if (c->cpuid_level > 9) { - unsigned eax = cpuid_eax(10); - unsigned int cnt = (eax >> 8) & 0xff; - - /* Check for version and the number of counters */ - if ((eax & 0xff) && (cnt > 1) && (cnt <= 32)) { - uint64_t global_ctrl; - unsigned int cnt_mask = (1UL << cnt) - 1; - - /* - * On (some?) Sapphire/Emerald Rapids platforms each - * package-BSP starts with all the enable bits for the - * general-purpose PMCs cleared. Adjust so counters - * can be enabled from EVNTSEL. - */ - rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, global_ctrl); - if ((global_ctrl & cnt_mask) != cnt_mask) { - printk("CPU%u: invalid PERF_GLOBAL_CTRL: %#" - PRIx64 " adjusting to %#" PRIx64 "\n", - smp_processor_id(), global_ctrl, - global_ctrl | cnt_mask); - wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, - global_ctrl | cnt_mask); - } - __set_bit(X86_FEATURE_ARCH_PERFMON, c->x86_capability); - } - } + init_intel_perf(c); if ( !cpu_has(c, X86_FEATURE_XTOPOLOGY) ) {
Logic using performance counters needs to look at MSR_MISC_ENABLE.PERF_AVAILABLE before touching any other resources. When virtualised under ESX, Xen dies with a #GP fault trying to read MSR_CORE_PERF_GLOBAL_CTRL. Factor this logic out into a separate function (it's already too squashed to the RHS), and insert a check of MSR_MISC_ENABLE.PERF_AVAILABLE. This also limits setting X86_FEATURE_ARCH_PERFMON, although oprofile (the only consumer of this flag) cross-checks too. Reported-by: Jonathan Katz <jonathan.katz@aptar.com> Link: https://xcp-ng.org/forum/topic/10286/nesting-xcp-ng-on-esx-8 Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> --- CC: Jan Beulich <JBeulich@suse.com> CC: Roger Pau Monné <roger.pau@citrix.com> CC: Oleksii Kurochko <oleksii.kurochko@gmail.com> Untested, but this is the same pattern used by oprofile and watchdog setup. I've intentionally stopped using Intel style. This file is already mixed (as visible even in context), and it doesn't remotely resemble it's Linux origin any more. For 4.20. This regressions has already been backported. --- xen/arch/x86/cpu/intel.c | 64 +++++++++++++++++++++++----------------- 1 file changed, 37 insertions(+), 27 deletions(-) base-commit: c3f5d1bb40b57d467cb4051eafa86f5933ec9003