
[RFC,1/1] perf stat: do not fatal if the leader is errored

Message ID 20220922071017.17398-1-dongli.zhang@oracle.com
State New, archived
Series [RFC,1/1] perf stat: do not fatal if the leader is errored

Commit Message

Dongli Zhang Sept. 22, 2022, 7:10 a.m. UTC
Add kvm@vger.kernel.org since this issue occurs in a virtualization
environment.

The topdown metrics events have been among the default events since
commit 42641d6f4d15 ("perf stat: Add Topdown metrics events as default
events"). perf will use 'slots' if
/sys/bus/event_source/devices/cpu/events/slots is available.

Unfortunately, the 'slots' event may not be supported in a virtualization
environment: the hypervisor may not expose the 'slots' counter to the VM
in CPUID. As a result, the kernel disables the topdown slots and metrics
events in intel_pmu_init() if the slots event is not in CPUID, e.g., both
c->weight and c->idxmsk64 are set to 0.
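
For context, the constraint fixup in intel_pmu_init() looks roughly like
the sketch below (a paraphrase for illustration, not a verbatim excerpt of
arch/x86/events/intel/core.c):

	for_each_event_constraint(c, x86_pmu.event_constraints) {
		if (c->idxmsk64 & INTEL_PMC_MSK_TOPDOWN) {
			/*
			 * If the SLOTS fixed counter is not enumerated in
			 * CPUID, clear the topdown constraint: an empty
			 * index mask (and weight 0) means the event can
			 * never be scheduled, so opening it later fails
			 * with EINVAL.
			 */
			if (!(INTEL_PMC_MSK_FIXED_SLOTS & x86_pmu.intel_ctrl))
				c->idxmsk64 = 0;
			c->weight = hweight64(c->idxmsk64);
			continue;
		}
		/* ... */
	}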

This results in the error below on an Icelake VM, since 'slots' is the
group leader:

$ perf stat
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (slots).
/bin/dmesg | grep -i perf may provide additional information.

This is because stat_handle_error() returns COUNTER_FATAL when the failed
'slots' event is the leader of the event group.
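
For reference, the pre-patch logic in stat_handle_error() is roughly the
following (condensed paraphrase with added comments; the exact lines are
in the diff at the end of this mail):

	/* sys_perf_event_open() failed with EINVAL/ENOENT/EOPNOTSUPP/... */
	counter->supported = false;
	counter->errored = true;

	/*
	 * Only a non-leader event, or the leader of a single-member
	 * group, is skipped.  A failed leader of a real group -- such as
	 * 'slots' leading the topdown events -- falls through, and the
	 * function eventually returns COUNTER_FATAL, aborting perf stat.
	 */
	if ((evsel__leader(counter) != counter) ||
	    !(counter->core.leader->nr_members > 1))
		return COUNTER_SKIP;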

There are three options to fix the issue.

1. Do not expose /sys/bus/event_source/devices/cpu/events/slots to
userspace so that pmu_have_event(pmu->name, "slots") returns false.

2. Run CPUID in perf userspace and avoid using 'slots' if it is not
supported in CPUID (a minimal sketch follows below).

3. Do not make perf fatal if the leader fails. Do not create events for an
evsel if its leader has already failed.

This RFC patch implements the 3rd option. Would you mind suggesting which
option is better?
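
As an illustration of option 2, a minimal userspace sketch (assuming
GCC/Clang on x86; the helper below is hypothetical and not part of this
patch) could check CPUID leaf 0xA for the fixed-function SLOTS counter
before choosing the topdown default events:

	/* Illustrative sketch only (option 2): check whether the SLOTS
	 * fixed counter is enumerated in CPUID leaf 0xA.
	 */
	#include <stdio.h>
	#include <cpuid.h>

	static int slots_counter_in_cpuid(void)
	{
		unsigned int eax, ebx, ecx, edx;

		if (!__get_cpuid_count(0xa, 0, &eax, &ebx, &ecx, &edx))
			return 0;		/* leaf 0xA not available */
		if ((eax & 0xff) == 0)
			return 0;		/* no architectural PMU */

		/* SDM: FxCtr3 (TOPDOWN.SLOTS) is supported if ECX[3] is set
		 * or EDX[4:0] (number of fixed counters) > 3.
		 */
		return (ecx & (1u << 3)) || ((edx & 0x1f) > 3);
	}

	int main(void)
	{
		printf("slots fixed counter %s enumerated in CPUID\n",
		       slots_counter_in_cpuid() ? "is" : "is not");
		return 0;
	}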

Here is the output with the patch applied.

$ perf stat -v
Using CPUID GenuineIntel-6-6A-6
slots -> cpu/event=0,umask=0x4/
topdown-retiring -> cpu/event=0,umask=0x80/
topdown-bad-spec -> cpu/event=0,umask=0x81/
topdown-fe-bound -> cpu/event=0,umask=0x82/
topdown-be-bound -> cpu/event=0,umask=0x83/
Control descriptor is not initialized
Warning:
slots event is not supported by the kernel.
^Ccpu-clock: 62021481051 62021480237 62021480237
context-switches: 437 62021478064 62021478064
cpu-migrations: 17 62021475294 62021475294
page-faults: 12 62021471925 62021471925
cycles: 15662273 62020909141 62020909141
instructions: 6580385 62008944246 62008944246
branches: 1446119 62008855550 62008855550
branch-misses: 30970 62008643255 62008643255
failed to read counter slots
failed to read counter topdown-retiring
failed to read counter topdown-bad-spec
failed to read counter topdown-fe-bound
failed to read counter topdown-be-bound

 Performance counter stats for 'system wide':

         62,021.48 msec cpu-clock                        #   16.006 CPUs utilized          
               437      context-switches                 #    7.046 /sec                   
                17      cpu-migrations                   #    0.274 /sec                   
                12      page-faults                      #    0.193 /sec                   
        15,662,273      cycles                           #    0.000 GHz                    
         6,580,385      instructions                     #    0.42  insn per cycle         
         1,446,119      branches                         #   23.316 K/sec                  
            30,970      branch-misses                    #    2.14% of all branches        
   <not supported>      slots                                                       
   <not supported>      topdown-retiring                                            
   <not supported>      topdown-bad-spec                                            
   <not supported>      topdown-fe-bound                                            
   <not supported>      topdown-be-bound                                            

       3.874991326 seconds time elapsed

Thank you very much!

Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 tools/perf/builtin-stat.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Comments

Like Xu Sept. 22, 2022, 8:07 a.m. UTC | #1
On 22/9/2022 3:10 pm, Dongli Zhang wrote:
> There are three options to fix the issue.
> 
> 1. Do not expose /sys/bus/event_source/devices/cpu/events/slots to
> userspace so that pmu_have_event(pmu->name, "slots") returns false.

IMO, the guest PMU driver should be fixed
since it misrepresents emulated hardware capabilities in terms of slots.

> 
> 2. Run cpuid at perf userspace and avoid using 'slots' if it is not
> supported in cpuid.
> 
> 3. Do not fatal perf if the leader is failed. Do not create events for an
> evsel if its leader is already failed.

We may also need this since it's easier and more agile to update the perf tool
than the kernel code or KVM emulated capabilities.

> 
> This RFC patch is with the 3rd option. Would you mind suggesting which
> option is better?
Liang, Kan Sept. 22, 2022, 1:34 p.m. UTC | #2
On 2022-09-22 4:07 a.m., Like Xu wrote:
> On 22/9/2022 3:10 pm, Dongli Zhang wrote:
>> There are three options to fix the issue.
>>
>> 1. Do not expose /sys/bus/event_source/devices/cpu/events/slots to
>> userspace so that pmu_have_event(pmu->name, "slots") returns false.
> 
> IMO, the guest PMU driver should be fixed
> since it misrepresents emulated hardware capabilities in terms of slots.

Yes, we need to fix the kernel to hide the slots event if it's not
available.

The patch as below should fix it. (Not tested yet)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index b20e646c8205..27ee43faba32 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -5565,6 +5565,19 @@ static struct attribute *intel_pmu_attrs[] = {
 	NULL,
 };

+static umode_t
+td_is_visible(struct kobject *kobj, struct attribute *attr, int i)
+{
+	/*
+	 * Hide the perf metrics topdown events
+	 * if the slots is not in CPUID.
+	 */
+	if (x86_pmu.num_topdown_events)
+		return (x86_pmu.intel_ctrl & INTEL_PMC_MSK_FIXED_SLOTS) ? attr->mode : 0;
+
+	return attr->mode;
+}
+
 static umode_t
 tsx_is_visible(struct kobject *kobj, struct attribute *attr, int i)
 {
@@ -5600,6 +5613,7 @@ default_is_visible(struct kobject *kobj, struct attribute *attr, int i)

 static struct attribute_group group_events_td  = {
 	.name = "events",
+	.is_visible = td_is_visible,
 };

 static struct attribute_group group_events_mem = {
@@ -5758,6 +5772,23 @@ static inline int hybrid_find_supported_cpu(struct x86_hybrid_pmu *pmu)
 	return (cpu >= nr_cpu_ids) ? -1 : cpu;
 }

+static umode_t hybrid_td_is_visible(struct kobject *kobj,
+					struct attribute *attr, int i)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct x86_hybrid_pmu *pmu =
+		 container_of(dev_get_drvdata(dev), struct x86_hybrid_pmu, pmu);
+
+	if (!is_attr_for_this_pmu(kobj, attr))
+		return 0;
+
+	/* Only check the big core which supports perf metrics */
+	if (pmu->cpu_type == hybrid_big)
+		return (pmu->intel_ctrl & INTEL_PMC_MSK_FIXED_SLOTS) ? attr->mode : 0;
+
+	return attr->mode;
+}
+
 static umode_t hybrid_tsx_is_visible(struct kobject *kobj,
 				     struct attribute *attr, int i)
 {
@@ -5784,7 +5815,7 @@ static umode_t hybrid_format_is_visible(struct kobject *kobj,

 static struct attribute_group hybrid_group_events_td  = {
 	.name		= "events",
-	.is_visible	= hybrid_events_is_visible,
+	.is_visible	= hybrid_td_is_visible,
 };

 static struct attribute_group hybrid_group_events_mem = {


Thanks,
Kan
Dongli Zhang Sept. 22, 2022, 6 p.m. UTC | #3
Hi Kan,

I have tested that the patch works by hiding 'slots' sysfs entries.

# ll /sys/bus/event_source/devices/cpu/events/
total 0
-r--r--r--. 1 root root 4096 Sep 22 17:57 branch-instructions
-r--r--r--. 1 root root 4096 Sep 22 17:57 branch-misses
-r--r--r--. 1 root root 4096 Sep 22 17:57 bus-cycles
-r--r--r--. 1 root root 4096 Sep 22 17:57 cache-misses
-r--r--r--. 1 root root 4096 Sep 22 17:57 cache-references
-r--r--r--. 1 root root 4096 Sep 22 17:57 cpu-cycles
-r--r--r--. 1 root root 4096 Sep 22 17:57 instructions
-r--r--r--. 1 root root 4096 Sep 22 17:57 ref-cycles

# perf stat
^C
 Performance counter stats for 'system wide':

         19,256.20 msec cpu-clock                        #   16.025 CPUs utilized
               179      context-switches                 #    9.296 /sec
                17      cpu-migrations                   #    0.883 /sec
                 3      page-faults                      #    0.156 /sec
         7,502,294      cycles                           #    0.000 GHz
         2,512,587      instructions                     #    0.33  insn per cycle
           552,989      branches                         #   28.717 K/sec
            15,999      branch-misses                    #    2.89% of all branches

       1.201628129 seconds time elapsed


Would you send this patch to fix it on the kernel side?

Thank you very much!

Dongli Zhang

On 9/22/22 6:34 AM, Liang, Kan wrote:
> 
> 
> On 2022-09-22 4:07 a.m., Like Xu wrote:
>> On 22/9/2022 3:10 pm, Dongli Zhang wrote:
>>> There are three options to fix the issue.
>>>
>>> 1. Do not expose /sys/bus/event_source/devices/cpu/events/slots to
>>> userspace so that pmu_have_event(pmu->name, "slots") returns false.
>>
>> IMO, the guest PMU driver should be fixed
>> since it misrepresents emulated hardware capabilities in terms of slots.
> 
> Yes, we need to fix the kernel to hide the slots event if it's not
> available.
> 
> The patch as below should fix it. (Not tested yet)
> 
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index b20e646c8205..27ee43faba32 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -5565,6 +5565,19 @@ static struct attribute *intel_pmu_attrs[] = {
>  	NULL,
>  };
> 
> +static umode_t
> +td_is_visible(struct kobject *kobj, struct attribute *attr, int i)
> +{
> +	/*
> +	 * Hide the perf metrics topdown events
> +	 * if the slots is not in CPUID.
> +	 */
> +	if (x86_pmu.num_topdown_events)
> +		return (x86_pmu.intel_ctrl & INTEL_PMC_MSK_FIXED_SLOTS) ? attr->mode : 0;
> +
> +	return attr->mode;
> +}
> +
>  static umode_t
>  tsx_is_visible(struct kobject *kobj, struct attribute *attr, int i)
>  {
> @@ -5600,6 +5613,7 @@ default_is_visible(struct kobject *kobj, struct
> attribute *attr, int i)
> 
>  static struct attribute_group group_events_td  = {
>  	.name = "events",
> +	.is_visible = td_is_visible,
>  };
> 
>  static struct attribute_group group_events_mem = {
> @@ -5758,6 +5772,23 @@ static inline int
> hybrid_find_supported_cpu(struct x86_hybrid_pmu *pmu)
>  	return (cpu >= nr_cpu_ids) ? -1 : cpu;
>  }
> 
> +static umode_t hybrid_td_is_visible(struct kobject *kobj,
> +					struct attribute *attr, int i)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct x86_hybrid_pmu *pmu =
> +		 container_of(dev_get_drvdata(dev), struct x86_hybrid_pmu, pmu);
> +
> +	if (!is_attr_for_this_pmu(kobj, attr))
> +		return 0;
> +
> +	/* Only check the big core which supports perf metrics */
> +	if (pmu->cpu_type == hybrid_big)
> +		return (pmu->intel_ctrl & INTEL_PMC_MSK_FIXED_SLOTS) ? attr->mode : 0;
> +
> +	return attr->mode;
> +}
> +
>  static umode_t hybrid_tsx_is_visible(struct kobject *kobj,
>  				     struct attribute *attr, int i)
>  {
> @@ -5784,7 +5815,7 @@ static umode_t hybrid_format_is_visible(struct
> kobject *kobj,
> 
>  static struct attribute_group hybrid_group_events_td  = {
>  	.name		= "events",
> -	.is_visible	= hybrid_events_is_visible,
> +	.is_visible	= hybrid_td_is_visible,
>  };
> 
>  static struct attribute_group hybrid_group_events_mem = {
> 
> 
> Thanks,
> Kan
>
Dongli Zhang Sept. 22, 2022, 6:10 p.m. UTC | #4
On 9/22/22 1:07 AM, Like Xu wrote:
> On 22/9/2022 3:10 pm, Dongli Zhang wrote:
>> There are three options to fix the issue.
>>
>> 1. Do not expose /sys/bus/event_source/devices/cpu/events/slots to
>> userspace so that pmu_have_event(pmu->name, "slots") returns false.
> 
> IMO, the guest PMU driver should be fixed
> since it misrepresents emulated hardware capabilities in terms of slots.
> 
>>
>> 2. Run cpuid at perf userspace and avoid using 'slots' if it is not
>> supported in cpuid.
>>
>> 3. Do not fatal perf if the leader is failed. Do not create events for an
>> evsel if its leader is already failed.
> 
> We may also need this since it's easier and more agile to update the perf tool
> than the kernel code or KVM emulated capabilities.

Thank you very much for the suggestion. I may remove the RFC and re-send the
patch for perf userspace.

BTW, I previously tried your prior patch on the KVM side (suggested by Bert),
but it does not help in this case: the VM kernel is already aware that 'slots'
is not supported by the hypervisor.

https://lore.kernel.org/lkml/20220106032118.34459-1-likexu@tencent.com/T/

Dongli Zhang

> 
>>
>> This RFC patch is with the 3rd option. Would you mind suggesting which
>> option is better?
Liang, Kan Sept. 22, 2022, 6:42 p.m. UTC | #5
On 2022-09-22 2:00 p.m., Dongli Zhang wrote:
> Hi Kan,
> 
> I have tested that the patch works by hiding 'slots' sysfs entries.
> 
> # ll /sys/bus/event_source/devices/cpu/events/
> total 0
> -r--r--r--. 1 root root 4096 Sep 22 17:57 branch-instructions
> -r--r--r--. 1 root root 4096 Sep 22 17:57 branch-misses
> -r--r--r--. 1 root root 4096 Sep 22 17:57 bus-cycles
> -r--r--r--. 1 root root 4096 Sep 22 17:57 cache-misses
> -r--r--r--. 1 root root 4096 Sep 22 17:57 cache-references
> -r--r--r--. 1 root root 4096 Sep 22 17:57 cpu-cycles
> -r--r--r--. 1 root root 4096 Sep 22 17:57 instructions
> -r--r--r--. 1 root root 4096 Sep 22 17:57 ref-cycles
> 
> # perf stat
> ^C
>  Performance counter stats for 'system wide':
> 
>          19,256.20 msec cpu-clock                        #   16.025 CPUs
> utilized
>                179      context-switches                 #    9.296 /sec
> 
>                 17      cpu-migrations                   #    0.883 /sec
> 
>                  3      page-faults                      #    0.156 /sec
> 
>          7,502,294      cycles                           #    0.000 GHz
> 
>          2,512,587      instructions                     #    0.33  insn per
> cycle
>            552,989      branches                         #   28.717 K/sec
> 
>             15,999      branch-misses                    #    2.89% of all
> branches
> 
>        1.201628129 seconds time elapsed
> 
> 
> Would you send this patch to fix at the kernel side?

Yes, I will send it out shortly.


Thanks,
Kan
> 
> Thank you very much!
> 
> Dongli Zhang
> 
> On 9/22/22 6:34 AM, Liang, Kan wrote:
>>
>>
>> On 2022-09-22 4:07 a.m., Like Xu wrote:
>>> On 22/9/2022 3:10 pm, Dongli Zhang wrote:
>>>> There are three options to fix the issue.
>>>>
>>>> 1. Do not expose /sys/bus/event_source/devices/cpu/events/slots to
>>>> userspace so that pmu_have_event(pmu->name, "slots") returns false.
>>>
>>> IMO, the guest PMU driver should be fixed
>>> since it misrepresents emulated hardware capabilities in terms of slots.
>>
>> Yes, we need to fix the kernel to hide the slots event if it's not
>> available.
>>
>> The patch as below should fix it. (Not tested yet)
>>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index b20e646c8205..27ee43faba32 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -5565,6 +5565,19 @@ static struct attribute *intel_pmu_attrs[] = {
>>  	NULL,
>>  };
>>
>> +static umode_t
>> +td_is_visible(struct kobject *kobj, struct attribute *attr, int i)
>> +{
>> +	/*
>> +	 * Hide the perf metrics topdown events
>> +	 * if the slots is not in CPUID.
>> +	 */
>> +	if (x86_pmu.num_topdown_events)
>> +		return (x86_pmu.intel_ctrl & INTEL_PMC_MSK_FIXED_SLOTS) ? attr->mode : 0;
>> +
>> +	return attr->mode;
>> +}
>> +
>>  static umode_t
>>  tsx_is_visible(struct kobject *kobj, struct attribute *attr, int i)
>>  {
>> @@ -5600,6 +5613,7 @@ default_is_visible(struct kobject *kobj, struct
>> attribute *attr, int i)
>>
>>  static struct attribute_group group_events_td  = {
>>  	.name = "events",
>> +	.is_visible = td_is_visible,
>>  };
>>
>>  static struct attribute_group group_events_mem = {
>> @@ -5758,6 +5772,23 @@ static inline int
>> hybrid_find_supported_cpu(struct x86_hybrid_pmu *pmu)
>>  	return (cpu >= nr_cpu_ids) ? -1 : cpu;
>>  }
>>
>> +static umode_t hybrid_td_is_visible(struct kobject *kobj,
>> +					struct attribute *attr, int i)
>> +{
>> +	struct device *dev = kobj_to_dev(kobj);
>> +	struct x86_hybrid_pmu *pmu =
>> +		 container_of(dev_get_drvdata(dev), struct x86_hybrid_pmu, pmu);
>> +
>> +	if (!is_attr_for_this_pmu(kobj, attr))
>> +		return 0;
>> +
>> +	/* Only check the big core which supports perf metrics */
>> +	if (pmu->cpu_type == hybrid_big)
>> +		return (pmu->intel_ctrl & INTEL_PMC_MSK_FIXED_SLOTS) ? attr->mode : 0;
>> +
>> +	return attr->mode;
>> +}
>> +
>>  static umode_t hybrid_tsx_is_visible(struct kobject *kobj,
>>  				     struct attribute *attr, int i)
>>  {
>> @@ -5784,7 +5815,7 @@ static umode_t hybrid_format_is_visible(struct
>> kobject *kobj,
>>
>>  static struct attribute_group hybrid_group_events_td  = {
>>  	.name		= "events",
>> -	.is_visible	= hybrid_events_is_visible,
>> +	.is_visible	= hybrid_td_is_visible,
>>  };
>>
>>  static struct attribute_group hybrid_group_events_mem = {
>>
>>
>> Thanks,
>> Kan
>>
Namhyung Kim Oct. 14, 2022, 10:16 p.m. UTC | #6
Hello,

On Thu, Sep 22, 2022 at 12:10 AM Dongli Zhang <dongli.zhang@oracle.com> wrote:
>
> Add kvm@vger.kernel.org as this issue is in virtualization env.
>
> The topdown metrics events became default since
> commit 42641d6f4d15 ("perf stat: Add Topdown metrics events as default
> events"). The perf will use 'slots' if the
> /sys/bus/event_source/devices/cpu/events/slots is available.
>
> Unfortunately, the 'slots' may not be supported in the virtualization
> environment. The hypervisor may not expose the 'slots' counter to the VM
> in cpuid. As a result, the kernel may disable topdown slots and metrics
> events in intel_pmu_init() if slots event is not in CPUID. E.g., both
> c->weight and c->idxmsk64 are set to 0.
>
> There will be below error on Icelake VM since 'slots' is the leader:
>
> $ perf stat
> Error:
> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (slots).
> /bin/dmesg | grep -i perf may provide additional information.
>
> This is because the stat_handle_error() returns COUNTER_FATAL when the
> 'slots' is used as leader of events.
>
> There are three options to fix the issue.
>
> 1. Do not expose /sys/bus/event_source/devices/cpu/events/slots to
> userspace so that pmu_have_event(pmu->name, "slots") returns false.
>
> 2. Run cpuid at perf userspace and avoid using 'slots' if it is not
> supported in cpuid.
>
> 3. Do not fatal perf if the leader is failed. Do not create events for an
> evsel if its leader is already failed.
>
> This RFC patch is with the 3rd option. Would you mind suggesting which
> option is better?

Sorry for the late reply but I think option 1 is the way to go.

Option 3 would be a transient workaround and it would affect other
events too.  If it's really needed, I think option 2 is slightly better
than option 3.  Or, we can add a --force option to skip non-supported
events explicitly.

Thanks,
Namhyung

>
> Here is the output of patch.
>
> $ perf stat -v
> Using CPUID GenuineIntel-6-6A-6
> slots -> cpu/event=0,umask=0x4/
> topdown-retiring -> cpu/event=0,umask=0x80/
> topdown-bad-spec -> cpu/event=0,umask=0x81/
> topdown-fe-bound -> cpu/event=0,umask=0x82/
> topdown-be-bound -> cpu/event=0,umask=0x83/
> Control descriptor is not initialized
> Warning:
> slots event is not supported by the kernel.
> ^Ccpu-clock: 62021481051 62021480237 62021480237
> context-switches: 437 62021478064 62021478064
> cpu-migrations: 17 62021475294 62021475294
> page-faults: 12 62021471925 62021471925
> cycles: 15662273 62020909141 62020909141
> instructions: 6580385 62008944246 62008944246
> branches: 1446119 62008855550 62008855550
> branch-misses: 30970 62008643255 62008643255
> failed to read counter slots
> failed to read counter topdown-retiring
> failed to read counter topdown-bad-spec
> failed to read counter topdown-fe-bound
> failed to read counter topdown-be-bound
>
>  Performance counter stats for 'system wide':
>
>          62,021.48 msec cpu-clock                        #   16.006 CPUs utilized
>                437      context-switches                 #    7.046 /sec
>                 17      cpu-migrations                   #    0.274 /sec
>                 12      page-faults                      #    0.193 /sec
>         15,662,273      cycles                           #    0.000 GHz
>          6,580,385      instructions                     #    0.42  insn per cycle
>          1,446,119      branches                         #   23.316 K/sec
>             30,970      branch-misses                    #    2.14% of all branches
>    <not supported>      slots
>    <not supported>      topdown-retiring
>    <not supported>      topdown-bad-spec
>    <not supported>      topdown-fe-bound
>    <not supported>      topdown-be-bound
>
>        3.874991326 seconds time elapsed
>
> Thank you very much!
>
> Cc: Joe Jin <joe.jin@oracle.com>
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>  tools/perf/builtin-stat.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 0b4a62e4ff67..1053cf0886c0 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -762,9 +762,7 @@ static enum counter_recovery stat_handle_error(struct evsel *counter)
>                  */
>                 counter->errored = true;
>
> -               if ((evsel__leader(counter) != counter) ||
> -                   !(counter->core.leader->nr_members > 1))
> -                       return COUNTER_SKIP;
> +               return COUNTER_SKIP;
>         } else if (evsel__fallback(counter, errno, msg, sizeof(msg))) {
>                 if (verbose > 0)
>                         ui__warning("%s\n", msg);
> @@ -843,6 +841,8 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>                 if (target.use_bpf)
>                         break;
>
> +               if (evsel__leader(counter)->errored)
> +                       continue;
>                 if (counter->reset_group || counter->errored)
>                         continue;
>                 if (evsel__is_bpf(counter))
> @@ -901,6 +901,8 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>                 evlist__for_each_cpu(evlist_cpu_itr, evsel_list, affinity) {
>                         counter = evlist_cpu_itr.evsel;
>
> +                       if (evsel__leader(counter)->errored)
> +                               continue;
>                         if (!counter->reset_group && !counter->errored)
>                                 continue;
>                         if (!counter->reset_group)
> --
> 2.17.1
>
Dongli Zhang Oct. 14, 2022, 11:47 p.m. UTC | #7
Hi Namhyung,

On 10/14/22 3:16 PM, Namhyung Kim wrote:
> Hello,
> 
> On Thu, Sep 22, 2022 at 12:10 AM Dongli Zhang <dongli.zhang@oracle.com> wrote:
>>
>> Add kvm@vger.kernel.org as this issue is in virtualization env.
>>
>> The topdown metrics events became default since
>> commit 42641d6f4d15 ("perf stat: Add Topdown metrics events as default
>> events"). The perf will use 'slots' if the
>> /sys/bus/event_source/devices/cpu/events/slots is available.
>>
>> Unfortunately, the 'slots' may not be supported in the virtualization
>> environment. The hypervisor may not expose the 'slots' counter to the VM
>> in cpuid. As a result, the kernel may disable topdown slots and metrics
>> events in intel_pmu_init() if slots event is not in CPUID. E.g., both
>> c->weight and c->idxmsk64 are set to 0.
>>
>> There will be below error on Icelake VM since 'slots' is the leader:
>>
>> $ perf stat
>> Error:
>> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (slots).
>> /bin/dmesg | grep -i perf may provide additional information.
>>
>> This is because the stat_handle_error() returns COUNTER_FATAL when the
>> 'slots' is used as leader of events.
>>
>> There are three options to fix the issue.
>>
>> 1. Do not expose /sys/bus/event_source/devices/cpu/events/slots to
>> userspace so that pmu_have_event(pmu->name, "slots") returns false.
>>
>> 2. Run cpuid at perf userspace and avoid using 'slots' if it is not
>> supported in cpuid.
>>
>> 3. Do not fatal perf if the leader is failed. Do not create events for an
>> evsel if its leader is already failed.
>>
>> This RFC patch is with the 3rd option. Would you mind suggesting which
>> option is better?
> 
> Sorry for the late reply but I think option 1 is the way to go.
> 
> The option 3 would be a transient workaround and it would affect
> other events too.  If it's really needed, I think option 2 is slightly better
> than option 3.  Or, we can add --force option to skip non-supported
> events explicitly.

About options 2 and 3, I will wait for more comments in case anyone still
prefers a change in userspace, e.g., a "--force" option that avoids the
failure and just reports the events as not supported.

I have already sent a version that may impact other events. Please ignore
it, since you think option 1 is enough.

https://lore.kernel.org/all/20221010050113.13050-1-dongli.zhang@oracle.com/


About option 1, there is a bugfix from Kan pending review.

https://lore.kernel.org/all/20220922201505.2721654-1-kan.liang@linux.intel.com/

Thank you very much!

Dongli Zhang

> 
> Thanks,
> Namhyung
> 
>>
>> Here is the output of patch.
>>
>> $ perf stat -v
>> Using CPUID GenuineIntel-6-6A-6
>> slots -> cpu/event=0,umask=0x4/
>> topdown-retiring -> cpu/event=0,umask=0x80/
>> topdown-bad-spec -> cpu/event=0,umask=0x81/
>> topdown-fe-bound -> cpu/event=0,umask=0x82/
>> topdown-be-bound -> cpu/event=0,umask=0x83/
>> Control descriptor is not initialized
>> Warning:
>> slots event is not supported by the kernel.
>> ^Ccpu-clock: 62021481051 62021480237 62021480237
>> context-switches: 437 62021478064 62021478064
>> cpu-migrations: 17 62021475294 62021475294
>> page-faults: 12 62021471925 62021471925
>> cycles: 15662273 62020909141 62020909141
>> instructions: 6580385 62008944246 62008944246
>> branches: 1446119 62008855550 62008855550
>> branch-misses: 30970 62008643255 62008643255
>> failed to read counter slots
>> failed to read counter topdown-retiring
>> failed to read counter topdown-bad-spec
>> failed to read counter topdown-fe-bound
>> failed to read counter topdown-be-bound
>>
>>  Performance counter stats for 'system wide':
>>
>>          62,021.48 msec cpu-clock                        #   16.006 CPUs utilized
>>                437      context-switches                 #    7.046 /sec
>>                 17      cpu-migrations                   #    0.274 /sec
>>                 12      page-faults                      #    0.193 /sec
>>         15,662,273      cycles                           #    0.000 GHz
>>          6,580,385      instructions                     #    0.42  insn per cycle
>>          1,446,119      branches                         #   23.316 K/sec
>>             30,970      branch-misses                    #    2.14% of all branches
>>    <not supported>      slots
>>    <not supported>      topdown-retiring
>>    <not supported>      topdown-bad-spec
>>    <not supported>      topdown-fe-bound
>>    <not supported>      topdown-be-bound
>>
>>        3.874991326 seconds time elapsed
>>
>> Thank you very much!
>>
>> Cc: Joe Jin <joe.jin@oracle.com>
>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>> ---
>>  tools/perf/builtin-stat.c | 8 +++++---
>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>> index 0b4a62e4ff67..1053cf0886c0 100644
>> --- a/tools/perf/builtin-stat.c
>> +++ b/tools/perf/builtin-stat.c
>> @@ -762,9 +762,7 @@ static enum counter_recovery stat_handle_error(struct evsel *counter)
>>                  */
>>                 counter->errored = true;
>>
>> -               if ((evsel__leader(counter) != counter) ||
>> -                   !(counter->core.leader->nr_members > 1))
>> -                       return COUNTER_SKIP;
>> +               return COUNTER_SKIP;
>>         } else if (evsel__fallback(counter, errno, msg, sizeof(msg))) {
>>                 if (verbose > 0)
>>                         ui__warning("%s\n", msg);
>> @@ -843,6 +841,8 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>>                 if (target.use_bpf)
>>                         break;
>>
>> +               if (evsel__leader(counter)->errored)
>> +                       continue;
>>                 if (counter->reset_group || counter->errored)
>>                         continue;
>>                 if (evsel__is_bpf(counter))
>> @@ -901,6 +901,8 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>>                 evlist__for_each_cpu(evlist_cpu_itr, evsel_list, affinity) {
>>                         counter = evlist_cpu_itr.evsel;
>>
>> +                       if (evsel__leader(counter)->errored)
>> +                               continue;
>>                         if (!counter->reset_group && !counter->errored)
>>                                 continue;
>>                         if (!counter->reset_group)
>> --
>> 2.17.1
>>
Dongli Zhang Oct. 18, 2022, 9:29 p.m. UTC | #8
On 10/14/22 3:16 PM, Namhyung Kim wrote:
> Hello,
> 
> On Thu, Sep 22, 2022 at 12:10 AM Dongli Zhang <dongli.zhang@oracle.com> wrote:
>>
>> Add kvm@vger.kernel.org as this issue is in virtualization env.
>>
>> The topdown metrics events became default since
>> commit 42641d6f4d15 ("perf stat: Add Topdown metrics events as default
>> events"). The perf will use 'slots' if the
>> /sys/bus/event_source/devices/cpu/events/slots is available.
>>
>> Unfortunately, the 'slots' may not be supported in the virtualization
>> environment. The hypervisor may not expose the 'slots' counter to the VM
>> in cpuid. As a result, the kernel may disable topdown slots and metrics
>> events in intel_pmu_init() if slots event is not in CPUID. E.g., both
>> c->weight and c->idxmsk64 are set to 0.
>>
>> There will be below error on Icelake VM since 'slots' is the leader:
>>
>> $ perf stat
>> Error:
>> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (slots).
>> /bin/dmesg | grep -i perf may provide additional information.
>>
>> This is because the stat_handle_error() returns COUNTER_FATAL when the
>> 'slots' is used as leader of events.
>>
>> There are three options to fix the issue.
>>
>> 1. Do not expose /sys/bus/event_source/devices/cpu/events/slots to
>> userspace so that pmu_have_event(pmu->name, "slots") returns false.
>>
>> 2. Run cpuid at perf userspace and avoid using 'slots' if it is not
>> supported in cpuid.
>>
>> 3. Do not fatal perf if the leader is failed. Do not create events for an
>> evsel if its leader is already failed.
>>
>> This RFC patch is with the 3rd option. Would you mind suggesting which
>> option is better?
> 
> Sorry for the late reply but I think option 1 is the way to go.
> 
> The option 3 would be a transient workaround and it would affect
> other events too.  If it's really needed, I think option 2 is slightly better
> than option 3.  Or, we can add --force option to skip non-supported
> events explicitly.
> 
> Thanks,
> Namhyung
> 
>>
>> Here is the output of patch.
>>
>> $ perf stat -v
>> Using CPUID GenuineIntel-6-6A-6
>> slots -> cpu/event=0,umask=0x4/
>> topdown-retiring -> cpu/event=0,umask=0x80/
>> topdown-bad-spec -> cpu/event=0,umask=0x81/
>> topdown-fe-bound -> cpu/event=0,umask=0x82/
>> topdown-be-bound -> cpu/event=0,umask=0x83/
>> Control descriptor is not initialized
>> Warning:
>> slots event is not supported by the kernel.
>> ^Ccpu-clock: 62021481051 62021480237 62021480237
>> context-switches: 437 62021478064 62021478064
>> cpu-migrations: 17 62021475294 62021475294
>> page-faults: 12 62021471925 62021471925
>> cycles: 15662273 62020909141 62020909141
>> instructions: 6580385 62008944246 62008944246
>> branches: 1446119 62008855550 62008855550
>> branch-misses: 30970 62008643255 62008643255
>> failed to read counter slots
>> failed to read counter topdown-retiring
>> failed to read counter topdown-bad-spec
>> failed to read counter topdown-fe-bound
>> failed to read counter topdown-be-bound
>>
>>  Performance counter stats for 'system wide':
>>
>>          62,021.48 msec cpu-clock                        #   16.006 CPUs utilized
>>                437      context-switches                 #    7.046 /sec
>>                 17      cpu-migrations                   #    0.274 /sec
>>                 12      page-faults                      #    0.193 /sec
>>         15,662,273      cycles                           #    0.000 GHz
>>          6,580,385      instructions                     #    0.42  insn per cycle
>>          1,446,119      branches                         #   23.316 K/sec
>>             30,970      branch-misses                    #    2.14% of all branches
>>    <not supported>      slots
>>    <not supported>      topdown-retiring
>>    <not supported>      topdown-bad-spec
>>    <not supported>      topdown-fe-bound
>>    <not supported>      topdown-be-bound
>>
>>        3.874991326 seconds time elapsed
>>
>> Thank you very much!
>>
>> Cc: Joe Jin <joe.jin@oracle.com>
>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>> ---
>>  tools/perf/builtin-stat.c | 8 +++++---
>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>> index 0b4a62e4ff67..1053cf0886c0 100644
>> --- a/tools/perf/builtin-stat.c
>> +++ b/tools/perf/builtin-stat.c
>> @@ -762,9 +762,7 @@ static enum counter_recovery stat_handle_error(struct evsel *counter)
>>                  */
>>                 counter->errored = true;
>>
>> -               if ((evsel__leader(counter) != counter) ||
>> -                   !(counter->core.leader->nr_members > 1))
>> -                       return COUNTER_SKIP;
>> +               return COUNTER_SKIP;
>>         } else if (evsel__fallback(counter, errno, msg, sizeof(msg))) {
>>                 if (verbose > 0)
>>                         ui__warning("%s\n", msg);
>> @@ -843,6 +841,8 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>>                 if (target.use_bpf)
>>                         break;
>>
>> +               if (evsel__leader(counter)->errored)
>> +                       continue;
>>                 if (counter->reset_group || counter->errored)
>>                         continue;
>>                 if (evsel__is_bpf(counter))
>> @@ -901,6 +901,8 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>>                 evlist__for_each_cpu(evlist_cpu_itr, evsel_list, affinity) {
>>                         counter = evlist_cpu_itr.evsel;
>>
>> +                       if (evsel__leader(counter)->errored)
>> +                               continue;
>>                         if (!counter->reset_group && !counter->errored)
>>                                 continue;
>>                         if (!counter->reset_group)
>> --
>> 2.17.1
>>
Dongli Zhang Oct. 18, 2022, 9:31 p.m. UTC | #9
Please ignore this email. I carelessly clicked unwanted buttons on my
Thunderbird client.

Sorry for spamming.

Dongli Zhang

On 10/18/22 2:29 PM, Dongli Zhang wrote:
> 
> 
> On 10/14/22 3:16 PM, Namhyung Kim wrote:
>> Hello,
>>
>> On Thu, Sep 22, 2022 at 12:10 AM Dongli Zhang <dongli.zhang@oracle.com> wrote:
>>>
>>> Add kvm@vger.kernel.org as this issue is in virtualization env.
>>>
>>> The topdown metrics events became default since
>>> commit 42641d6f4d15 ("perf stat: Add Topdown metrics events as default
>>> events"). The perf will use 'slots' if the
>>> /sys/bus/event_source/devices/cpu/events/slots is available.
>>>
>>> Unfortunately, the 'slots' may not be supported in the virtualization
>>> environment. The hypervisor may not expose the 'slots' counter to the VM
>>> in cpuid. As a result, the kernel may disable topdown slots and metrics
>>> events in intel_pmu_init() if slots event is not in CPUID. E.g., both
>>> c->weight and c->idxmsk64 are set to 0.
>>>
>>> There will be below error on Icelake VM since 'slots' is the leader:
>>>
>>> $ perf stat
>>> Error:
>>> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (slots).
>>> /bin/dmesg | grep -i perf may provide additional information.
>>>
>>> This is because the stat_handle_error() returns COUNTER_FATAL when the
>>> 'slots' is used as leader of events.
>>>
>>> There are three options to fix the issue.
>>>
>>> 1. Do not expose /sys/bus/event_source/devices/cpu/events/slots to
>>> userspace so that pmu_have_event(pmu->name, "slots") returns false.
>>>
>>> 2. Run cpuid at perf userspace and avoid using 'slots' if it is not
>>> supported in cpuid.
>>>
>>> 3. Do not fatal perf if the leader is failed. Do not create events for an
>>> evsel if its leader is already failed.
>>>
>>> This RFC patch is with the 3rd option. Would you mind suggesting which
>>> option is better?
>>
>> Sorry for the late reply but I think option 1 is the way to go.
>>
>> The option 3 would be a transient workaround and it would affect
>> other events too.  If it's really needed, I think option 2 is slightly better
>> than option 3.  Or, we can add --force option to skip non-supported
>> events explicitly.
>>
>> Thanks,
>> Namhyung
>>
>>>
>>> Here is the output of patch.
>>>
>>> $ perf stat -v
>>> Using CPUID GenuineIntel-6-6A-6
>>> slots -> cpu/event=0,umask=0x4/
>>> topdown-retiring -> cpu/event=0,umask=0x80/
>>> topdown-bad-spec -> cpu/event=0,umask=0x81/
>>> topdown-fe-bound -> cpu/event=0,umask=0x82/
>>> topdown-be-bound -> cpu/event=0,umask=0x83/
>>> Control descriptor is not initialized
>>> Warning:
>>> slots event is not supported by the kernel.
>>> ^Ccpu-clock: 62021481051 62021480237 62021480237
>>> context-switches: 437 62021478064 62021478064
>>> cpu-migrations: 17 62021475294 62021475294
>>> page-faults: 12 62021471925 62021471925
>>> cycles: 15662273 62020909141 62020909141
>>> instructions: 6580385 62008944246 62008944246
>>> branches: 1446119 62008855550 62008855550
>>> branch-misses: 30970 62008643255 62008643255
>>> failed to read counter slots
>>> failed to read counter topdown-retiring
>>> failed to read counter topdown-bad-spec
>>> failed to read counter topdown-fe-bound
>>> failed to read counter topdown-be-bound
>>>
>>>  Performance counter stats for 'system wide':
>>>
>>>          62,021.48 msec cpu-clock                        #   16.006 CPUs utilized
>>>                437      context-switches                 #    7.046 /sec
>>>                 17      cpu-migrations                   #    0.274 /sec
>>>                 12      page-faults                      #    0.193 /sec
>>>         15,662,273      cycles                           #    0.000 GHz
>>>          6,580,385      instructions                     #    0.42  insn per cycle
>>>          1,446,119      branches                         #   23.316 K/sec
>>>             30,970      branch-misses                    #    2.14% of all branches
>>>    <not supported>      slots
>>>    <not supported>      topdown-retiring
>>>    <not supported>      topdown-bad-spec
>>>    <not supported>      topdown-fe-bound
>>>    <not supported>      topdown-be-bound
>>>
>>>        3.874991326 seconds time elapsed
>>>
>>> Thank you very much!
>>>
>>> Cc: Joe Jin <joe.jin@oracle.com>
>>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>>> ---
>>>  tools/perf/builtin-stat.c | 8 +++++---
>>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>>> index 0b4a62e4ff67..1053cf0886c0 100644
>>> --- a/tools/perf/builtin-stat.c
>>> +++ b/tools/perf/builtin-stat.c
>>> @@ -762,9 +762,7 @@ static enum counter_recovery stat_handle_error(struct evsel *counter)
>>>                  */
>>>                 counter->errored = true;
>>>
>>> -               if ((evsel__leader(counter) != counter) ||
>>> -                   !(counter->core.leader->nr_members > 1))
>>> -                       return COUNTER_SKIP;
>>> +               return COUNTER_SKIP;
>>>         } else if (evsel__fallback(counter, errno, msg, sizeof(msg))) {
>>>                 if (verbose > 0)
>>>                         ui__warning("%s\n", msg);
>>> @@ -843,6 +841,8 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>>>                 if (target.use_bpf)
>>>                         break;
>>>
>>> +               if (evsel__leader(counter)->errored)
>>> +                       continue;
>>>                 if (counter->reset_group || counter->errored)
>>>                         continue;
>>>                 if (evsel__is_bpf(counter))
>>> @@ -901,6 +901,8 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>>>                 evlist__for_each_cpu(evlist_cpu_itr, evsel_list, affinity) {
>>>                         counter = evlist_cpu_itr.evsel;
>>>
>>> +                       if (evsel__leader(counter)->errored)
>>> +                               continue;
>>>                         if (!counter->reset_group && !counter->errored)
>>>                                 continue;
>>>                         if (!counter->reset_group)
>>> --
>>> 2.17.1
>>>

Patch

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 0b4a62e4ff67..1053cf0886c0 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -762,9 +762,7 @@  static enum counter_recovery stat_handle_error(struct evsel *counter)
 		 */
 		counter->errored = true;
 
-		if ((evsel__leader(counter) != counter) ||
-		    !(counter->core.leader->nr_members > 1))
-			return COUNTER_SKIP;
+		return COUNTER_SKIP;
 	} else if (evsel__fallback(counter, errno, msg, sizeof(msg))) {
 		if (verbose > 0)
 			ui__warning("%s\n", msg);
@@ -843,6 +841,8 @@  static int __run_perf_stat(int argc, const char **argv, int run_idx)
 		if (target.use_bpf)
 			break;
 
+		if (evsel__leader(counter)->errored)
+			continue;
 		if (counter->reset_group || counter->errored)
 			continue;
 		if (evsel__is_bpf(counter))
@@ -901,6 +901,8 @@  static int __run_perf_stat(int argc, const char **argv, int run_idx)
 		evlist__for_each_cpu(evlist_cpu_itr, evsel_list, affinity) {
 			counter = evlist_cpu_itr.evsel;
 
+			if (evsel__leader(counter)->errored)
+				continue;
 			if (!counter->reset_group && !counter->errored)
 				continue;
 			if (!counter->reset_group)