
[1/2] perf stat: Fix segfault when counting armv8_pmu events

Message ID 20200922031346.15051-2-liwei391@huawei.com (mailing list archive)
State New, archived
Series perf stat: Unbreak perf stat with ARMv8 PMU events

Commit Message

Wei Li Sept. 22, 2020, 3:13 a.m. UTC
When executing perf stat with armv8_pmu events and a workload, it
reports a segfault:

(gdb) bt
#0  0x0000000000603fc8 in perf_evsel__close_fd_cpu (evsel=<optimized out>,
    cpu=<optimized out>) at evsel.c:122
#1  perf_evsel__close_cpu (evsel=evsel@entry=0x716e950, cpu=7) at evsel.c:156
#2  0x00000000004d4718 in evlist__close (evlist=0x70a7cb0) at util/evlist.c:1242
#3  0x0000000000453404 in __run_perf_stat (argc=3, argc@entry=1, argv=0x30,
    argv@entry=0xfffffaea2f90, run_idx=119, run_idx@entry=1701998435)
    at builtin-stat.c:929
#4  0x0000000000455058 in run_perf_stat (run_idx=1701998435, argv=0xfffffaea2f90,
    argc=1) at builtin-stat.c:947
#5  cmd_stat (argc=1, argv=0xfffffaea2f90) at builtin-stat.c:2357
#6  0x00000000004bb888 in run_builtin (p=p@entry=0x9764b8 <commands+288>,
    argc=argc@entry=4, argv=argv@entry=0xfffffaea2f90) at perf.c:312
#7  0x00000000004bbb54 in handle_internal_command (argc=argc@entry=4,
    argv=argv@entry=0xfffffaea2f90) at perf.c:364
#8  0x0000000000435378 in run_argv (argcp=<synthetic pointer>,
    argv=<synthetic pointer>) at perf.c:408
#9  main (argc=4, argv=0xfffffaea2f90) at perf.c:538

After debugging, I found the root cause: the xyarray fd is created
by evsel__open_per_thread(), ignoring the cpu passed in
create_perf_stat_counter(), while the evsel's cpumap is assigned as the
corresponding PMU's cpumap in __add_event(). Thus, the xyarray fd is
created with the ncpus of the dummy cpumap, and an out-of-bounds 'cpu'
index is used in perf_evsel__close_fd_cpu().

To address this, add a flag to mark this situation and avoid using the
affinity technique when closing/enabling/disabling events.
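
For illustration, a minimal standalone sketch of the failure mode
(hypothetical helper names, not the real perf xyarray code): the fd
table is sized for the dummy cpu map, but is later indexed with a cpu
index taken from the PMU's cpumap.

#include <stdio.h>
#include <stdlib.h>

struct fd_table {
        int ncpus, nthreads;
        int *fd;                /* ncpus * nthreads entries */
};

static struct fd_table *fd_table__new(int ncpus, int nthreads)
{
        struct fd_table *t = calloc(1, sizeof(*t));

        t->ncpus = ncpus;
        t->nthreads = nthreads;
        t->fd = calloc((size_t)ncpus * nthreads, sizeof(int));
        return t;
}

int main(void)
{
        /* evsel__open_per_thread() sizes the table with the dummy map: 1 cpu */
        struct fd_table *t = fd_table__new(1, 4);
        /* ...but evsel->cpus came from the PMU's cpumap, so close uses e.g. cpu 7 */
        int cpu = 7, thread = 0;
        int idx = cpu * t->nthreads + thread;

        /* prints "accessing entry 28 of a 4-entry table": out of bounds */
        printf("accessing entry %d of a %d-entry table\n",
               idx, t->ncpus * t->nthreads);
        free(t->fd);
        free(t);
        return 0;
}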

Fixes: 7736627b865d ("perf stat: Use affinity for closing file descriptors")
Fixes: 704e2f5b700d ("perf stat: Use affinity for enabling/disabling events")
Signed-off-by: Wei Li <liwei391@huawei.com>
---
 tools/lib/perf/include/internal/evlist.h |  1 +
 tools/perf/builtin-stat.c                |  3 +++
 tools/perf/util/evlist.c                 | 23 ++++++++++++++++++++++-
 3 files changed, 26 insertions(+), 1 deletion(-)

Comments

Andi Kleen Sept. 22, 2020, 7:23 p.m. UTC | #1
> After debugging, I found the root cause: the xyarray fd is created
> by evsel__open_per_thread(), ignoring the cpu passed in
> create_perf_stat_counter(), while the evsel's cpumap is assigned as the
> corresponding PMU's cpumap in __add_event(). Thus, the xyarray fd is
> created with the ncpus of the dummy cpumap, and an out-of-bounds 'cpu'
> index is used in perf_evsel__close_fd_cpu().
> 
> To address this, add a flag to mark this situation and avoid using the
> affinity technique when closing/enabling/disabling events.

The flag seems like a hack. How about figuring out the correct number of 
CPUs and using that?

-Andi
Andi Kleen Sept. 22, 2020, 7:50 p.m. UTC | #2
On Tue, Sep 22, 2020 at 12:23:21PM -0700, Andi Kleen wrote:
> > After debugging, I found the root cause: the xyarray fd is created
> > by evsel__open_per_thread(), ignoring the cpu passed in
> > create_perf_stat_counter(), while the evsel's cpumap is assigned as the
> > corresponding PMU's cpumap in __add_event(). Thus, the xyarray fd is
> > created with the ncpus of the dummy cpumap, and an out-of-bounds 'cpu'
> > index is used in perf_evsel__close_fd_cpu().
> > 
> > To address this, add a flag to mark this situation and avoid using the
> > affinity technique when closing/enabling/disabling events.
> 
> The flag seems like a hack. How about figuring out the correct number of 
> CPUs and using that?

Also, I would like to understand what's different on ARM64 compared to other architectures.
Or could this happen on x86 too?

-Andi
Jiri Olsa Sept. 23, 2020, 5:44 a.m. UTC | #3
On Tue, Sep 22, 2020 at 11:13:45AM +0800, Wei Li wrote:
> When executing perf stat with armv8_pmu events and a workload, it
> reports a segfault:

please share the perf stat command line you see that segfault for

thanks,
jirka

> 
> (gdb) bt
> #0  0x0000000000603fc8 in perf_evsel__close_fd_cpu (evsel=<optimized out>,
>     cpu=<optimized out>) at evsel.c:122
> #1  perf_evsel__close_cpu (evsel=evsel@entry=0x716e950, cpu=7) at evsel.c:156
> #2  0x00000000004d4718 in evlist__close (evlist=0x70a7cb0) at util/evlist.c:1242
> #3  0x0000000000453404 in __run_perf_stat (argc=3, argc@entry=1, argv=0x30,
>     argv@entry=0xfffffaea2f90, run_idx=119, run_idx@entry=1701998435)
>     at builtin-stat.c:929
> #4  0x0000000000455058 in run_perf_stat (run_idx=1701998435, argv=0xfffffaea2f90,
>     argc=1) at builtin-stat.c:947
> #5  cmd_stat (argc=1, argv=0xfffffaea2f90) at builtin-stat.c:2357
> #6  0x00000000004bb888 in run_builtin (p=p@entry=0x9764b8 <commands+288>,
>     argc=argc@entry=4, argv=argv@entry=0xfffffaea2f90) at perf.c:312
> #7  0x00000000004bbb54 in handle_internal_command (argc=argc@entry=4,
>     argv=argv@entry=0xfffffaea2f90) at perf.c:364
> #8  0x0000000000435378 in run_argv (argcp=<synthetic pointer>,
>     argv=<synthetic pointer>) at perf.c:408
> #9  main (argc=4, argv=0xfffffaea2f90) at perf.c:538
> 
> After debugging, I found the root cause: the xyarray fd is created
> by evsel__open_per_thread(), ignoring the cpu passed in
> create_perf_stat_counter(), while the evsel's cpumap is assigned as the
> corresponding PMU's cpumap in __add_event(). Thus, the xyarray fd is
> created with the ncpus of the dummy cpumap, and an out-of-bounds 'cpu'
> index is used in perf_evsel__close_fd_cpu().
> 
> To address this, add a flag to mark this situation and avoid using the
> affinity technique when closing/enabling/disabling events.
> 
> Fixes: 7736627b865d ("perf stat: Use affinity for closing file descriptors")
> Fixes: 704e2f5b700d ("perf stat: Use affinity for enabling/disabling events")
> Signed-off-by: Wei Li <liwei391@huawei.com>
> ---
>  tools/lib/perf/include/internal/evlist.h |  1 +
>  tools/perf/builtin-stat.c                |  3 +++
>  tools/perf/util/evlist.c                 | 23 ++++++++++++++++++++++-
>  3 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/lib/perf/include/internal/evlist.h b/tools/lib/perf/include/internal/evlist.h
> index 2d0fa02b036f..c02d7e583846 100644
> --- a/tools/lib/perf/include/internal/evlist.h
> +++ b/tools/lib/perf/include/internal/evlist.h
> @@ -17,6 +17,7 @@ struct perf_evlist {
>  	struct list_head	 entries;
>  	int			 nr_entries;
>  	bool			 has_user_cpus;
> +	bool			 open_per_thread;
>  	struct perf_cpu_map	*cpus;
>  	struct perf_cpu_map	*all_cpus;
>  	struct perf_thread_map	*threads;
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index fddc97cac984..6e6ceacce634 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -725,6 +725,9 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>  	if (group)
>  		perf_evlist__set_leader(evsel_list);
>  
> +	if (!(target__has_cpu(&target) && !target__has_per_thread(&target)))
> +		evsel_list->core.open_per_thread = true;
> +
>  	if (affinity__setup(&affinity) < 0)
>  		return -1;
>  
> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> index e3fa3bf7498a..bf8a3ccc599f 100644
> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -383,6 +383,15 @@ void evlist__disable(struct evlist *evlist)
>  	int cpu, i, imm = 0;
>  	bool has_imm = false;
>  
> +	if (evlist->core.open_per_thread) {
> +		evlist__for_each_entry(evlist, pos) {
> +			if (pos->disabled || !evsel__is_group_leader(pos) || !pos->core.fd)
> +				continue;
> +			evsel__disable(pos);
> +		}
> +		goto out;
> +	}
> +
>  	if (affinity__setup(&affinity) < 0)
>  		return;
>  
> @@ -414,6 +423,7 @@ void evlist__disable(struct evlist *evlist)
>  		pos->disabled = true;
>  	}
>  
> +out:
>  	evlist->enabled = false;
>  }
>  
> @@ -423,6 +433,15 @@ void evlist__enable(struct evlist *evlist)
>  	struct affinity affinity;
>  	int cpu, i;
>  
> +	if (evlist->core.open_per_thread) {
> +		evlist__for_each_entry(evlist, pos) {
> +			if (!evsel__is_group_leader(pos) || !pos->core.fd)
> +				continue;
> +			evsel__enable(pos);
> +		}
> +		goto out;
> +	}
> +
>  	if (affinity__setup(&affinity) < 0)
>  		return;
>  
> @@ -444,6 +463,7 @@ void evlist__enable(struct evlist *evlist)
>  		pos->disabled = false;
>  	}
>  
> +out:
>  	evlist->enabled = true;
>  }
>  
> @@ -1223,9 +1243,10 @@ void evlist__close(struct evlist *evlist)
>  
>  	/*
>  	 * With perf record core.cpus is usually NULL.
> +	 * Or perf stat may open events per-thread.
>  	 * Use the old method to handle this for now.
>  	 */
> -	if (!evlist->core.cpus) {
> +	if (evlist->core.open_per_thread || !evlist->core.cpus) {
>  		evlist__for_each_entry_reverse(evlist, evsel)
>  			evsel__close(evsel);
>  		return;
> -- 
> 2.17.1
>
Namhyung Kim Sept. 23, 2020, 1:49 p.m. UTC | #4
On Wed, Sep 23, 2020 at 2:44 PM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Tue, Sep 22, 2020 at 11:13:45AM +0800, Wei Li wrote:
> > When executing perf stat with armv8_pmu events and a workload, it
> > reports a segfault:
>
> please share the perf stat command line you see that segfault for

It seems the description in patch 0/2 already has it:

  [root@localhost hulk]# tools/perf/perf stat -e armv8_pmuv3_0/ll_cache_rd/,armv8_pmuv3_0/ll_cache_miss_rd/ ls > /dev/null
  Segmentation fault

Thanks
Namhyung


>
> thanks,
> jirka
>
> >
> > (gdb) bt
> > #0  0x0000000000603fc8 in perf_evsel__close_fd_cpu (evsel=<optimized out>,
> >     cpu=<optimized out>) at evsel.c:122
> > #1  perf_evsel__close_cpu (evsel=evsel@entry=0x716e950, cpu=7) at evsel.c:156
> > #2  0x00000000004d4718 in evlist__close (evlist=0x70a7cb0) at util/evlist.c:1242
> > #3  0x0000000000453404 in __run_perf_stat (argc=3, argc@entry=1, argv=0x30,
> >     argv@entry=0xfffffaea2f90, run_idx=119, run_idx@entry=1701998435)
> >     at builtin-stat.c:929
> > #4  0x0000000000455058 in run_perf_stat (run_idx=1701998435, argv=0xfffffaea2f90,
> >     argc=1) at builtin-stat.c:947
> > #5  cmd_stat (argc=1, argv=0xfffffaea2f90) at builtin-stat.c:2357
> > #6  0x00000000004bb888 in run_builtin (p=p@entry=0x9764b8 <commands+288>,
> >     argc=argc@entry=4, argv=argv@entry=0xfffffaea2f90) at perf.c:312
> > #7  0x00000000004bbb54 in handle_internal_command (argc=argc@entry=4,
> >     argv=argv@entry=0xfffffaea2f90) at perf.c:364
> > #8  0x0000000000435378 in run_argv (argcp=<synthetic pointer>,
> >     argv=<synthetic pointer>) at perf.c:408
> > #9  main (argc=4, argv=0xfffffaea2f90) at perf.c:538
> >
> > After debugging, I found the root cause: the xyarray fd is created
> > by evsel__open_per_thread(), ignoring the cpu passed in
> > create_perf_stat_counter(), while the evsel's cpumap is assigned as the
> > corresponding PMU's cpumap in __add_event(). Thus, the xyarray fd is
> > created with the ncpus of the dummy cpumap, and an out-of-bounds 'cpu'
> > index is used in perf_evsel__close_fd_cpu().
> >
> > To address this, add a flag to mark this situation and avoid using the
> > affinity technique when closing/enabling/disabling events.
> >
> > Fixes: 7736627b865d ("perf stat: Use affinity for closing file descriptors")
> > Fixes: 704e2f5b700d ("perf stat: Use affinity for enabling/disabling events")
> > Signed-off-by: Wei Li <liwei391@huawei.com>
> > ---
Jiri Olsa Sept. 23, 2020, 2:07 p.m. UTC | #5
On Wed, Sep 23, 2020 at 10:49:52PM +0900, Namhyung Kim wrote:
> On Wed, Sep 23, 2020 at 2:44 PM Jiri Olsa <jolsa@redhat.com> wrote:
> >
> > On Tue, Sep 22, 2020 at 11:13:45AM +0800, Wei Li wrote:
> > > When executing perf stat with armv8_pmu events and a workload, it
> > > reports a segfault:
> >
> > please share the perf stat command line you see that segfault for
> 
> It seems the description in patch 0/2 already has it:
> 
>   [root@localhost hulk]# tools/perf/perf stat -e armv8_pmuv3_0/ll_cache_rd/,armv8_pmuv3_0/ll_cache_miss_rd/ ls > /dev/null
>   Segmentation fault

yea I found it, but can't reproduce it.. I see the issue from
patch 2, but not sure what's the problem so far

jirka

> 
> Thanks
> Namhyung
> 
> 
> >
> > thanks,
> > jirka
> >
> > >
> > > (gdb) bt
> > > #0  0x0000000000603fc8 in perf_evsel__close_fd_cpu (evsel=<optimized out>,
> > >     cpu=<optimized out>) at evsel.c:122
> > > #1  perf_evsel__close_cpu (evsel=evsel@entry=0x716e950, cpu=7) at evsel.c:156
> > > #2  0x00000000004d4718 in evlist__close (evlist=0x70a7cb0) at util/evlist.c:1242
> > > #3  0x0000000000453404 in __run_perf_stat (argc=3, argc@entry=1, argv=0x30,
> > >     argv@entry=0xfffffaea2f90, run_idx=119, run_idx@entry=1701998435)
> > >     at builtin-stat.c:929
> > > #4  0x0000000000455058 in run_perf_stat (run_idx=1701998435, argv=0xfffffaea2f90,
> > >     argc=1) at builtin-stat.c:947
> > > #5  cmd_stat (argc=1, argv=0xfffffaea2f90) at builtin-stat.c:2357
> > > #6  0x00000000004bb888 in run_builtin (p=p@entry=0x9764b8 <commands+288>,
> > >     argc=argc@entry=4, argv=argv@entry=0xfffffaea2f90) at perf.c:312
> > > #7  0x00000000004bbb54 in handle_internal_command (argc=argc@entry=4,
> > >     argv=argv@entry=0xfffffaea2f90) at perf.c:364
> > > #8  0x0000000000435378 in run_argv (argcp=<synthetic pointer>,
> > >     argv=<synthetic pointer>) at perf.c:408
> > > #9  main (argc=4, argv=0xfffffaea2f90) at perf.c:538
> > >
> > > After debugging, I found the root cause: the xyarray fd is created
> > > by evsel__open_per_thread(), ignoring the cpu passed in
> > > create_perf_stat_counter(), while the evsel's cpumap is assigned as the
> > > corresponding PMU's cpumap in __add_event(). Thus, the xyarray fd is
> > > created with the ncpus of the dummy cpumap, and an out-of-bounds 'cpu'
> > > index is used in perf_evsel__close_fd_cpu().
> > >
> > > To address this, add a flag to mark this situation and avoid using the
> > > affinity technique when closing/enabling/disabling events.
> > >
> > > Fixes: 7736627b865d ("perf stat: Use affinity for closing file descriptors")
> > > Fixes: 704e2f5b700d ("perf stat: Use affinity for enabling/disabling events")
> > > Signed-off-by: Wei Li <liwei391@huawei.com>
> > > ---
>
Namhyung Kim Sept. 23, 2020, 2:15 p.m. UTC | #6
On Wed, Sep 23, 2020 at 11:08 PM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Wed, Sep 23, 2020 at 10:49:52PM +0900, Namhyung Kim wrote:
> > On Wed, Sep 23, 2020 at 2:44 PM Jiri Olsa <jolsa@redhat.com> wrote:
> > >
> > > On Tue, Sep 22, 2020 at 11:13:45AM +0800, Wei Li wrote:
> > > > When executing perf stat with armv8_pmu events and a workload, it
> > > > reports a segfault:
> > >
> > > please share the perf stat command line you see that segfault for
> >
> > It seems the description in patch 0/2 already has it:
> >
> >   [root@localhost hulk]# tools/perf/perf stat -e armv8_pmuv3_0/ll_cache_rd/,armv8_pmuv3_0/ll_cache_miss_rd/ ls > /dev/null
> >   Segmentation fault
>
> yea I found it, but can't reproduce it.. I see the issue from
> patch 2, but not sure what's the problem so far

I think the problem is that armv8_pmu has a cpumask,
and the user requested per-task events.

The code tried to open the event with a dummy cpu map
since it's not a cpu event, but the pmu has a cpu map and
it's passed to the evsel.  So there's confusion somewhere
about whether it should use evsel->cpus or a dummy map.
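
Roughly, the propagation logic in tools/lib/perf/evlist.c looks like
this (condensed sketch, not the verbatim source):

static void __perf_evlist__propagate_maps(struct perf_evlist *evlist,
                                          struct perf_evsel *evsel)
{
        /*
         * A PMU event inherits the PMU's own_cpus (e.g. "0-3" for
         * armv8_pmuv3_0), so in a per-task session the evsel ends up
         * with the PMU's cpu map even though its fds were sized for
         * the dummy map -- hence the out-of-bounds cpu index.
         */
        if (!evsel->own_cpus || evlist->has_user_cpus) {
                perf_cpu_map__put(evsel->cpus);
                evsel->cpus = perf_cpu_map__get(evlist->cpus);
        } else if (evsel->cpus != evsel->own_cpus) {
                perf_cpu_map__put(evsel->cpus);
                evsel->cpus = perf_cpu_map__get(evsel->own_cpus);
        }
}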

Thanks
Namhyung
Jiri Olsa Sept. 23, 2020, 8:19 p.m. UTC | #7
On Wed, Sep 23, 2020 at 11:15:06PM +0900, Namhyung Kim wrote:
> On Wed, Sep 23, 2020 at 11:08 PM Jiri Olsa <jolsa@redhat.com> wrote:
> >
> > On Wed, Sep 23, 2020 at 10:49:52PM +0900, Namhyung Kim wrote:
> > > On Wed, Sep 23, 2020 at 2:44 PM Jiri Olsa <jolsa@redhat.com> wrote:
> > > >
> > > > On Tue, Sep 22, 2020 at 11:13:45AM +0800, Wei Li wrote:
> > > > > When executing perf stat with armv8_pmu events and a workload, it
> > > > > reports a segfault:
> > > >
> > > > please share the perf stat command line you see that segfault for
> > >
> > > It seems the description in patch 0/2 already has it:
> > >
> > >   [root@localhost hulk]# tools/perf/perf stat -e armv8_pmuv3_0/ll_cache_rd/,armv8_pmuv3_0/ll_cache_miss_rd/ ls > /dev/null
> > >   Segmentation fault
> >
> > yea I found it, but can't reproduce it.. I see the issue from
> > patch 2, but not sure what's the problem so far
> 
> I think the problem is that armv8_pmu has a cpumask,
> and the user requested per-task events.
> 
> The code tried to open the event with a dummy cpu map
> since it's not a cpu event, but the pmu has a cpu map and
> it's passed to the evsel.  So there's confusion somewhere
> about whether it should use evsel->cpus or a dummy map.

you're right, I have following cpus file in pmu:

  # cat /sys/devices/armv8_pmuv3_0/cpus 
  0-3

covering all the cpus.. and once you have cpumask/cpus file,
you're system wide by default in current code, but we should
not crash ;-)

I tried to cover this case in patch below and I probably broke
some other use cases, but perhaps we could allow to open counters
per cpus for given workload

I'll try to look at this more tomorrow

jirka


---
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 7f8d756d9408..0c7f16a673c2 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -379,12 +379,7 @@ static int read_affinity_counters(struct timespec *rs)
 	if (affinity__setup(&affinity) < 0)
 		return -1;
 
-	ncpus = perf_cpu_map__nr(evsel_list->core.all_cpus);
-	if (!target__has_cpu(&target) || target__has_per_thread(&target))
-		ncpus = 1;
 	evlist__for_each_cpu(evsel_list, i, cpu) {
-		if (i >= ncpus)
-			break;
 		affinity__set(&affinity, cpu);
 
 		evlist__for_each_entry(evsel_list, counter) {
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index fd865002cbbd..ef525eb2f619 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1861,6 +1861,16 @@ void evsel__close(struct evsel *evsel)
 	perf_evsel__free_id(&evsel->core);
 }
 
+int evsel__open_threads_per_cpu(struct evsel *evsel, struct perf_thread_map *threads,
+				struct perf_cpu_map *cpus, int cpu)
+{
+	if (cpu == -1)
+		return evsel__open_cpu(evsel, cpus, threads, 0,
+					cpus ? cpus->nr : 1);
+
+	return evsel__open_cpu(evsel, cpus, threads, cpu, cpu + 1);
+}
+
 int evsel__open_per_cpu(struct evsel *evsel, struct perf_cpu_map *cpus, int cpu)
 {
 	if (cpu == -1)
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 35e3f6d66085..1d055699bd1f 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -231,6 +231,8 @@ int evsel__enable(struct evsel *evsel);
 int evsel__disable(struct evsel *evsel);
 int evsel__disable_cpu(struct evsel *evsel, int cpu);
 
+int evsel__open_threads_per_cpu(struct evsel *evsel, struct perf_thread_map *threads,
+				struct perf_cpu_map *cpus, int cpu);
 int evsel__open_per_cpu(struct evsel *evsel, struct perf_cpu_map *cpus, int cpu);
 int evsel__open_per_thread(struct evsel *evsel, struct perf_thread_map *threads);
 int evsel__open(struct evsel *evsel, struct perf_cpu_map *cpus,
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index cdb154381a87..2b17f1315cfb 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -560,6 +560,11 @@ int create_perf_stat_counter(struct evsel *evsel,
 			attr->enable_on_exec = 1;
 	}
 
+	if (evsel->core.own_cpus && evsel->core.threads) {
+		return evsel__open_threads_per_cpu(evsel, evsel->core.threads,
+						   evsel__cpus(evsel), cpu);
+	}
+
 	if (target__has_cpu(target) && !target__has_per_thread(target))
 		return evsel__open_per_cpu(evsel, evsel__cpus(evsel), cpu);
Wei Li Sept. 24, 2020, 2:14 p.m. UTC | #8
Hi Andi,

On 2020/9/23 3:50, Andi Kleen wrote:
> On Tue, Sep 22, 2020 at 12:23:21PM -0700, Andi Kleen wrote:
>>> After debugging, I found the root cause: the xyarray fd is created
>>> by evsel__open_per_thread(), ignoring the cpu passed in
>>> create_perf_stat_counter(), while the evsel's cpumap is assigned as the
>>> corresponding PMU's cpumap in __add_event(). Thus, the xyarray fd is
>>> created with the ncpus of the dummy cpumap, and an out-of-bounds 'cpu'
>>> index is used in perf_evsel__close_fd_cpu().
>>>
>>> To address this, add a flag to mark this situation and avoid using the
>>> affinity technique when closing/enabling/disabling events.
>>
>> The flag seems like a hack. How about figuring out the correct number of 
>> CPUs and using that?
> 
> Also, I would like to understand what's different on ARM64 compared to other architectures.
> Or could this happen on x86 too?
> 

The problem is that when the user requests per-task events, the cpumask is
expected to be NULL (dummy), while the armv8_pmu does have a cpumask, which
is inherited by the evsel. The armv8_pmu's cpumask was added for
heterogeneous systems, so this issue cannot happen on x86.

In fact, the cpumask itself is correct, but it shouldn't be used when we are
requesting per-task events. As these events should be installed on all
cores, I doubt how much we can benefit from the affinity technique, so I
chose to add a flag.
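
(For reference, the affinity optimization under discussion migrates the
perf process to each CPU in turn before issuing the ioctls for that CPU's
events, avoiding cross-CPU IPIs. A condensed sketch of the pattern used by
evlist__disable()/evlist__enable() above, not the verbatim source:)

        struct evsel *pos;
        struct affinity affinity;
        int i, cpu;

        if (affinity__setup(&affinity) < 0)
                return;

        evlist__for_each_cpu(evlist, i, cpu) {
                /* migrate perf itself to 'cpu' before touching its events */
                affinity__set(&affinity, cpu);

                evlist__for_each_entry(evlist, pos) {
                        /* issue the enable/disable ioctl for this cpu only */
                }
        }
        affinity__cleanup(&affinity);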

I also did a test on a HiSilicon arm64 D06 board, with 2 sockets and 128 cores,
testing the following command 3 times, with and without the affinity technique:

time tools/perf/perf stat -ddd -C 0-127 --per-core --timeout=5000 2> /dev/null

* (HEAD detached at 7074674e7338) perf cpumap: Maintain cpumaps ordered and without dups
real	0m8.039s
user	0m0.402s
sys	0m2.582s

real	0m7.939s
user	0m0.360s
sys	0m2.560s

real	0m7.997s
user	0m0.358s
sys	0m2.586s

* (HEAD detached at 704e2f5b700d) perf stat: Use affinity for enabling/disabling events
real	0m7.954s
user	0m0.308s
sys	0m2.590s

real	0m12.959s
user	0m0.332s
sys	0m2.582s

real	0m18.009s
user	0m0.346s
sys	0m2.562s

The off-CPU time is much longer when using affinity; I think that's the cost
of migration. Could you please share your test case?

Thanks,
Wei
Namhyung Kim Sept. 24, 2020, 2:36 p.m. UTC | #9
On Wed, Sep 23, 2020 at 10:19:00PM +0200, Jiri Olsa wrote:
> On Wed, Sep 23, 2020 at 11:15:06PM +0900, Namhyung Kim wrote:
> > I think the problem is that armv8_pmu has a cpumask,
> > and the user requested per-task events.
> > 
> > The code tried to open the event with a dummy cpu map
> > since it's not a cpu event, but the pmu has a cpu map and
> > it's passed to the evsel.  So there's confusion somewhere
> > about whether it should use evsel->cpus or a dummy map.
> 
> you're right, I have following cpus file in pmu:
> 
>   # cat /sys/devices/armv8_pmuv3_0/cpus 
>   0-3
> 
> covering all the cpus.. and once you have cpumask/cpus file,
> you're system wide by default in current code, but we should
> not crash ;-)
> 
> I tried to cover this case in patch below and I probably broke
> some other use cases, but perhaps we could allow to open counters
> per cpus for given workload
> 
> I'll try to look at this more tomorrow

I'm thinking about a different approach, we can ignore cpu map
for the ARM cpu PMU and use the dummy, not tested ;-)

Thanks
Namhyung


diff --git a/tools/lib/perf/evlist.c b/tools/lib/perf/evlist.c
index 2208444ecb44..cfcdbd7be066 100644
--- a/tools/lib/perf/evlist.c
+++ b/tools/lib/perf/evlist.c
@@ -45,6 +45,9 @@ static void __perf_evlist__propagate_maps(struct perf_evlist *evlist,
        if (!evsel->own_cpus || evlist->has_user_cpus) {
                perf_cpu_map__put(evsel->cpus);
                evsel->cpus = perf_cpu_map__get(evlist->cpus);
+       } else if (!evsel->system_wide && perf_cpu_map__empty(evlist->cpus)) {
+               perf_cpu_map__put(evsel->cpus);
+               evsel->cpus = perf_cpu_map__get(evlist->cpus);
        } else if (evsel->cpus != evsel->own_cpus) {
                perf_cpu_map__put(evsel->cpus);
                evsel->cpus = perf_cpu_map__get(evsel->own_cpus);
Jiri Olsa Sept. 25, 2020, 9:01 p.m. UTC | #10
On Thu, Sep 24, 2020 at 11:36:23PM +0900, Namhyung Kim wrote:
> On Wed, Sep 23, 2020 at 10:19:00PM +0200, Jiri Olsa wrote:
> > On Wed, Sep 23, 2020 at 11:15:06PM +0900, Namhyung Kim wrote:
> > > I think the problem is that armv8_pmu has a cpumask,
> > > and the user requested per-task events.
> > > 
> > > The code tried to open the event with a dummy cpu map
> > > since it's not a cpu event, but the pmu has a cpu map and
> > > it's passed to the evsel.  So there's confusion somewhere
> > > about whether it should use evsel->cpus or a dummy map.
> > 
> > you're right, I have following cpus file in pmu:
> > 
> >   # cat /sys/devices/armv8_pmuv3_0/cpus 
> >   0-3
> > 
> > covering all the cpus.. and once you have cpumask/cpus file,
> > you're system wide by default in current code, but we should
> > not crash ;-)
> > 
> > I tried to cover this case in patch below and I probably broke
> > some other use cases, but perhaps we could allow to open counters
> > per cpus for given workload
> > 
> > I'll try to look at this more tomorrow
> 
> I'm thinking about a different approach, we can ignore cpu map
> for the ARM cpu PMU and use the dummy, not tested ;-)
> 
> Thanks
> Namhyung
> 
> 
> diff --git a/tools/lib/perf/evlist.c b/tools/lib/perf/evlist.c
> index 2208444ecb44..cfcdbd7be066 100644
> --- a/tools/lib/perf/evlist.c
> +++ b/tools/lib/perf/evlist.c
> @@ -45,6 +45,9 @@ static void __perf_evlist__propagate_maps(struct perf_evlist *evlist,
>         if (!evsel->own_cpus || evlist->has_user_cpus) {
>                 perf_cpu_map__put(evsel->cpus);
>                 evsel->cpus = perf_cpu_map__get(evlist->cpus);
> +       } else if (!evsel->system_wide && perf_cpu_map__empty(evlist->cpus)) {
> +               perf_cpu_map__put(evsel->cpus);
> +               evsel->cpus = perf_cpu_map__get(evlist->cpus);

but I like this fix, because mine messes up scaling ;-)

>         } else if (evsel->cpus != evsel->own_cpus) {
>                 perf_cpu_map__put(evsel->cpus);
>                 evsel->cpus = perf_cpu_map__get(evsel->own_cpus);

it looks like the cpus file (/sys/devices/armv8_pmuv3_0/cpus) can
have some cpus switched off.. from a brief scan of:

  drivers/perf/arm_pmu_platform.c (search for supported_cpus)

but it looks like it's just a check for the interrupt, so counting
might still work..?

thanks,
jirka
Jiri Olsa Oct. 2, 2020, 8:59 a.m. UTC | #11
On Thu, Sep 24, 2020 at 11:36:23PM +0900, Namhyung Kim wrote:
> On Wed, Sep 23, 2020 at 10:19:00PM +0200, Jiri Olsa wrote:
> > On Wed, Sep 23, 2020 at 11:15:06PM +0900, Namhyung Kim wrote:
> > > I think the problem is that armv8_pmu has a cpumask,
> > > and the user requested per-task events.
> > > 
> > > The code tried to open the event with a dummy cpu map
> > > since it's not a cpu event, but the pmu has a cpu map and
> > > it's passed to the evsel.  So there's confusion somewhere
> > > about whether it should use evsel->cpus or a dummy map.
> > 
> > you're right, I have following cpus file in pmu:
> > 
> >   # cat /sys/devices/armv8_pmuv3_0/cpus 
> >   0-3
> > 
> > covering all the cpus.. and once you have cpumask/cpus file,
> > you're system wide by default in current code, but we should
> > not crash ;-)
> > 
> > I tried to cover this case in patch below and I probably broke
> > some other use cases, but perhaps we could allow to open counters
> > per cpus for given workload
> > 
> > I'll try to look at this more tomorrow
> 
> I'm thinking about a different approach, we can ignore cpu map
> for the ARM cpu PMU and use the dummy, not tested ;-)
> 
> Thanks
> Namhyung
> 
> 
> diff --git a/tools/lib/perf/evlist.c b/tools/lib/perf/evlist.c
> index 2208444ecb44..cfcdbd7be066 100644
> --- a/tools/lib/perf/evlist.c
> +++ b/tools/lib/perf/evlist.c
> @@ -45,6 +45,9 @@ static void __perf_evlist__propagate_maps(struct perf_evlist *evlist,
>         if (!evsel->own_cpus || evlist->has_user_cpus) {
>                 perf_cpu_map__put(evsel->cpus);
>                 evsel->cpus = perf_cpu_map__get(evlist->cpus);
> +       } else if (!evsel->system_wide && perf_cpu_map__empty(evlist->cpus)) {
> +               perf_cpu_map__put(evsel->cpus);
> +               evsel->cpus = perf_cpu_map__get(evlist->cpus);
>         } else if (evsel->cpus != evsel->own_cpus) {
>                 perf_cpu_map__put(evsel->cpus);
>                 evsel->cpus = perf_cpu_map__get(evsel->own_cpus);
> 

Wei Li,
is this fixing your problem?

thanks,
jirka
Song Bao Hua (Barry Song) Oct. 6, 2020, 6:51 a.m. UTC | #12
> -----Original Message-----
> From: Jiri Olsa [mailto:jolsa@redhat.com]
> Sent: Friday, October 2, 2020 10:00 PM
> To: Namhyung Kim <namhyung@kernel.org>; liwei (GF)
> <liwei391@huawei.com>
> Cc: Mark Rutland <mark.rutland@arm.com>; Andi Kleen <ak@linux.intel.com>;
> Alexander Shishkin <alexander.shishkin@linux.intel.com>; Alexey Budankov
> <alexey.budankov@linux.intel.com>; Adrian Hunter
> <adrian.hunter@intel.com>; Arnaldo Carvalho de Melo <acme@kernel.org>;
> linux-kernel <linux-kernel@vger.kernel.org>; Peter Zijlstra
> <peterz@infradead.org>; Andi Kleen <andi@firstfloor.org>; Libin (Huawei)
> <huawei.libin@huawei.com>; Ingo Molnar <mingo@redhat.com>;
> linux-arm-kernel@lists.infradead.org
> Subject: Re: [PATCH 1/2] perf stat: Fix segfault when counting armv8_pmu
> events
> 
> On Thu, Sep 24, 2020 at 11:36:23PM +0900, Namhyung Kim wrote:
> > On Wed, Sep 23, 2020 at 10:19:00PM +0200, Jiri Olsa wrote:
> > > On Wed, Sep 23, 2020 at 11:15:06PM +0900, Namhyung Kim wrote:
> > > > I think the problem is that armv8_pmu has a cpumask,
> > > > and the user requested per-task events.
> > > >
> > > > The code tried to open the event with a dummy cpu map
> > > > since it's not a cpu event, but the pmu has a cpu map and
> > > > it's passed to the evsel.  So there's confusion somewhere
> > > > about whether it should use evsel->cpus or a dummy map.
> > >
> > > you're right, I have following cpus file in pmu:
> > >
> > >   # cat /sys/devices/armv8_pmuv3_0/cpus
> > >   0-3
> > >
> > > covering all the cpus.. and once you have cpumask/cpus file,
> > > you're system wide by default in current code, but we should
> > > not crash ;-)
> > >
> > > I tried to cover this case in patch below and I probably broke
> > > some other use cases, but perhaps we could allow to open counters
> > > per cpus for given workload
> > >
> > > I'll try to look at this more tomorrow
> >
> > I'm thinking about a different approach, we can ignore cpu map
> > for the ARM cpu PMU and use the dummy, not tested ;-)
> >
> > Thanks
> > Namhyung
> >
> >
> > diff --git a/tools/lib/perf/evlist.c b/tools/lib/perf/evlist.c
> > index 2208444ecb44..cfcdbd7be066 100644
> > --- a/tools/lib/perf/evlist.c
> > +++ b/tools/lib/perf/evlist.c
> > @@ -45,6 +45,9 @@ static void __perf_evlist__propagate_maps(struct perf_evlist *evlist,
> >         if (!evsel->own_cpus || evlist->has_user_cpus) {
> >                 perf_cpu_map__put(evsel->cpus);
> >                 evsel->cpus = perf_cpu_map__get(evlist->cpus);
> > +       } else if (!evsel->system_wide && perf_cpu_map__empty(evlist->cpus)) {
> > +               perf_cpu_map__put(evsel->cpus);
> > +               evsel->cpus = perf_cpu_map__get(evlist->cpus);
> >         } else if (evsel->cpus != evsel->own_cpus) {
> >                 perf_cpu_map__put(evsel->cpus);
> >                 evsel->cpus = perf_cpu_map__get(evsel->own_cpus);
> >
> 
> Wei Li,
> is this fixing your problem?

I have seen the same crash and suggested another patch:
[PATCH] perf evlist: fix memory corruption for Kernel PMU event
https://lore.kernel.org/lkml/20201001115729.27116-1-song.bao.hua@hisilicon.com/

Also, I have tested Namhyung Kim's patch on ARM64. It does fix the crash for me. So:
Tested-by: Barry Song <song.bao.hua@hisilicon.com>

Please put the Fixes tag below in the commit log:
Fixes: 7736627b865d ("perf stat: Use affinity for closing file descriptors")

Thanks
Barry

Patch

diff --git a/tools/lib/perf/include/internal/evlist.h b/tools/lib/perf/include/internal/evlist.h
index 2d0fa02b036f..c02d7e583846 100644
--- a/tools/lib/perf/include/internal/evlist.h
+++ b/tools/lib/perf/include/internal/evlist.h
@@ -17,6 +17,7 @@  struct perf_evlist {
 	struct list_head	 entries;
 	int			 nr_entries;
 	bool			 has_user_cpus;
+	bool			 open_per_thread;
 	struct perf_cpu_map	*cpus;
 	struct perf_cpu_map	*all_cpus;
 	struct perf_thread_map	*threads;
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index fddc97cac984..6e6ceacce634 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -725,6 +725,9 @@  static int __run_perf_stat(int argc, const char **argv, int run_idx)
 	if (group)
 		perf_evlist__set_leader(evsel_list);
 
+	if (!(target__has_cpu(&target) && !target__has_per_thread(&target)))
+		evsel_list->core.open_per_thread = true;
+
 	if (affinity__setup(&affinity) < 0)
 		return -1;
 
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index e3fa3bf7498a..bf8a3ccc599f 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -383,6 +383,15 @@  void evlist__disable(struct evlist *evlist)
 	int cpu, i, imm = 0;
 	bool has_imm = false;
 
+	if (evlist->core.open_per_thread) {
+		evlist__for_each_entry(evlist, pos) {
+			if (pos->disabled || !evsel__is_group_leader(pos) || !pos->core.fd)
+				continue;
+			evsel__disable(pos);
+		}
+		goto out;
+	}
+
 	if (affinity__setup(&affinity) < 0)
 		return;
 
@@ -414,6 +423,7 @@  void evlist__disable(struct evlist *evlist)
 		pos->disabled = true;
 	}
 
+out:
 	evlist->enabled = false;
 }
 
@@ -423,6 +433,15 @@  void evlist__enable(struct evlist *evlist)
 	struct affinity affinity;
 	int cpu, i;
 
+	if (evlist->core.open_per_thread) {
+		evlist__for_each_entry(evlist, pos) {
+			if (!evsel__is_group_leader(pos) || !pos->core.fd)
+				continue;
+			evsel__enable(pos);
+		}
+		goto out;
+	}
+
 	if (affinity__setup(&affinity) < 0)
 		return;
 
@@ -444,6 +463,7 @@  void evlist__enable(struct evlist *evlist)
 		pos->disabled = false;
 	}
 
+out:
 	evlist->enabled = true;
 }
 
@@ -1223,9 +1243,10 @@  void evlist__close(struct evlist *evlist)
 
 	/*
 	 * With perf record core.cpus is usually NULL.
+	 * Or perf stat may open events per-thread.
 	 * Use the old method to handle this for now.
 	 */
-	if (!evlist->core.cpus) {
+	if (evlist->core.open_per_thread || !evlist->core.cpus) {
 		evlist__for_each_entry_reverse(evlist, evsel)
 			evsel__close(evsel);
 		return;