[5/5] perf docs: arm_spe: Document new discard mode

Message ID	20241217115610.371755-6-james.clark@linaro.org (mailing list archive)
State	Superseded
From	James Clark <james.clark@linaro.org>
To	linux-arm-kernel@lists.infradead.org, linux-perf-users@vger.kernel.org
Cc	James Clark <james.clark@linaro.org>, Will Deacon <will@kernel.org>, Mark Rutland <mark.rutland@arm.com>, Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@redhat.com>, Arnaldo Carvalho de Melo <acme@kernel.org>, Namhyung Kim <namhyung@kernel.org>, Alexander Shishkin <alexander.shishkin@linux.intel.com>, Jiri Olsa <jolsa@kernel.org>, Ian Rogers <irogers@google.com>, Adrian Hunter <adrian.hunter@intel.com>, "Liang, Kan" <kan.liang@linux.intel.com>, John Garry <john.g.garry@oracle.com>, Mike Leach <mike.leach@linaro.org>, Leo Yan <leo.yan@linux.dev>, Graham Woodward <graham.woodward@arm.com>, linux-kernel@vger.kernel.org, bpf@vger.kernel.org
Subject	[PATCH 5/5] perf docs: arm_spe: Document new discard mode
Date	Tue, 17 Dec 2024 11:56:08 +0000
In-Reply-To	<20241217115610.371755-1-james.clark@linaro.org>
Series	perf: arm_spe: Add format option for discard mode
  [0/5] perf: arm_spe: Add format option for discard mode
  [1/5] perf: arm_spe: Add format option for discard mode
  [2/5] perf tool: arm-spe: Pull out functions for aux buffer and tracking setup
  [3/5] perf tool: arm-spe: Don't allocate buffer or tracking event in discard mode
  [4/5] perf test: arm_spe: Add test for discard mode
  [5/5] perf docs: arm_spe: Document new discard mode

James Clark Dec. 17, 2024, 11:56 a.m. UTC

Document the flag, hint what it's used for and give an example with
other useful options to get minimal output.

Signed-off-by: James Clark <james.clark@linaro.org>
---
 tools/perf/Documentation/perf-arm-spe.txt | 11 +++++++++++
 1 file changed, 11 insertions(+)

Ian Rogers Dec. 18, 2024, 12:54 a.m. UTC | #1

On Tue, Dec 17, 2024 at 3:56 AM James Clark <james.clark@linaro.org> wrote:
>
> Document the flag, hint what it's used for and give an example with
> other useful options to get minimal output.
>
> Signed-off-by: James Clark <james.clark@linaro.org>
> ---
>  tools/perf/Documentation/perf-arm-spe.txt | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/tools/perf/Documentation/perf-arm-spe.txt b/tools/perf/Documentation/perf-arm-spe.txt
> index de2b0b479249..588eead438bc 100644
> --- a/tools/perf/Documentation/perf-arm-spe.txt
> +++ b/tools/perf/Documentation/perf-arm-spe.txt
> @@ -150,6 +150,7 @@ arm_spe/load_filter=1,min_latency=10/'
>    pct_enable=1        - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
>    store_filter=1      - collect stores only (PMSFCR.ST)
>    ts_enable=1         - enable timestamping with value of generic timer (PMSCR.TS)
> +  discard=1           - enable SPE PMU events but don't collect sample data - see 'Discard mode' (PMBLIMITR.FM = DISCARD)
>
>  +++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather
>  than only the execution latency.
> @@ -220,6 +221,16 @@ Common errors
>
>     Increase sampling interval (see above)
>
> +Discard mode
> +~~~~~~~~~~~~
> +
> +SPE PMU events can be used without the overhead of collecting sample data if
> +discard mode is supported (optional from Armv8.6). First run a system wide SPE
> +session (or on the core of interest) using options to minimize output. Then run
> +perf stat:
> +
> +  perf record -e arm_spe/discard/ -a -N -B --no-bpf-event -o - > /dev/null &
> +  perf stat -e SAMPLE_FEED_LD

Perhaps clarify this should be an ARM SPE event? It seems strange to
have one perf command affect a later one, the purpose of things like
event multiplexing is to hide the hardware limits. I'd prefer if the
last bit was like:
```
Then run perf stat with an SPE event on the same PMU:

perf record -e arm_spe/discard/ -a -N -B --no-bpf-event -o - > /dev/null &
perf stat -e arm_spe/SAMPLE_FEED_LD/
```

Thanks,
Ian

James Clark Dec. 18, 2024, 10:07 a.m. UTC | #2

On 18/12/2024 12:54 am, Ian Rogers wrote:
> Perhaps clarify this should be an ARM SPE event? It seems strange to
> have one perf command affect a later one, the purpose of things like
> event multiplexing is to hide the hardware limits. I'd prefer if the
> last bit was like:
> ```
> Then run perf stat with an SPE event on the same PMU:
> 
> perf record -e arm_spe/discard/ -a -N -B --no-bpf-event -o - > /dev/null &
> perf stat -e arm_spe/SAMPLE_FEED_LD/
> ```
> 
> Thanks,
> Ian

Hi Ian,

Confusingly, this isn't an SPE event; it is a normal PMU event. One perf
command affects the other because these events only count when SPE is
enabled. Enabling SPE takes effect per core, which is why the example
keeps things simple by enabling SPE system-wide.

SPE is an exclusive PMU, like Coresight and some others, so it can't be
affected by multiplexing or anything like that. The SAMPLE_FEED_LD event
can be multiplexed, but as long as SPE stays enabled it will count the
right thing regardless.

Thanks
James

Ian Rogers Dec. 18, 2024, 7:47 p.m. UTC | #3

On Wed, Dec 18, 2024 at 2:07 AM James Clark <james.clark@linaro.org> wrote:
>
> Hi Ian,
>
> Confusingly, this isn't an SPE event; it is a normal PMU event. One perf
> command affects the other because these events only count when SPE is
> enabled. Enabling SPE takes effect per core, which is why the example
> keeps things simple by enabling SPE system-wide.
>
> SPE is an exclusive PMU, like Coresight and some others, so it can't be
> affected by multiplexing or anything like that. The SAMPLE_FEED_LD event
> can be multiplexed, but as long as SPE stays enabled it will count the
> right thing regardless.

Thanks James, sorry for my SPE ignorance. I'm smiling about the use of
the word exclusive. When I was trying to make the tests run in
parallel I used a file lock - so shared and exclusive. There were a
lot of issues with that, hence switching to 2 phases in the test,
parallel then sequential but I kept the "exclusive" tag for want of a
better word. Perhaps the notion of an exclusive PMU existed previously
but maybe I've accidentally invented the term by way of a failed file
lock experiment :-)

Presumably the two PMUs side-effecting each other is a known thing. I
wonder if we can capture this in the documentation. When you say
"normal PMU event" you mean core PMU events?

Thanks,
Ian

James Clark Dec. 19, 2024, 10:10 a.m. UTC | #4

On 18/12/2024 7:47 pm, Ian Rogers wrote:
> Thanks James, sorry for my SPE ignorance. I'm smiling about the use of
> the word exclusive. When I was trying to make the tests run in
> parallel I used a file lock - so shared and exclusive. There were a
> lot of issues with that, hence switching to 2 phases in the test,
> parallel then sequential but I kept the "exclusive" tag for want of a
> better word. Perhaps the notion of an exclusive PMU existed previously

Yeah, see PERF_PMU_CAP_EXCLUSIVE. Hopefully it doesn't cause too much 
confusion, the context of test vs PMU should make it clear.

> but maybe I've accidentally invented the term by way of a failed file
> lock experiment :-)
> 
> Presumably the two PMUs side-effecting each other is a known thing. I
> wonder if we can capture this in the documentation. When you say
> "normal PMU event" you mean core PMU events?
> 
> Thanks,
> Ian

It should be a known thing, yes. Discard mode doesn't change this
behavior anyway; it just makes one use case of it better. I can add
another section about it to this SPE manpage in a v2, which is probably
the best place for it.

And yes, I meant core PMU event. I can clarify that the second example
command is for a core PMU to avoid any doubt.

Thanks
James
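
The workflow discussed in this thread (start a system-wide SPE session in discard mode, then count a core PMU event such as SAMPLE_FEED_LD while it runs) could be scripted roughly as follows. This is only a sketch: the function names `spe_has_discard` and `count_with_spe_discard` are hypothetical, and the check assumes that discard support shows up as a `discard` file in the SPE PMU's sysfs `format` directory, as it would for a format option added by patch 1/5. The perf options are the ones from the documentation example.

```shell
#!/bin/sh
# Sketch only: function names and the sysfs check are illustrative
# assumptions, not part of the patch series.

# Return 0 if any arm_spe PMU under the given sysfs root exposes a
# "discard" format attribute. The root is a parameter so the check can
# be exercised against a fake directory tree.
spe_has_discard() {
    root="${1:-/sys/bus/event_source/devices}"
    for f in "$root"/arm_spe_*/format/discard; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Count a core PMU event while a system-wide SPE session runs in discard
# mode, then stop the SPE session. $1 is the event name; the remaining
# arguments are the workload to measure.
count_with_spe_discard() {
    event="$1"
    shift
    perf record -e arm_spe/discard/ -a -N -B --no-bpf-event -o - > /dev/null &
    spe_pid=$!
    perf stat -e "$event" -- "$@"
    status=$?
    # perf record finishes up cleanly on SIGTERM.
    kill "$spe_pid" 2>/dev/null
    wait "$spe_pid" 2>/dev/null
    return "$status"
}
```

A call might look like `spe_has_discard && count_with_spe_discard SAMPLE_FEED_LD sleep 1`. The sketch does not wait for SPE to be enabled before perf stat starts, so a very short workload may begin before the event starts counting.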
