[v8,1/2] perf, uncore: Adding documentation for ThunderX2 pmu uncore driver

Message ID 20181122030354.13570-2-ganapatrao.kulkarni@cavium.com
State New, archived
Series Add ThunderX2 SoC Performance Monitoring Unit driver

Commit Message

Kulkarni, Ganapatrao Nov. 22, 2018, 3:04 a.m. UTC
The SoC has PMU support in its L3 cache controller (L3C) and in the
DDR4 Memory Controller (DMC).

Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
---
 Documentation/perf/thunderx2-pmu.txt | 106 +++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)
 create mode 100644 Documentation/perf/thunderx2-pmu.txt

Comments

Will Deacon Dec. 3, 2018, 12:09 p.m. UTC | #1
On Thu, Nov 22, 2018 at 03:04:31AM +0000, Kulkarni, Ganapatrao wrote:
> The SoC has PMU support in its L3 cache controller (L3C) and in the
> DDR4 Memory Controller (DMC).
> 
> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
> ---
>  Documentation/perf/thunderx2-pmu.txt | 106 +++++++++++++++++++++++++++
>  1 file changed, 106 insertions(+)
>  create mode 100644 Documentation/perf/thunderx2-pmu.txt

Thanks for writing the documentation, although I think it needs a bit of
help before we can merge it.

> diff --git a/Documentation/perf/thunderx2-pmu.txt b/Documentation/perf/thunderx2-pmu.txt
> new file mode 100644
> index 000000000000..9f5dd7459e68
> --- /dev/null
> +++ b/Documentation/perf/thunderx2-pmu.txt
> @@ -0,0 +1,106 @@
> +
> +Cavium ThunderX2 SoC Performance Monitoring Unit (PMU UNCORE)
> +==========================================================================
> +
> +ThunderX2 SoC PMU consists of independent system wide per Socket PMUs such
> +as Level 3 Cache(L3C) and DDR4 Memory Controller(DMC).

Please add some punctuation here.

> +
> +DMC has 8 interleave channels and L3C has 16 interleave tiles. Events are

*The* DMC and *the* L3C


> +sampled for default channel(i.e channel 0) and prorated to total number of

I'm not sure I understand this; are you saying it's not possible to sample
channels other than channel 0?

> +channels/tiles.
> +
> +DMC and L3C, Each PMU supports up to 4 counters. Counters are independently

The start of this sentence makes no sense and you've got a capital "Each".

> +programmable and can be started and stopped individually. Each counter can
> +be set to sample specific perf events. Counters are 32 bit and do not support
> +overflow interrupt; they are sampled at every 2 seconds.

I think this is unfortunate wording, because actually you don't support what
perf calls "sampling" at all.

> +
> +PMU UNCORE (perf) driver:
> +
> +The thunderx2-pmu driver registers several perf PMUs for DMC and L3C devices.

I think the driver name uses an underscore instead of a hyphen.

> +Each of the PMUs provides description of its available events
> +and configuration options in sysfs.
> +	see /sys/devices/uncore_<l3c_S/dmc_S/>
> +
> +S is socket id.

*the* socket id

> +Each PMU can be used to sample up to 4 events simultaneously.
> +
> +The "format" directory describes format of the config (event ID).
> +The "events" directory provides configuration templates for all
> +supported event types that can be used with perf tool.

You can drop this bit, since it's not specific to your PMU and is actually
describing the perf ABI via sysfs. If we want to document that someplace, it
should be in a separate file.

> +
> +For example, "uncore_dmc_0/cnt_cycles/" is an
> +equivalent of "uncore_dmc_0/config=0x1/".

Why is this helpful?

> +
> +Each perf driver also provides a "cpumask" sysfs attribute, which contains a
> +single CPU ID of the processor which is likely to be used to handle all the
> +PMU events. It will be the first online CPU from the NUMA node of the PMU device.

Again, I don't think this really belongs in here.

> +
> +Example for perf tool use:
> +
> +perf stat -a -e uncore_dmc_0/cnt_cycles/ sleep 1
> +
> +perf stat -a -e \
> +uncore_dmc_0/cnt_cycles/,\
> +uncore_dmc_0/data_transfers/,\
> +uncore_dmc_0/read_txns/,\
> +uncore_dmc_0/write_txns/ sleep 1
> +
> +perf stat -a -e \
> +uncore_l3c_0/read_request/,\
> +uncore_l3c_0/read_hit/,\
> +uncore_l3c_0/inv_request/,\
> +uncore_l3c_0/inv_hit/ sleep 1
> +
> +The driver does not support sampling, therefore "perf record" will
> +not work. Per-task (without "-a") perf sessions are not supported.

What do you mean by "not supported"? If I invoke perf as:

# ./perf stat -e uncore_dmc_0/cnt_cycles/ -- ls

then I get results back.

> +
> +L3C events:
> +============
> +
> +read_request:
> +	Number of Read requests received by the L3 Cache.
> +	This include Read as well as Read Exclusives.
> +
> +read_hit:
> +	Number of Read requests received by the L3 cache that were hit
> +	in the L3 (Data provided form the L3)
> +
> +writeback_request:
> +	Number of Write Backs received by the L3 Cache. These are basically
> +	the L2 Evicts and writes from the PCIe Write Cache.
> +
> +inv_nwrite_request:
> +	This is the Number of Invalidate and Write received by the L3 Cache.
> +	Also Writes from IO that did not go through the PCIe Write Cache.
> +
> +inv_nwrite_hit
> +	This is the Number of Invalidate and Write received by the L3 Cache
> +	That were a hit in the L3 Cache.
> +
> +inv_request:
> +	Number of Invalidate request received by the L3 Cache.
> +
> +inv_hit:
> +	Number of Invalidate request received by the L3 Cache that were a
> +	hit in L3.
> +
> +evict_request:
> +	Number of Evicts that the L3 generated.

Wouldn't this be better off in the perf tools sources, as part of the JSON
events file for your PMU?

> +
> +NOTE:
> +1. Granularity of all these events counter value is cache line length(64 Bytes).
> +2. L3C cache Hit Ratio = (read_hit + inv_nwrite_hit + inv_hit) / (read_request + inv_nwrite_request + inv_request)
> +
> +DMC events:
> +============
> +cnt_cycles:
> +	Count cycles (Clocks at the DMC clock rate)
> +
> +write_txns:
> +	Number of 64 Bytes write transactions received by the DMC(s)
> +
> +read_txns:
> +	Number of 64 Bytes Read transactions received by the DMC(s)
> +
> +data_transfers:
> +	Number of 64 Bytes data transferred to or from DRAM.

Same here.

Will
Ganapatrao Kulkarni Dec. 4, 2018, 5:24 a.m. UTC | #2
Hi Will,

On Mon, Dec 3, 2018 at 5:39 PM Will Deacon <will.deacon@arm.com> wrote:
>
> On Thu, Nov 22, 2018 at 03:04:31AM +0000, Kulkarni, Ganapatrao wrote:
> > The SoC has PMU support in its L3 cache controller (L3C) and in the
> > DDR4 Memory Controller (DMC).
> >
> > Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
> > ---
> >  Documentation/perf/thunderx2-pmu.txt | 106 +++++++++++++++++++++++++++
> >  1 file changed, 106 insertions(+)
> >  create mode 100644 Documentation/perf/thunderx2-pmu.txt
>
> Thanks for writing the documentation, although I think it needs a bit of
> help before we can merge it.

Sure, I will send the next version ASAP.
>
> > diff --git a/Documentation/perf/thunderx2-pmu.txt b/Documentation/perf/thunderx2-pmu.txt
> > new file mode 100644
> > index 000000000000..9f5dd7459e68
> > --- /dev/null
> > +++ b/Documentation/perf/thunderx2-pmu.txt
> > @@ -0,0 +1,106 @@
> > +
> > +Cavium ThunderX2 SoC Performance Monitoring Unit (PMU UNCORE)
> > +==========================================================================
> > +
> > +ThunderX2 SoC PMU consists of independent system wide per Socket PMUs such
> > +as Level 3 Cache(L3C) and DDR4 Memory Controller(DMC).
>
> Please add some punctuation here.

Thanks, will do.
>
> > +
> > +DMC has 8 interleave channels and L3C has 16 interleave tiles. Events are
>
> *The* DMC and *the* L3C

ok
>
>
> > +sampled for default channel(i.e channel 0) and prorated to total number of
>
> I'm not sure I understand this; are you saying it's not possible to sample
> channels other than channel 0?

Yes, only channel zero is sampled. Since the channels are interleaved,
multiplying by the number of channels gives a fair estimate.
Per-channel sampling was removed because it involved SMC calls.
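
As a rough illustration of the proration described here (a sketch only, not
the driver's code), a count read from channel 0 could be scaled as below. The
channel and tile counts (8 and 16) come from the patch text; the function name
`prorate` and the sample values are hypothetical, and the result is only a
fair estimate if traffic really is spread evenly across the interleaved units.

```python
# Interleave widths as stated in the patch documentation.
DMC_CHANNELS = 8
L3C_TILES = 16


def prorate(unit0_count: int, num_units: int) -> int:
    """Estimate a device-wide count from the count observed on unit 0.

    Assumes traffic is evenly interleaved across all units, so the
    count seen on one unit is 1/num_units of the total.
    """
    if unit0_count < 0 or num_units <= 0:
        raise ValueError("count must be non-negative and num_units positive")
    return unit0_count * num_units


# Hypothetical raw readings of 1000 events on channel/tile 0.
print(prorate(1000, DMC_CHANNELS))  # 8000
print(prorate(1000, L3C_TILES))     # 16000
```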

>
> > +channels/tiles.
> > +
> > +DMC and L3C, Each PMU supports up to 4 counters. Counters are independently
>
> The start of this sentence makes no sense and you've got a capital "Each".
>
> > +programmable and can be started and stopped individually. Each counter can
> > +be set to sample specific perf events. Counters are 32 bit and do not support
> > +overflow interrupt; they are sampled at every 2 seconds.
>
> I think this is unfortunate wording, because actually you don't support what
> perf calls "sampling" at all.

ok, let me rephrase it.
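
For reference, what the quoted text describes (32-bit counters with no
overflow interrupt, read every 2 seconds) amounts to periodic polling with
wraparound handling rather than perf "sampling". A minimal sketch, not taken
from the driver; `CounterAccumulator` and `poll` are hypothetical names, and
the result is only correct if the counter wraps at most once between polls.

```python
COUNTER_MASK = 0xFFFFFFFF  # the hardware counters are 32 bits wide


class CounterAccumulator:
    """Accumulate a 64-bit-style total from a wrapping 32-bit counter."""

    def __init__(self, initial_raw: int = 0) -> None:
        self.prev_raw = initial_raw & COUNTER_MASK
        self.total = 0

    def poll(self, raw: int) -> int:
        """Fold one periodic raw reading into the running total."""
        raw &= COUNTER_MASK
        # Modular subtraction yields the right delta across one wraparound.
        delta = (raw - self.prev_raw) & COUNTER_MASK
        self.prev_raw = raw
        self.total += delta
        return self.total


# Hypothetical readings: the counter wraps between the first two polls.
acc = CounterAccumulator(initial_raw=0xFFFFFFF0)
print(acc.poll(0x00000010))  # 32
print(acc.poll(0x00000030))  # 64
```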
>
> > +
> > +PMU UNCORE (perf) driver:
> > +
> > +The thunderx2-pmu driver registers several perf PMUs for DMC and L3C devices.
>
> I think the driver name uses an underscore instead of a hyphen.

thanks.
>
> > +Each of the PMUs provides description of its available events
> > +and configuration options in sysfs.
> > +     see /sys/devices/uncore_<l3c_S/dmc_S/>
> > +
> > +S is socket id.
>
> *the* socket id

ok
>
> > +Each PMU can be used to sample up to 4 events simultaneously.
> > +
> > +The "format" directory describes format of the config (event ID).
> > +The "events" directory provides configuration templates for all
> > +supported event types that can be used with perf tool.
>
> You can drop this bit, since it's not specific to your PMU and is actually
> describing the perf ABI via sysfs. If we want to document that someplace, it
> should be in a separate file.

OK, let me drop the sysfs part.
>
> > +
> > +For example, "uncore_dmc_0/cnt_cycles/" is an
> > +equivalent of "uncore_dmc_0/config=0x1/".
>
> Why is this helpful?

OK, let me drop the second line.
>
> > +
> > +Each perf driver also provides a "cpumask" sysfs attribute, which contains a
> > +single CPU ID of the processor which is likely to be used to handle all the
> > +PMU events. It will be the first online CPU from the NUMA node of the PMU device.
>
> Again, I don't think this really belongs in here.

ok.
>
> > +
> > +Example for perf tool use:
> > +
> > +perf stat -a -e uncore_dmc_0/cnt_cycles/ sleep 1
> > +
> > +perf stat -a -e \
> > +uncore_dmc_0/cnt_cycles/,\
> > +uncore_dmc_0/data_transfers/,\
> > +uncore_dmc_0/read_txns/,\
> > +uncore_dmc_0/write_txns/ sleep 1
> > +
> > +perf stat -a -e \
> > +uncore_l3c_0/read_request/,\
> > +uncore_l3c_0/read_hit/,\
> > +uncore_l3c_0/inv_request/,\
> > +uncore_l3c_0/inv_hit/ sleep 1
> > +
> > +The driver does not support sampling, therefore "perf record" will
> > +not work. Per-task (without "-a") perf sessions are not supported.
>
> What do you mean by "not supported"? If I invoke perf as:

I meant the --per-core option; this needs rephrasing.

>
> # ./perf stat -e uncore_dmc_0/cnt_cycles/ -- ls
>
> then I get results back.
>
> > +
> > +L3C events:
> > +============
> > +
> > +read_request:
> > +     Number of Read requests received by the L3 Cache.
> > +     This include Read as well as Read Exclusives.
> > +
> > +read_hit:
> > +     Number of Read requests received by the L3 cache that were hit
> > +     in the L3 (Data provided form the L3)
> > +
> > +writeback_request:
> > +     Number of Write Backs received by the L3 Cache. These are basically
> > +     the L2 Evicts and writes from the PCIe Write Cache.
> > +
> > +inv_nwrite_request:
> > +     This is the Number of Invalidate and Write received by the L3 Cache.
> > +     Also Writes from IO that did not go through the PCIe Write Cache.
> > +
> > +inv_nwrite_hit
> > +     This is the Number of Invalidate and Write received by the L3 Cache
> > +     That were a hit in the L3 Cache.
> > +
> > +inv_request:
> > +     Number of Invalidate request received by the L3 Cache.
> > +
> > +inv_hit:
> > +     Number of Invalidate request received by the L3 Cache that were a
> > +     hit in L3.
> > +
> > +evict_request:
> > +     Number of Evicts that the L3 generated.
>
> Wouldn't this be better off in the perf tools sources, as part of the JSON
> events file for your PMU?

That could be a separate effort, to move all arm64 vendors' uncore events
to the JSON framework.
>
> > +
> > +NOTE:
> > +1. Granularity of all these events counter value is cache line length(64 Bytes).
> > +2. L3C cache Hit Ratio = (read_hit + inv_nwrite_hit + inv_hit) / (read_request + inv_nwrite_request + inv_request)
> > +
> > +DMC events:
> > +============
> > +cnt_cycles:
> > +     Count cycles (Clocks at the DMC clock rate)
> > +
> > +write_txns:
> > +     Number of 64 Bytes write transactions received by the DMC(s)
> > +
> > +read_txns:
> > +     Number of 64 Bytes Read transactions received by the DMC(s)
> > +
> > +data_transfers:
> > +     Number of 64 Bytes data transferred to or from DRAM.
>
> Same here.
>
> Will

thanks
Ganapat
Patch

diff --git a/Documentation/perf/thunderx2-pmu.txt b/Documentation/perf/thunderx2-pmu.txt
new file mode 100644
index 000000000000..9f5dd7459e68
--- /dev/null
+++ b/Documentation/perf/thunderx2-pmu.txt
@@ -0,0 +1,106 @@ 
+
+Cavium ThunderX2 SoC Performance Monitoring Unit (PMU UNCORE)
+==========================================================================
+
+ThunderX2 SoC PMU consists of independent system wide per Socket PMUs such
+as Level 3 Cache(L3C) and DDR4 Memory Controller(DMC).
+
+DMC has 8 interleave channels and L3C has 16 interleave tiles. Events are
+sampled for default channel(i.e channel 0) and prorated to total number of
+channels/tiles.
+
+DMC and L3C, Each PMU supports up to 4 counters. Counters are independently
+programmable and can be started and stopped individually. Each counter can
+be set to sample specific perf events. Counters are 32 bit and do not support
+overflow interrupt; they are sampled at every 2 seconds.
+
+PMU UNCORE (perf) driver:
+
+The thunderx2-pmu driver registers several perf PMUs for DMC and L3C devices.
+Each of the PMUs provides description of its available events
+and configuration options in sysfs.
+	see /sys/devices/uncore_<l3c_S/dmc_S/>
+
+S is socket id.
+Each PMU can be used to sample up to 4 events simultaneously.
+
+The "format" directory describes format of the config (event ID).
+The "events" directory provides configuration templates for all
+supported event types that can be used with perf tool.
+
+For example, "uncore_dmc_0/cnt_cycles/" is an
+equivalent of "uncore_dmc_0/config=0x1/".
+
+Each perf driver also provides a "cpumask" sysfs attribute, which contains a
+single CPU ID of the processor which is likely to be used to handle all the
+PMU events. It will be the first online CPU from the NUMA node of the PMU device.
+
+Example for perf tool use:
+
+perf stat -a -e uncore_dmc_0/cnt_cycles/ sleep 1
+
+perf stat -a -e \
+uncore_dmc_0/cnt_cycles/,\
+uncore_dmc_0/data_transfers/,\
+uncore_dmc_0/read_txns/,\
+uncore_dmc_0/write_txns/ sleep 1
+
+perf stat -a -e \
+uncore_l3c_0/read_request/,\
+uncore_l3c_0/read_hit/,\
+uncore_l3c_0/inv_request/,\
+uncore_l3c_0/inv_hit/ sleep 1
+
+The driver does not support sampling, therefore "perf record" will
+not work. Per-task (without "-a") perf sessions are not supported.
+
+L3C events:
+============
+
+read_request:
+	Number of Read requests received by the L3 Cache.
+	This include Read as well as Read Exclusives.
+
+read_hit:
+	Number of Read requests received by the L3 cache that were hit
+	in the L3 (Data provided form the L3)
+
+writeback_request:
+	Number of Write Backs received by the L3 Cache. These are basically
+	the L2 Evicts and writes from the PCIe Write Cache.
+
+inv_nwrite_request:
+	This is the Number of Invalidate and Write received by the L3 Cache.
+	Also Writes from IO that did not go through the PCIe Write Cache.
+
+inv_nwrite_hit
+	This is the Number of Invalidate and Write received by the L3 Cache
+	That were a hit in the L3 Cache.
+
+inv_request:
+	Number of Invalidate request received by the L3 Cache.
+
+inv_hit:
+	Number of Invalidate request received by the L3 Cache that were a
+	hit in L3.
+
+evict_request:
+	Number of Evicts that the L3 generated.
+
+NOTE:
+1. Granularity of all these events counter value is cache line length(64 Bytes).
+2. L3C cache Hit Ratio = (read_hit + inv_nwrite_hit + inv_hit) / (read_request + inv_nwrite_request + inv_request)
+
+DMC events:
+============
+cnt_cycles:
+	Count cycles (Clocks at the DMC clock rate)
+
+write_txns:
+	Number of 64 Bytes write transactions received by the DMC(s)
+
+read_txns:
+	Number of 64 Bytes Read transactions received by the DMC(s)
+
+data_transfers:
+	Number of 64 Bytes data transferred to or from DRAM.
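
The hit-ratio formula in NOTE 2 of the patch and the 64-byte granularity of
the DMC events can be applied to counts collected with "perf stat" roughly as
follows. This is a sketch under stated assumptions: the event names are the
ones listed in the patch, while the function names and all sample values are
hypothetical and not measured on real hardware.

```python
from typing import Mapping, Optional

CACHE_LINE_BYTES = 64  # granularity stated in the patch documentation


def l3c_hit_ratio(counts: Mapping[str, int]) -> Optional[float]:
    """L3C hit ratio per NOTE 2: hits / requests, or None if no requests."""
    hits = counts["read_hit"] + counts["inv_nwrite_hit"] + counts["inv_hit"]
    requests = (counts["read_request"]
                + counts["inv_nwrite_request"]
                + counts["inv_request"])
    if requests == 0:
        return None
    return hits / requests


def dmc_bytes_transferred(data_transfers: int) -> int:
    """Convert a data_transfers count into bytes moved to or from DRAM."""
    return data_transfers * CACHE_LINE_BYTES


# Hypothetical counts, as might be copied from "perf stat" output.
sample = {
    "read_request": 800, "read_hit": 600,
    "inv_nwrite_request": 150, "inv_nwrite_hit": 100,
    "inv_request": 50, "inv_hit": 50,
}
print(l3c_hit_ratio(sample))        # 0.75
print(dmc_bytes_transferred(1000))  # 64000
```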