Message ID | 8bd93a25e069ed6428bc6c21fc53270962ccafcc.1455295032.git.jglauber@cavium.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: > Provide uncore facilities for non-CPU performance counter units. > Based on Intel/AMD uncore pmu support. > > The uncore PMUs can be found under /sys/bus/event_source/devices. > All counters are exported via sysfs in the corresponding events > files under the PMU directory so the perf tool can list the event names. It turns out that "uncore" covers quite a lot of things. Where exactly do the see counters live? system, socket, cluster? Are there potentially multiple instances of a given PMU in the system? e.g. might each clutster have an instance of an L2 PMU? If I turn off a set of CPUs, do any "uncore" PMUs lost context or become inaccessible? Otherwise, are they associated with some power domain? > There are 2 points that are special in this implementation: > > 1) The PMU detection solely relies on PCI device detection. If a > matching PCI device is found the PMU is created. The code can deal > with multiple units of the same type, e.g. more than one memory > controller. I see below that the driver has an initcall that runs regardless of whether the PCI device exists, and looks at the MIDR. That's clearly not string PCI device detection. Why is this not a true PCI driver that only gets probed if the PCI device exists? > 2) Counters are summarized across the different units of the same type, > e.g. L2C TAD 0..7 is presented as a single counter (adding the > values from TAD 0 to 7). Although losing the ability to read a > single value the merged values are easier to use and yield > enough information. I'm not sure I follow this. What is easier? What are you doing, and what are you comparing that with to say that your approach is easier? It sounds like it should be possible to handle multiple counters like this, so I don't follow why you want to amalgamate them in-kernel. [...] > +#include <asm/cpufeature.h> > +#include <asm/cputype.h> I don't see why you should need these two if this is truly an uncore device probed solely from PCI. > +void thunder_uncore_read(struct perf_event *event) > +{ > + struct thunder_uncore *uncore = event_to_thunder_uncore(event); > + struct hw_perf_event *hwc = &event->hw; > + u64 prev, new = 0; > + s64 delta; > + int i; > + > + /* > + * since we do not enable counter overflow interrupts, > + * we do not have to worry about prev_count changing on us > + */ Without overflow interrupts, how do you ensure that you account for overflow in a reasonable time window (i.e. before the counter runs past its initial value)? > + > + prev = local64_read(&hwc->prev_count); > + > + /* read counter values from all units */ > + for (i = 0; i < uncore->nr_units; i++) > + new += readq(map_offset(hwc->event_base, uncore, i)); There's no bit to determine whether an overflow occurred? > + > + local64_set(&hwc->prev_count, new); > + delta = new - prev; > + local64_add(delta, &event->count); > +} > + > +void thunder_uncore_del(struct perf_event *event, int flags) > +{ > + struct thunder_uncore *uncore = event_to_thunder_uncore(event); > + struct hw_perf_event *hwc = &event->hw; > + int i; > + > + event->pmu->stop(event, PERF_EF_UPDATE); > + > + for (i = 0; i < uncore->num_counters; i++) { > + if (cmpxchg(&uncore->events[i], event, NULL) == event) > + break; > + } > + hwc->idx = -1; > +} Why not just place the event at uncode->events[hwc->idx] ? Theat way removing the event is trivial. > +int thunder_uncore_event_init(struct perf_event *event) > +{ > + struct hw_perf_event *hwc = &event->hw; > + struct thunder_uncore *uncore; > + > + if (event->attr.type != event->pmu->type) > + return -ENOENT; > + > + /* we do not support sampling */ > + if (is_sampling_event(event)) > + return -EINVAL; > + > + /* counters do not have these bits */ > + if (event->attr.exclude_user || > + event->attr.exclude_kernel || > + event->attr.exclude_host || > + event->attr.exclude_guest || > + event->attr.exclude_hv || > + event->attr.exclude_idle) > + return -EINVAL; We should _really_ make these features opt-in at the core level. It's crazy that each and every PMU drivers has to explicitly test and reject things it doesn't support. > + > + /* and we do not enable counter overflow interrupts */ That statement raises far more questions than it answers. _why_ do we not user overflow interrupts? > + > + uncore = event_to_thunder_uncore(event); > + if (!uncore) > + return -ENODEV; > + if (!uncore->event_valid(event->attr.config)) > + return -EINVAL; > + > + hwc->config = event->attr.config; > + hwc->idx = -1; > + > + /* and we don't care about CPU */ Actually, you do. You want the perf core to serialize accesses via the same CPU, so all events _must_ be targetted at the same CPU. Otherwise there are a tonne of problems you don't even want to think about. You _must_ ensure this kernel-side, regardless of what the perf tool happens to do. See the arm-cci and arm-ccn drivers for an example. You can also follow the migration approach used there to allow you to retain counting across a hotplug. [...] > +static int __init thunder_uncore_init(void) > +{ > + unsigned long implementor = read_cpuid_implementor(); > + unsigned long part_number = read_cpuid_part_number(); > + u32 variant; > + > + if (implementor != ARM_CPU_IMP_CAVIUM || > + part_number != CAVIUM_CPU_PART_THUNDERX) > + return -ENODEV; > + > + /* detect pass2 which contains different counters */ > + variant = MIDR_VARIANT(read_cpuid_id()); > + if (variant == 1) > + thunder_uncore_version = 1; > + pr_info("PMU version: %d\n", thunder_uncore_version); > + > + return 0; > +} You should call out these differences in the commmit message. Mark.
On 02/12/2016 09:36 AM, Mark Rutland wrote: > On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: [...] >> 2) Counters are summarized across the different units of the same type, >> e.g. L2C TAD 0..7 is presented as a single counter (adding the >> values from TAD 0 to 7). Although losing the ability to read a >> single value the merged values are easier to use and yield >> enough information. > > I'm not sure I follow this. What is easier? What are you doing, and what > are you comparing that with to say that your approach is easier? > > It sounds like it should be possible to handle multiple counters like > this, so I don't follow why you want to amalgamate them in-kernel. > The values of the individual counters are close to meaningless. The only thing that is meaningful to someone reading the counters is the aggregate sum of all the counts. > [...] > >> +#include <asm/cpufeature.h> >> +#include <asm/cputype.h> > > I don't see why you should need these two if this is truly an uncore > device probed solely from PCI. > >> +void thunder_uncore_read(struct perf_event *event) >> +{ >> + struct thunder_uncore *uncore = event_to_thunder_uncore(event); >> + struct hw_perf_event *hwc = &event->hw; >> + u64 prev, new = 0; >> + s64 delta; >> + int i; >> + >> + /* >> + * since we do not enable counter overflow interrupts, >> + * we do not have to worry about prev_count changing on us >> + */ > > Without overflow interrupts, how do you ensure that you account for > overflow in a reasonable time window (i.e. before the counter runs past > its initial value)? Two reasons: 1) There are no interrupts. 2) The counters are 64-bit, they never overflow. > >> + >> + prev = local64_read(&hwc->prev_count); >> + >> + /* read counter values from all units */ >> + for (i = 0; i < uncore->nr_units; i++) >> + new += readq(map_offset(hwc->event_base, uncore, i)); > > There's no bit to determine whether an overflow occurred? No. > >> + >> + local64_set(&hwc->prev_count, new); >> + delta = new - prev; >> + local64_add(delta, &event->count); >> +} >> + >> +void thunder_uncore_del(struct perf_event *event, int flags) >> +{ >> + struct thunder_uncore *uncore = event_to_thunder_uncore(event); >> + struct hw_perf_event *hwc = &event->hw; >> + int i; >> + >> + event->pmu->stop(event, PERF_EF_UPDATE); >> + >> + for (i = 0; i < uncore->num_counters; i++) { >> + if (cmpxchg(&uncore->events[i], event, NULL) == event) >> + break; >> + } >> + hwc->idx = -1; >> +} > > Why not just place the event at uncode->events[hwc->idx] ? > > Theat way removing the event is trivial. > >> +int thunder_uncore_event_init(struct perf_event *event) >> +{ >> + struct hw_perf_event *hwc = &event->hw; >> + struct thunder_uncore *uncore; >> + >> + if (event->attr.type != event->pmu->type) >> + return -ENOENT; >> + >> + /* we do not support sampling */ >> + if (is_sampling_event(event)) >> + return -EINVAL; >> + >> + /* counters do not have these bits */ >> + if (event->attr.exclude_user || >> + event->attr.exclude_kernel || >> + event->attr.exclude_host || >> + event->attr.exclude_guest || >> + event->attr.exclude_hv || >> + event->attr.exclude_idle) >> + return -EINVAL; > > We should _really_ make these features opt-in at the core level. It's > crazy that each and every PMU drivers has to explicitly test and reject > things it doesn't support. > >> + >> + /* and we do not enable counter overflow interrupts */ > > That statement raises far more questions than it answers. > > _why_ do we not user overflow interrupts? As stated above, there are *no* overflow interrupts. The events we are counting cannot be attributed to any one (or even any) CPU, they run asynchronous to the CPU, so even if there were interrupts, they would be meaningless. David Daney
On Fri, Feb 12, 2016 at 05:47:25PM -0800, David Daney wrote: > On 02/12/2016 09:36 AM, Mark Rutland wrote: > >On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: > [...] > >>2) Counters are summarized across the different units of the same type, > >> e.g. L2C TAD 0..7 is presented as a single counter (adding the > >> values from TAD 0 to 7). Although losing the ability to read a > >> single value the merged values are easier to use and yield > >> enough information. > > > >I'm not sure I follow this. What is easier? What are you doing, and what > >are you comparing that with to say that your approach is easier? > > > >It sounds like it should be possible to handle multiple counters like > >this, so I don't follow why you want to amalgamate them in-kernel. > > > > The values of the individual counters are close to meaningless. The > only thing that is meaningful to someone reading the counters is the > aggregate sum of all the counts. I obviously know nowhere near enough about your system to say with certainty, but it may turn out that being able to track counters individually is useful for some profiling/debugging scenario. How meaningful the individual counts are really depends on what you're trying to figure out. If you believe that aggregate values are sufficient, then I'm happy to leave that as-is. > >>+void thunder_uncore_read(struct perf_event *event) > >>+{ > >>+ struct thunder_uncore *uncore = event_to_thunder_uncore(event); > >>+ struct hw_perf_event *hwc = &event->hw; > >>+ u64 prev, new = 0; > >>+ s64 delta; > >>+ int i; > >>+ > >>+ /* > >>+ * since we do not enable counter overflow interrupts, > >>+ * we do not have to worry about prev_count changing on us > >>+ */ > > > >Without overflow interrupts, how do you ensure that you account for > >overflow in a reasonable time window (i.e. before the counter runs past > >its initial value)? > > Two reasons: > > 1) There are no interrupts. > > 2) The counters are 64-bit, they never overflow. Ok. Please point this out in the comment so that reviewers aren't misled. Stating that we don't enable an interrupt implies that said interrupt exists. > >>+ /* and we do not enable counter overflow interrupts */ > > > >That statement raises far more questions than it answers. > > > >_why_ do we not user overflow interrupts? > > As stated above, there are *no* overflow interrupts. Ok. As stated above, please fix this comment to not mislead. > The events we are counting cannot be attributed to any one (or even > any) CPU, they run asynchronous to the CPU, so even if there were > interrupts, they would be meaningless. Yes, they are meaningless w.r.t. the state of an arbitrary CPU. Were they to exist you could use them to drive other snapshotting of the state of the uncore PMU, to get an idea of the frequency/stability of events over time, etc. Userspace might then decide to snapshot other whole system state based on events fed to it. That's moot if they don't exist. Mark.
Hi Mark, thanks for reviewing! I'll need several mails to address all questions. On Fri, Feb 12, 2016 at 05:36:59PM +0000, Mark Rutland wrote: > On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: > > Provide uncore facilities for non-CPU performance counter units. > > Based on Intel/AMD uncore pmu support. > > > > The uncore PMUs can be found under /sys/bus/event_source/devices. > > All counters are exported via sysfs in the corresponding events > > files under the PMU directory so the perf tool can list the event names. > > It turns out that "uncore" covers quite a lot of things. > > Where exactly do the see counters live? system, socket, cluster? Neither cluster nor socket, so I would say system. Where a system may consist of 2 nodes that mostly appear as one system. > Are there potentially multiple instances of a given PMU in the system? > e.g. might each clutster have an instance of an L2 PMU? Yes. > If I turn off a set of CPUs, do any "uncore" PMUs lost context or become > inaccessible? No, they are not related to CPUs. [...] > > > > 1) The PMU detection solely relies on PCI device detection. If a > > matching PCI device is found the PMU is created. The code can deal > > with multiple units of the same type, e.g. more than one memory > > controller. > > I see below that the driver has an initcall that runs regardless of > whether the PCI device exists, and looks at the MIDR. That's clearly not > string PCI device detection. > > Why is this not a true PCI driver that only gets probed if the PCI > device exists? It is not a PCI driver because there are already drivers like edac that will access these PCI devices. The uncore driver only accesses the performance counters, which are not used by the other drivers. [...] > > +#include <asm/cpufeature.h> > > +#include <asm/cputype.h> > > I don't see why you should need these two if this is truly an uncore > device probed solely from PCI. There are several passes of the hardware that have the same PCI device ID. Therefore I need the CPU variant to distinguish them. This could be done _after_ the PCI device is found but I found it easier to implement the check once in the common setup function. [...] > > +int thunder_uncore_event_init(struct perf_event *event) > > +{ > > + struct hw_perf_event *hwc = &event->hw; > > + struct thunder_uncore *uncore; > > + > > + if (event->attr.type != event->pmu->type) > > + return -ENOENT; > > + > > + /* we do not support sampling */ > > + if (is_sampling_event(event)) > > + return -EINVAL; > > + > > + /* counters do not have these bits */ > > + if (event->attr.exclude_user || > > + event->attr.exclude_kernel || > > + event->attr.exclude_host || > > + event->attr.exclude_guest || > > + event->attr.exclude_hv || > > + event->attr.exclude_idle) > > + return -EINVAL; > > We should _really_ make these features opt-in at the core level. It's > crazy that each and every PMU drivers has to explicitly test and reject > things it doesn't support. Completely agreed. Also, every sample code I looked at did check for other bits... [...] > > + > > + uncore = event_to_thunder_uncore(event); > > + if (!uncore) > > + return -ENODEV; > > + if (!uncore->event_valid(event->attr.config)) > > + return -EINVAL; > > + > > + hwc->config = event->attr.config; > > + hwc->idx = -1; > > + > > + /* and we don't care about CPU */ > > Actually, you do. You want the perf core to serialize accesses via the > same CPU, so all events _must_ be targetted at the same CPU. Otherwise > there are a tonne of problems you don't even want to think about. I found that perf added the events on every CPU in the system. Because the uncore events are not CPU related I wanted to avoid this. Setting cpumask to -1 did not work. Therefore I added a single CPU in the cpumask, see thunder_uncore_attr_show_cpumask(). > You _must_ ensure this kernel-side, regardless of what the perf tool > happens to do. > > See the arm-cci and arm-ccn drivers for an example. > > You can also follow the migration approach used there to allow you to > retain counting across a hotplug. > > [...] > > > +static int __init thunder_uncore_init(void) > > +{ > > + unsigned long implementor = read_cpuid_implementor(); > > + unsigned long part_number = read_cpuid_part_number(); > > + u32 variant; > > + > > + if (implementor != ARM_CPU_IMP_CAVIUM || > > + part_number != CAVIUM_CPU_PART_THUNDERX) > > + return -ENODEV; > > + > > + /* detect pass2 which contains different counters */ > > + variant = MIDR_VARIANT(read_cpuid_id()); > > + if (variant == 1) > > + thunder_uncore_version = 1; > > + pr_info("PMU version: %d\n", thunder_uncore_version); > > + > > + return 0; > > +} > > You should call out these differences in the commmit message. OK > Mark. Thanks, Jan
On Mon, Feb 15, 2016 at 03:07:20PM +0100, Jan Glauber wrote: > Hi Mark, > > thanks for reviewing! I'll need several mails to address all questions. > > On Fri, Feb 12, 2016 at 05:36:59PM +0000, Mark Rutland wrote: > > On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: > > > Provide uncore facilities for non-CPU performance counter units. > > > Based on Intel/AMD uncore pmu support. > > > > > > The uncore PMUs can be found under /sys/bus/event_source/devices. > > > All counters are exported via sysfs in the corresponding events > > > files under the PMU directory so the perf tool can list the event names. > > > > It turns out that "uncore" covers quite a lot of things. > > > > Where exactly do the see counters live? system, socket, cluster? > > Neither cluster nor socket, so I would say system. Where a system may > consist of 2 nodes that mostly appear as one system. > > > Are there potentially multiple instances of a given PMU in the system? > > e.g. might each clutster have an instance of an L2 PMU? > > Yes. > > > If I turn off a set of CPUs, do any "uncore" PMUs lost context or become > > inaccessible? > > No, they are not related to CPUs. Ok. So I should be able to concurrently hotplug random CPUs on/off while this driver is running, without issues? No registers might be clock-gated or similar? I appreciate that they are not "related" to particular CPUs as such. > > > 1) The PMU detection solely relies on PCI device detection. If a > > > matching PCI device is found the PMU is created. The code can deal > > > with multiple units of the same type, e.g. more than one memory > > > controller. > > > > I see below that the driver has an initcall that runs regardless of > > whether the PCI device exists, and looks at the MIDR. That's clearly not > > string PCI device detection. > > > > Why is this not a true PCI driver that only gets probed if the PCI > > device exists? > > It is not a PCI driver because there are already drivers like edac that > will access these PCI devices. The uncore driver only accesses the > performance counters, which are not used by the other drivers. Several drivers are accessing the same device? That sounds somewhat scary. > > > +#include <asm/cpufeature.h> > > > +#include <asm/cputype.h> > > > > I don't see why you should need these two if this is truly an uncore > > device probed solely from PCI. > > There are several passes of the hardware that have the same PCI device > ID. Therefore I need the CPU variant to distinguish them. This could > be done _after_ the PCI device is found but I found it easier to > implement the check once in the common setup function. Ok. Please call that out in the commit message. > > > +int thunder_uncore_event_init(struct perf_event *event) > > > +{ > > > + struct hw_perf_event *hwc = &event->hw; > > > + struct thunder_uncore *uncore; > > > + > > > + if (event->attr.type != event->pmu->type) > > > + return -ENOENT; > > > + > > > + /* we do not support sampling */ > > > + if (is_sampling_event(event)) > > > + return -EINVAL; > > > + > > > + /* counters do not have these bits */ > > > + if (event->attr.exclude_user || > > > + event->attr.exclude_kernel || > > > + event->attr.exclude_host || > > > + event->attr.exclude_guest || > > > + event->attr.exclude_hv || > > > + event->attr.exclude_idle) > > > + return -EINVAL; > > > > We should _really_ make these features opt-in at the core level. It's > > crazy that each and every PMU drivers has to explicitly test and reject > > things it doesn't support. > > Completely agreed. Also, every sample code I looked at did > check for other bits... > > [...] > > > > + > > > + uncore = event_to_thunder_uncore(event); > > > + if (!uncore) > > > + return -ENODEV; > > > + if (!uncore->event_valid(event->attr.config)) > > > + return -EINVAL; > > > + > > > + hwc->config = event->attr.config; > > > + hwc->idx = -1; > > > + > > > + /* and we don't care about CPU */ > > > > Actually, you do. You want the perf core to serialize accesses via the > > same CPU, so all events _must_ be targetted at the same CPU. Otherwise > > there are a tonne of problems you don't even want to think about. > > I found that perf added the events on every CPU in the system. Because > the uncore events are not CPU related I wanted to avoid this. Setting > cpumask to -1 did not work. Therefore I added a single CPU in the > cpumask, see thunder_uncore_attr_show_cpumask(). I understand that, which is why I wrote: > > You _must_ ensure this kernel-side, regardless of what the perf tool > > happens to do. > > > > See the arm-cci and arm-ccn drivers for an example. Take a look at drivers/bus/arm-cci.c; specifically, what we do in cci_pmu_event_init and cci_pmu_cpu_notifier. This is the same thing that's done for x86 system PMUs. Take a look at uncore_pmu_event_init in arch/x86/kernel/cpu/perf_event_intel_uncore.c. Otherwise there are a number of situations where userspace might open events on different CPUs, and you get some freaky results because the perf core expects accesses to a PMU and its related data structures to be strictly serialised through _some_ CPU (even if that CPU is arbitrarily chosen). For example, if CPU 0 was offline when one event was opened, then cpu0 was hotplugged, then a second event was opened, there would be an event in the CPU1 context and another in the CPU0 context. Modification to the PMU state is done per CPU-context in the core, and these would race against each other given the lack of serialisation. Even with sufficient locking to prevent outright corruption, things like event rotation would race leading to very non-deterministic results. Thanks, Mark.
On Mon, Feb 15, 2016 at 02:27:27PM +0000, Mark Rutland wrote: > > > > + uncore = event_to_thunder_uncore(event); > > > > + if (!uncore) > > > > + return -ENODEV; > > > > + if (!uncore->event_valid(event->attr.config)) > > > > + return -EINVAL; > > > > + > > > > + hwc->config = event->attr.config; > > > > + hwc->idx = -1; > > > > + > > > > + /* and we don't care about CPU */ > > > > > > Actually, you do. You want the perf core to serialize accesses via the > > > same CPU, so all events _must_ be targetted at the same CPU. Otherwise > > > there are a tonne of problems you don't even want to think about. > > > > I found that perf added the events on every CPU in the system. Because > > the uncore events are not CPU related I wanted to avoid this. Setting > > cpumask to -1 did not work. Therefore I added a single CPU in the > > cpumask, see thunder_uncore_attr_show_cpumask(). > > I understand that, which is why I wrote: > > > > You _must_ ensure this kernel-side, regardless of what the perf tool > > > happens to do. > > > > > > See the arm-cci and arm-ccn drivers for an example. > > Take a look at drivers/bus/arm-cci.c; specifically, what we do in > cci_pmu_event_init and cci_pmu_cpu_notifier. > > This is the same thing that's done for x86 system PMUs. Take a look at > uncore_pmu_event_init in arch/x86/kernel/cpu/perf_event_intel_uncore.c. I note that we still have an open TODO rather than a call to perf_pmu_migrate_context. The better example is arm_ccn_pmu_cpu_notifier in drivers/bus/arm-ccn.c. Mark.
On Mon, Feb 15, 2016 at 02:27:27PM +0000, Mark Rutland wrote: > > > > 1) The PMU detection solely relies on PCI device detection. If a > > > > matching PCI device is found the PMU is created. The code can deal > > > > with multiple units of the same type, e.g. more than one memory > > > > controller. > > > > > > I see below that the driver has an initcall that runs regardless of > > > whether the PCI device exists, and looks at the MIDR. That's clearly not > > > string PCI device detection. > > > > > > Why is this not a true PCI driver that only gets probed if the PCI > > > device exists? > > > > It is not a PCI driver because there are already drivers like edac that > > will access these PCI devices. The uncore driver only accesses the > > performance counters, which are not used by the other drivers. > > Several drivers are accessing the same device? > > That sounds somewhat scary. I've double checked that the edac drivers do not access the performance counters at all. It felt cleaner to me to keep the performance counter code seperated from edac. Jan
On Fri, Feb 12, 2016 at 05:36:59PM +0000, Mark Rutland wrote: > On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: > > Provide uncore facilities for non-CPU performance counter units. > > Based on Intel/AMD uncore pmu support. > > > > The uncore PMUs can be found under /sys/bus/event_source/devices. > > All counters are exported via sysfs in the corresponding events > > files under the PMU directory so the perf tool can list the event names. > > It turns out that "uncore" covers quite a lot of things. > > Where exactly do the see counters live? system, socket, cluster? > > Are there potentially multiple instances of a given PMU in the system? > e.g. might each clutster have an instance of an L2 PMU? Thinking twice about the multi-node systems I would like to change the implementation to not merge counters across nodes. There might be value in having the counters per node and also performance wise it would be better to merge counters only on the local node. I'll introduce a node sysfs attribute that can be combined with the event and fill the node with the numa id. [...] > > Otherwise, are they associated with some power domain? No, power domains are not used for the "uncore" related devices. These devices are currently always on. Jan
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile index 83cd7e6..c2d2810 100644 --- a/arch/arm64/kernel/Makefile +++ b/arch/arm64/kernel/Makefile @@ -42,6 +42,7 @@ arm64-obj-$(CONFIG_PCI) += pci.o arm64-obj-$(CONFIG_ARMV8_DEPRECATED) += armv8_deprecated.o arm64-obj-$(CONFIG_ACPI) += acpi.o arm64-obj-$(CONFIG_PARAVIRT) += paravirt.o +arm64-obj-$(CONFIG_ARCH_THUNDER) += uncore/ obj-y += $(arm64-obj-y) vdso/ obj-m += $(arm64-obj-m) diff --git a/arch/arm64/kernel/uncore/Makefile b/arch/arm64/kernel/uncore/Makefile new file mode 100644 index 0000000..b9c72c2 --- /dev/null +++ b/arch/arm64/kernel/uncore/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_ARCH_THUNDER) += uncore_cavium.o diff --git a/arch/arm64/kernel/uncore/uncore_cavium.c b/arch/arm64/kernel/uncore/uncore_cavium.c new file mode 100644 index 0000000..0cfcc83 --- /dev/null +++ b/arch/arm64/kernel/uncore/uncore_cavium.c @@ -0,0 +1,210 @@ +/* + * Cavium Thunder uncore PMU support. Derived from Intel and AMD uncore code. + * + * Copyright (C) 2015,2016 Cavium Inc. + * Author: Jan Glauber <jan.glauber@cavium.com> + */ + +#include <linux/types.h> +#include <linux/slab.h> +#include <linux/init.h> +#include <linux/cpu.h> +#include <linux/io.h> +#include <linux/perf_event.h> +#include <linux/pci.h> + +#include <asm/cpufeature.h> +#include <asm/cputype.h> + +#include "uncore_cavium.h" + +int thunder_uncore_version; + +struct thunder_uncore *event_to_thunder_uncore(struct perf_event *event) +{ + return NULL; +} + +void thunder_uncore_read(struct perf_event *event) +{ + struct thunder_uncore *uncore = event_to_thunder_uncore(event); + struct hw_perf_event *hwc = &event->hw; + u64 prev, new = 0; + s64 delta; + int i; + + /* + * since we do not enable counter overflow interrupts, + * we do not have to worry about prev_count changing on us + */ + + prev = local64_read(&hwc->prev_count); + + /* read counter values from all units */ + for (i = 0; i < uncore->nr_units; i++) + new += readq(map_offset(hwc->event_base, uncore, i)); + + local64_set(&hwc->prev_count, new); + delta = new - prev; + local64_add(delta, &event->count); +} + +void thunder_uncore_del(struct perf_event *event, int flags) +{ + struct thunder_uncore *uncore = event_to_thunder_uncore(event); + struct hw_perf_event *hwc = &event->hw; + int i; + + event->pmu->stop(event, PERF_EF_UPDATE); + + for (i = 0; i < uncore->num_counters; i++) { + if (cmpxchg(&uncore->events[i], event, NULL) == event) + break; + } + hwc->idx = -1; +} + +int thunder_uncore_event_init(struct perf_event *event) +{ + struct hw_perf_event *hwc = &event->hw; + struct thunder_uncore *uncore; + + if (event->attr.type != event->pmu->type) + return -ENOENT; + + /* we do not support sampling */ + if (is_sampling_event(event)) + return -EINVAL; + + /* counters do not have these bits */ + if (event->attr.exclude_user || + event->attr.exclude_kernel || + event->attr.exclude_host || + event->attr.exclude_guest || + event->attr.exclude_hv || + event->attr.exclude_idle) + return -EINVAL; + + /* and we do not enable counter overflow interrupts */ + + uncore = event_to_thunder_uncore(event); + if (!uncore) + return -ENODEV; + if (!uncore->event_valid(event->attr.config)) + return -EINVAL; + + hwc->config = event->attr.config; + hwc->idx = -1; + + /* and we don't care about CPU */ + + return 0; +} + +static cpumask_t thunder_active_mask; + +static ssize_t thunder_uncore_attr_show_cpumask(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + cpumask_t *active_mask = &thunder_active_mask; + + /* + * Thunder uncore events are independent from CPUs. Provide a cpumask + * nevertheless to prevent perf from adding the event per-cpu and just + * set the mask to one online CPU. + */ + cpumask_set_cpu(cpumask_first(cpu_online_mask), active_mask); + + return cpumap_print_to_pagebuf(true, buf, active_mask); +} +static DEVICE_ATTR(cpumask, S_IRUGO, thunder_uncore_attr_show_cpumask, NULL); + +static struct attribute *thunder_uncore_attrs[] = { + &dev_attr_cpumask.attr, + NULL, +}; + +struct attribute_group thunder_uncore_attr_group = { + .attrs = thunder_uncore_attrs, +}; + +ssize_t thunder_events_sysfs_show(struct device *dev, + struct device_attribute *attr, + char *page) +{ + struct perf_pmu_events_attr *pmu_attr = + container_of(attr, struct perf_pmu_events_attr, attr); + + if (pmu_attr->event_str) + return sprintf(page, "%s", pmu_attr->event_str); + + return 0; +} + +int __init thunder_uncore_setup(struct thunder_uncore *uncore, int id, + unsigned long offset, unsigned long size, + struct pmu *pmu) +{ + struct pci_dev *pdev = NULL; + pci_bus_addr_t start; + int ret, node = 0; + + /* detect PCI devices */ + do { + pdev = pci_get_device(PCI_VENDOR_ID_CAVIUM, id, pdev); + if (!pdev) + break; + start = pci_resource_start(pdev, 0); + uncore->pdevs[node].pdev = pdev; + uncore->pdevs[node].base = start; + uncore->pdevs[node].map = ioremap(start + offset, size); + node++; + if (node >= MAX_NR_UNCORE_PDEVS) { + pr_err("reached pdev limit\n"); + break; + } + } while (1); + + if (!node) + return -ENODEV; + + uncore->nr_units = node; + + ret = perf_pmu_register(pmu, pmu->name, -1); + if (ret) + goto fail; + + uncore->pmu = pmu; + return 0; + +fail: + for (node = 0; node < MAX_NR_UNCORE_PDEVS; node++) { + pdev = uncore->pdevs[node].pdev; + if (!pdev) + break; + iounmap(uncore->pdevs[node].map); + pci_dev_put(pdev); + } + return ret; +} + +static int __init thunder_uncore_init(void) +{ + unsigned long implementor = read_cpuid_implementor(); + unsigned long part_number = read_cpuid_part_number(); + u32 variant; + + if (implementor != ARM_CPU_IMP_CAVIUM || + part_number != CAVIUM_CPU_PART_THUNDERX) + return -ENODEV; + + /* detect pass2 which contains different counters */ + variant = MIDR_VARIANT(read_cpuid_id()); + if (variant == 1) + thunder_uncore_version = 1; + pr_info("PMU version: %d\n", thunder_uncore_version); + + return 0; +} +late_initcall(thunder_uncore_init); diff --git a/arch/arm64/kernel/uncore/uncore_cavium.h b/arch/arm64/kernel/uncore/uncore_cavium.h new file mode 100644 index 0000000..acd121d --- /dev/null +++ b/arch/arm64/kernel/uncore/uncore_cavium.h @@ -0,0 +1,73 @@ +#include <linux/types.h> +#include <linux/slab.h> +#include <linux/init.h> +#include <linux/cpu.h> +#include <linux/io.h> +#include <linux/perf_event.h> +#include <linux/pci.h> + +#include <asm/cpufeature.h> +#include <asm/cputype.h> + +#undef pr_fmt +#define pr_fmt(fmt) "thunderx_uncore: " fmt + +enum uncore_type { + NOP_TYPE, +}; + +extern int thunder_uncore_version; + +#define MAX_NR_UNCORE_PDEVS 16 + +/* maximum number of parallel hardware counters for all uncore parts */ +#define MAX_COUNTERS 64 + +/* generic uncore struct for different pmu types */ +struct thunder_uncore { + int num_counters; + int nr_units; + int type; + struct pmu *pmu; + int (*event_valid)(u64); + struct { + unsigned long base; + void __iomem *map; + struct pci_dev *pdev; + } pdevs[MAX_NR_UNCORE_PDEVS]; + struct perf_event *events[MAX_COUNTERS]; +}; + +#define EVENT_PTR(_id) (&event_attr_##_id.attr.attr) + +#define EVENT_ATTR(_name, _val) \ +static struct perf_pmu_events_attr event_attr_##_name = { \ + .attr = __ATTR(_name, 0444, thunder_events_sysfs_show, NULL), \ + .event_str = "event=" __stringify(_val), \ +}; + +#define EVENT_ATTR_STR(_name, _str) \ +static struct perf_pmu_events_attr event_attr_##_name = { \ + .attr = __ATTR(_name, 0444, thunder_events_sysfs_show, NULL), \ + .event_str = _str, \ +}; + +static inline void __iomem *map_offset(unsigned long addr, + struct thunder_uncore *uncore, int unit) +{ + return (void __iomem *) (addr + uncore->pdevs[unit].map); +} + +extern struct attribute_group thunder_uncore_attr_group; + +/* Prototypes */ +struct thunder_uncore *event_to_thunder_uncore(struct perf_event *event); +void thunder_uncore_del(struct perf_event *event, int flags); +int thunder_uncore_event_init(struct perf_event *event); +void thunder_uncore_read(struct perf_event *event); +int thunder_uncore_setup(struct thunder_uncore *uncore, int id, + unsigned long offset, unsigned long size, + struct pmu *pmu); +ssize_t thunder_events_sysfs_show(struct device *dev, + struct device_attribute *attr, + char *page);
Provide uncore facilities for non-CPU performance counter units. Based on Intel/AMD uncore pmu support. The uncore PMUs can be found under /sys/bus/event_source/devices. All counters are exported via sysfs in the corresponding events files under the PMU directory so the perf tool can list the event names. There are 2 points that are special in this implementation: 1) The PMU detection solely relies on PCI device detection. If a matching PCI device is found the PMU is created. The code can deal with multiple units of the same type, e.g. more than one memory controller. 2) Counters are summarized across the different units of the same type, e.g. L2C TAD 0..7 is presented as a single counter (adding the values from TAD 0 to 7). Although losing the ability to read a single value the merged values are easier to use and yield enough information. Signed-off-by: Jan Glauber <jglauber@cavium.com> --- arch/arm64/kernel/Makefile | 1 + arch/arm64/kernel/uncore/Makefile | 1 + arch/arm64/kernel/uncore/uncore_cavium.c | 210 +++++++++++++++++++++++++++++++ arch/arm64/kernel/uncore/uncore_cavium.h | 73 +++++++++++ 4 files changed, 285 insertions(+) create mode 100644 arch/arm64/kernel/uncore/Makefile create mode 100644 arch/arm64/kernel/uncore/uncore_cavium.c create mode 100644 arch/arm64/kernel/uncore/uncore_cavium.h