Message ID | ce077a0db5d4afdbcc63a635fece9793aaae055f.1716205838.git.shravankr@nvidia.com (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
Series | Updates to mlxbf-pmc | expand |
On Mon, 20 May 2024, Shravan Kumar Ramani wrote: > Add support for programming any counter to monitor the cycle count. > Since counting of cycles using 32-bit ocunters would result in frequent counters > wraparounds, add the ability to combine 2 adjacent 32-bit counters to > form 1 64-bit counter. > Both these features are supported by BlueField-3 PMC hardware, hence > the required bit-fields are exposed by the driver via sysfs to allow > the user to configure as needed. I'm trying to understand what happens for the other counter, when the use_odd_counter is enabled? This change also doesn't add code that would make the other counter -EBUSY, should that be done? For 64-bit counter, I suppose the userspace is expected to read the full counter from two sysfs files and combine the value (your documentation doesn't explain this)? That seems non-optimal, why cannot kernel just return the full combined 64-value directly in kernel? Similarly, are these cycle counters occupying the same space as non-cycle counters (so both can/cannot be used that the same time)? I'm asking this because you're adding a parallel interface to read the value and if it's either-or, I don't understand why the value needs to be read from different file depending on the counter counting in cycles or not.
> > Both these features are supported by BlueField-3 PMC hardware, hence > > the required bit-fields are exposed by the driver via sysfs to allow > > the user to configure as needed. > > I'm trying to understand what happens for the other counter, when the > use_odd_counter is enabled? This change also doesn't add code that would > make the other counter -EBUSY, should that be done? When 2 32-bit counters are coupled to form a 64-bit counter using this setting, one counter will hold the lower 32 bits while the other will hold the upper 32. So the other counter (or syses corresponding to it) also needs to be accessed. > For 64-bit counter, I suppose the userspace is expected to read the full > counter from two sysfs files and combine the value (your documentation > doesn't explain this)? That seems non-optimal, why cannot kernel just > return the full combined 64-value directly in kernel? I will add more clear comments for this. While it is true that the driver could combine the 2 fields and present a 64-bit value via one of the sysfs, the reason for the current approach is that there are other interfaces which expose the same counters for our platform and there are tools that are expected to work on top of both interfaces for the purpose of collecting performance stats. The other interfaces follow this approach of having lower and upper 32-bits separately in each counter, and the tools expect the same. Hence the driver follows this approach to keep things consistent across the BlueField platform. > > Similarly, are these cycle counters occupying the same space as non-cycle > counters (so both can/cannot be used that the same time)? I'm asking this > because you're adding a parallel interface to read the value and if it's > either-or, I don't understand why the value needs to be read from > different file depending on the counter counting in cycles or not. It is the same file. The count_clock sysfs exposes 16 bits, one for each counter, to allow the user to dedicate any of the 16 counters to counting cycles. Once set, the corresponding counter can no longer monitor other events, and the same sysfs can be accessed to read the cycle count. Again, I will try capture this better and more elaborately in the documentation. Thanks, Shravan
On Mon, 3 Jun 2024, Shravan Ramani wrote: > > > Both these features are supported by BlueField-3 PMC hardware, hence > > > the required bit-fields are exposed by the driver via sysfs to allow > > > the user to configure as needed. > > > > I'm trying to understand what happens for the other counter, when the > > use_odd_counter is enabled? This change also doesn't add code that would > > make the other counter -EBUSY, should that be done? > > When 2 32-bit counters are coupled to form a 64-bit counter using this setting, > one counter will hold the lower 32 bits while the other will hold the upper 32. > So the other counter (or syses corresponding to it) also needs to be accessed. > > > For 64-bit counter, I suppose the userspace is expected to read the full > > counter from two sysfs files and combine the value (your documentation > > doesn't explain this)? That seems non-optimal, why cannot kernel just > > return the full combined 64-value directly in kernel? > > I will add more clear comments for this. > While it is true that the driver could combine the 2 fields and present a > 64-bit value via one of the sysfs, the reason for the current approach is that > there are other interfaces which expose the same counters for our platform > and there are tools that are expected to work on top of both interfaces for > the purpose of collecting performance stats. > The other interfaces follow this > approach of having lower and upper 32-bits separately in each counter, and > the tools expect the same. Hence the driver follows this approach to keep > things consistent across the BlueField platform. Hi, I went to look through the existing arrays in mlxbf-pmc.c but did not find any entries that would have clearly indicated the counters being hi/lo parts of the same counter. There were a few 0/1 ones which could be the same counter although I suspect even they are not parts of the same counter but two separate entities called 0 and 1 having the same counter. Could you please elaborate further what you meant with the note about other interfaces above so I can better assess the claim?
> > When 2 32-bit counters are coupled to form a 64-bit counter using this setting, > > one counter will hold the lower 32 bits while the other will hold the upper 32. > > So the other counter (or syses corresponding to it) also needs to be accessed. > > > > > For 64-bit counter, I suppose the userspace is expected to read the full > > > counter from two sysfs files and combine the value (your documentation > > > doesn't explain this)? That seems non-optimal, why cannot kernel just > > > return the full combined 64-value directly in kernel? > > > > I will add more clear comments for this. > > While it is true that the driver could combine the 2 fields and present a > > 64-bit value via one of the sysfs, the reason for the current approach is that > > there are other interfaces which expose the same counters for our platform > > and there are tools that are expected to work on top of both interfaces for > > the purpose of collecting performance stats. > > > The other interfaces follow this > > approach of having lower and upper 32-bits separately in each counter, and > > the tools expect the same. Hence the driver follows this approach to keep > > things consistent across the BlueField platform. > > Hi, > > I went to look through the existing arrays in mlxbf-pmc.c but did not find > any entries that would have clearly indicated the counters being hi/lo > parts of the same counter. There were a few 0/1 ones which could be the > same counter although I suspect even they are not parts of the same > counter but two separate entities called 0 and 1 having the same counter. > > Could you please elaborate further what you meant with the note about > other interfaces above so I can better assess the claim? When combining 2 counters using the "use_odd_counter" setting, the mechanism of joining them or assigning upper or lower 32 bits to a counter is handled in HW and not by the driver. For example, if bit0 of "use_odd_counter" is set, counter0 and counter1 (which were originally separate counters) automatically become the lower and upper bits of one 64-bit value. The user needs to read both these sysfs separately to get the full 64-bit value. The driver does not do any special handling for such cases, merely provides access to both counter0 and counter1. Since the events supported by the blocks are quite HW centric and low-level in nature, the driver is generally used alongside various tools which work on top of this driver to collect telemetry info and provide more readable statistics to the end-user. Similar to this driver, there are other FW interfaces providing access to these counters (same and other additional ones as well that belong to other HW blocks). For the sake of consistency and to allow the tools to be compatible with all interfaces, the counter data needs to be accessible in the same way, ie, as 32-bit upper and lower values in counter0 and counter1 sysfs as in the above case. Thanks, Shravan
On Tue, 11 Jun 2024, Shravan Ramani wrote: > > > When 2 32-bit counters are coupled to form a 64-bit counter using this setting, > > > one counter will hold the lower 32 bits while the other will hold the upper 32. > > > So the other counter (or syses corresponding to it) also needs to be accessed. > > > > > > > For 64-bit counter, I suppose the userspace is expected to read the full > > > > counter from two sysfs files and combine the value (your documentation > > > > doesn't explain this)? That seems non-optimal, why cannot kernel just > > > > return the full combined 64-value directly in kernel? > > > > > > I will add more clear comments for this. > > > While it is true that the driver could combine the 2 fields and present a > > > 64-bit value via one of the sysfs, the reason for the current approach is that > > > there are other interfaces which expose the same counters for our platform > > > and there are tools that are expected to work on top of both interfaces for > > > the purpose of collecting performance stats. > > > > > The other interfaces follow this > > > approach of having lower and upper 32-bits separately in each counter, and > > > the tools expect the same. Hence the driver follows this approach to keep > > > things consistent across the BlueField platform. > > > > Hi, > > > > I went to look through the existing arrays in mlxbf-pmc.c but did not find > > any entries that would have clearly indicated the counters being hi/lo > > parts of the same counter. There were a few 0/1 ones which could be the > > same counter although I suspect even they are not parts of the same > > counter but two separate entities called 0 and 1 having the same counter. > > > > Could you please elaborate further what you meant with the note about > > other interfaces above so I can better assess the claim? > > When combining 2 counters using the "use_odd_counter" setting, the mechanism > of joining them or assigning upper or lower 32 bits to a counter is handled in HW > and not by the driver. For example, if bit0 of "use_odd_counter" is set, counter0 > and counter1 (which were originally separate counters) automatically become > the lower and upper bits of one 64-bit value. The user needs to read both these > sysfs separately to get the full 64-bit value. The driver does not do any special > handling for such cases, merely provides access to both counter0 and counter1. I know all this by now, but we're discussion here is whether kernel should do "special handling". Although, it's not really correct to depict representing 64-bit counter in its entirety as "special handling". I think the kernel should combine the 64-bit halved and you argumented it shouldn't. When I went to confirm the claim your argument was based on, I couldn't find on what basis the claim was made. > Since the events supported by the blocks are quite HW centric and low-level in > nature, the driver is generally used alongside various tools which work on top of > this driver to collect telemetry info and provide more readable statistics to the > end-user. Similar to this driver, there are other FW interfaces providing access to > these counters (same and other additional ones as well that belong to other HW > blocks). For the sake of consistency and to allow the tools to be compatible with > all interfaces, the counter data needs to be accessible in the same way, ie, as 32-bit > upper and lower values in counter0 and counter1 sysfs as in the above case. This does nothing to answer my question. Where in the kernel, there's an example where a 64-bit counter for BlueField platform is presented as 2 32-bit counters? If there isn't any examples in the kernel, your statement about consistency within the platform doesn't hold water, quoted (again) here for clarity what I'm refering to: "The other interfaces follow this approach of having lower and upper 32-bits separately in each counter, and the tools expect the same. Hence the driver follows this approach to keep things consistent across the BlueField platform." Where I can find those "other interfaces" that already follow this convention?
> This does nothing to answer my question. Where in the kernel, there's an > example where a 64-bit counter for BlueField platform is presented as 2 > 32-bit counters? If there isn't any examples in the kernel, your statement > about consistency within the platform doesn't hold water, quoted (again) > here for clarity what I'm refering to: > > "The other interfaces follow this approach of having lower and upper > 32-bits separately in each counter, and the tools expect the same. > Hence the driver follows this approach to keep things consistent across > the BlueField platform." > > Where I can find those "other interfaces" that already follow this > convention? Ah, I think I misunderstood the question and went on elaborating the same thing, apologies. The other interfaces are not part of the kernel. They are part of the BlueField Software Package, which also contains the tools that put together the performance metrics. My thinking was that since this is a platform driver and is used along with the BlueField Software Package, consistency with the tools which were developed following the same convention could be considered, as long as the driver is not doing something non-standard, of course. I can change the driver handling to present 64-bit data if you insist. Thanks, Shravan
On Fri, 14 Jun 2024, Shravan Ramani wrote: > > This does nothing to answer my question. Where in the kernel, there's an > > example where a 64-bit counter for BlueField platform is presented as 2 > > 32-bit counters? If there isn't any examples in the kernel, your statement > > about consistency within the platform doesn't hold water, quoted (again) > > here for clarity what I'm refering to: > > > > "The other interfaces follow this approach of having lower and upper > > 32-bits separately in each counter, and the tools expect the same. > > Hence the driver follows this approach to keep things consistent across > > the BlueField platform." > > > > Where I can find those "other interfaces" that already follow this > > convention? > > Ah, I think I misunderstood the question and went on elaborating the > same thing, apologies. The other interfaces are not part of the kernel. > They are part of the BlueField Software Package, which also contains > the tools that put together the performance metrics. > My thinking was that since this is a platform driver and is used along > with the BlueField Software Package, consistency with the tools which > were developed following the same convention could be considered, > as long as the driver is not doing something non-standard, of course. > I can change the driver handling to present 64-bit data if you insist. I'd certainly prefer 64-bit data be presented as such by the kernel. While you make that change, please make sure the driver correctly handles the lower dword wraps without returning an inconsistent reading (assuming the counter parts are read non-atomically, it is a common pitfall).
diff --git a/drivers/platform/mellanox/mlxbf-pmc.c b/drivers/platform/mellanox/mlxbf-pmc.c index 4ed9c7fd2b62..635ecc3b3845 100644 --- a/drivers/platform/mellanox/mlxbf-pmc.c +++ b/drivers/platform/mellanox/mlxbf-pmc.c @@ -88,6 +88,8 @@ #define MLXBF_PMC_CRSPACE_PERFMON_CTL(n) (n * MLXBF_PMC_CRSPACE_PERFMON_REG0_SZ) #define MLXBF_PMC_CRSPACE_PERFMON_EN BIT(30) #define MLXBF_PMC_CRSPACE_PERFMON_CLR BIT(28) +#define MLXBF_PMC_CRSPACE_PERFMON_UOC GENMASK(15, 0) +#define MLXBF_PMC_CRSPACE_PERFMON_COUNT_CLOCK(n) (MLXBF_PMC_CRSPACE_PERFMON_CTL(n) + 0x4) #define MLXBF_PMC_CRSPACE_PERFMON_VAL0(n) (MLXBF_PMC_CRSPACE_PERFMON_CTL(n) + 0xc) /** @@ -114,6 +116,8 @@ struct mlxbf_pmc_attribute { * @attr_event: Attributes for "event" sysfs files * @attr_event_list: Attributes for "event_list" sysfs files * @attr_enable: Attributes for "enable" sysfs files + * @attr_use_odd_counter: Attributes for "use_odd_counter" sysfs files + * @attr_count_clock: Attributes for "count_clock" sysfs files * @block_attr: All attributes needed for the block * @block_attr_grp: Attribute group for the block */ @@ -126,6 +130,8 @@ struct mlxbf_pmc_block_info { struct mlxbf_pmc_attribute *attr_event; struct mlxbf_pmc_attribute attr_event_list; struct mlxbf_pmc_attribute attr_enable; + struct mlxbf_pmc_attribute attr_use_odd_counter; + struct mlxbf_pmc_attribute attr_count_clock; struct attribute *block_attr[MLXBF_PMC_MAX_ATTRS]; struct attribute_group block_attr_grp; }; @@ -1763,6 +1769,103 @@ static ssize_t mlxbf_pmc_enable_store(struct device *dev, return count; } +/* Show function for "use_odd_counter" sysfs files - only for crspace */ +static ssize_t mlxbf_pmc_use_odd_counter_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct mlxbf_pmc_attribute *attr_use_odd_counter = container_of( + attr, struct mlxbf_pmc_attribute, dev_attr); + unsigned int blk_num; + u32 value, reg; + + blk_num = attr_use_odd_counter->nr; + + if (mlxbf_pmc_readl(pmc->block[blk_num].mmio_base + + MLXBF_PMC_CRSPACE_PERFMON_CTL(pmc->block[blk_num].counters), + ®)) + return -EINVAL; + + value = FIELD_GET(MLXBF_PMC_CRSPACE_PERFMON_UOC, reg); + + return sysfs_emit(buf, "%u\n", value); +} + +/* Store function for "use_odd_counter" sysfs files - only for crspace */ +static ssize_t mlxbf_pmc_use_odd_counter_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct mlxbf_pmc_attribute *attr_use_odd_counter = container_of( + attr, struct mlxbf_pmc_attribute, dev_attr); + unsigned int blk_num; + u32 uoc, reg; + int err; + + blk_num = attr_use_odd_counter->nr; + + err = kstrtouint(buf, 0, &uoc); + if (err < 0) + return err; + + err = mlxbf_pmc_readl(pmc->block[blk_num].mmio_base + + MLXBF_PMC_CRSPACE_PERFMON_CTL(pmc->block[blk_num].counters), + ®); + if (err) + return -EINVAL; + + reg &= ~MLXBF_PMC_CRSPACE_PERFMON_UOC; + reg |= FIELD_PREP(MLXBF_PMC_CRSPACE_PERFMON_UOC, uoc); + + mlxbf_pmc_write(pmc->block[blk_num].mmio_base + + MLXBF_PMC_CRSPACE_PERFMON_CTL(pmc->block[blk_num].counters), + MLXBF_PMC_WRITE_REG_32, reg); + + return count; +} + +/* Show function for "count_clock" sysfs files - only for crspace */ +static ssize_t mlxbf_pmc_count_clock_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct mlxbf_pmc_attribute *attr_count_clock = container_of( + attr, struct mlxbf_pmc_attribute, dev_attr); + unsigned int blk_num; + u32 reg; + + blk_num = attr_count_clock->nr; + + if (mlxbf_pmc_readl(pmc->block[blk_num].mmio_base + + MLXBF_PMC_CRSPACE_PERFMON_COUNT_CLOCK(pmc->block[blk_num].counters), + ®)) + return -EINVAL; + + return sysfs_emit(buf, "%u\n", reg); +} + +/* Store function for "count_clock" sysfs files - only for crspace */ +static ssize_t mlxbf_pmc_count_clock_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct mlxbf_pmc_attribute *attr_count_clock = container_of( + attr, struct mlxbf_pmc_attribute, dev_attr); + unsigned int blk_num; + u32 reg; + int err; + + blk_num = attr_count_clock->nr; + + err = kstrtouint(buf, 0, ®); + if (err < 0) + return err; + + mlxbf_pmc_write(pmc->block[blk_num].mmio_base + + MLXBF_PMC_CRSPACE_PERFMON_COUNT_CLOCK(pmc->block[blk_num].counters), + MLXBF_PMC_WRITE_REG_32, reg); + + return count; +} + /* Populate attributes for blocks with counters to monitor performance */ static int mlxbf_pmc_init_perftype_counter(struct device *dev, unsigned int blk_num) { @@ -1799,6 +1902,37 @@ static int mlxbf_pmc_init_perftype_counter(struct device *dev, unsigned int blk_ attr = NULL; } + if (pmc->block[blk_num].type == MLXBF_PMC_TYPE_CRSPACE) { + /* + * Couple adjacent odd and even 32-bit counters to form 64-bit counters + * using "use_odd_counter" sysfs which has one bit per even counter. + */ + attr = &pmc->block[blk_num].attr_use_odd_counter; + attr->dev_attr.attr.mode = 0644; + attr->dev_attr.show = mlxbf_pmc_use_odd_counter_show; + attr->dev_attr.store = mlxbf_pmc_use_odd_counter_store; + attr->nr = blk_num; + attr->dev_attr.attr.name = devm_kasprintf(dev, GFP_KERNEL, + "use_odd_counter"); + if (!attr->dev_attr.attr.name) + return -ENOMEM; + pmc->block[blk_num].block_attr[++i] = &attr->dev_attr.attr; + attr = NULL; + + /* Program crspace counters to count clock cycles using "count_clock" sysfs */ + attr = &pmc->block[blk_num].attr_count_clock; + attr->dev_attr.attr.mode = 0644; + attr->dev_attr.show = mlxbf_pmc_count_clock_show; + attr->dev_attr.store = mlxbf_pmc_count_clock_store; + attr->nr = blk_num; + attr->dev_attr.attr.name = devm_kasprintf(dev, GFP_KERNEL, + "count_clock"); + if (!attr->dev_attr.attr.name) + return -ENOMEM; + pmc->block[blk_num].block_attr[++i] = &attr->dev_attr.attr; + attr = NULL; + } + pmc->block[blk_num].attr_counter = devm_kcalloc( dev, pmc->block[blk_num].counters, sizeof(struct mlxbf_pmc_attribute), GFP_KERNEL);