diff mbox series

PCI/AER: Iterate over error counters instead of error strings

Message ID 20220509181441.31884-1-mkhalfella@purestorage.com (mailing list archive)
State New
Headers show
Series PCI/AER: Iterate over error counters instead of error strings | expand

Commit Message

Mohamed Khalfella May 9, 2022, 6:14 p.m. UTC
PCI AER stats counters sysfs attributes need to iterate over
stats counters instead of stats names. Also, added a build
time check to make sure all counters have entries in strings
array.

Fixes: 0678e3109a3c ("PCI/AER: Simplify __aer_print_error()")
Cc: stable@vger.kernel.org
Reported-by: Meeta Saggi <msaggi@purestorage.com>
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Meeta Saggi <msaggi@purestorage.com>
Reviewed-by: Eric Badger <ebadger@purestorage.com>
---
 drivers/pci/pcie/aer.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Bjorn Helgaas May 10, 2022, 4:43 p.m. UTC | #1
[+cc Rajat]

On Mon, May 09, 2022 at 06:14:41PM +0000, Mohamed Khalfella wrote:
> PCI AER stats counters sysfs attributes need to iterate over
> stats counters instead of stats names. 

Thanks for catching this; it definitely looks like a real issue!  I
guess you're probably seeing junk in the sysfs files?

It would be helpful to reviewers if this said *why* we need to iterate
over the counters instead of the names.  I think the problem is that
the current code reads past the end of the stats counters.

There are parallel arrays here:

  #define AER_MAX_TYPEOF_COR_ERRS         16
  #define AER_MAX_TYPEOF_UNCOR_ERRS       27

  aer_correctable_error_string[32]                               # 32
  pdev->aer_stats->dev_cor_errs[AER_MAX_TYPEOF_COR_ERRS]         # 16
  aer_uncorrectable_error_string[32]                             # 32
  pdev->aer_stats->dev_fatal_errs[AER_MAX_TYPEOF_UNCOR_ERRS]     # 27
  pdev->aer_stats->dev_nonfatal_errs[AER_MAX_TYPEOF_UNCOR_ERRS]  # 27

And here's the current use of them:

  #define aer_stats_dev_attr(..., stats_array, strings_array, ...)
    for (i = 0; i < ARRAY_SIZE(strings_array); i++) {
      if (strings_array[i])
	sysfs_emit_at(..., strings_array[i], stats[i]);          (1)
      else if (stats[i])
	sysfs_emit_at(..., stats[i]);                            (2)

  aer_stats_dev_attr(..., dev_cor_errs, aer_correctable_error_string,
  aer_stats_dev_attr(..., dev_fatal_errs, aer_uncorrectable_error_string,
  aer_stats_dev_attr(..., dev_nonfatal_errs, aer_uncorrectable_error_string,

The current loop iterates over 0..31, which is safe at (1) because the
non-NULL strings are at aer_correctable_error_string[0..15] and
aer_uncorrectable_error_string[0..26].

But it is unsafe at (2) because it references dev_cor_errs[16..31],
dev_fatal_errs[27..31], and dev_nonfatal_errs[27..31], which are past
the end of the arrays.

> Also, added a build time check to make sure all counters have
> entries in strings array.
>
> Fixes: 0678e3109a3c ("PCI/AER: Simplify __aer_print_error()")

Yep, I blew it there.  Rajat did it correctly when he added this with
81aa5206f9a7 ("PCI/AER: Add sysfs attributes to provide AER stats and
breakdown"), and I broke it by extending the string arrays to 32
entries.

> Cc: stable@vger.kernel.org
> Reported-by: Meeta Saggi <msaggi@purestorage.com>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> Reviewed-by: Meeta Saggi <msaggi@purestorage.com>
> Reviewed-by: Eric Badger <ebadger@purestorage.com>
> ---
>  drivers/pci/pcie/aer.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 9fa1f97e5b27..ce99a6d44786 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -533,7 +533,7 @@ static const char *aer_agent_string[] = {
>  	u64 *stats = pdev->aer_stats->stats_array;			\
>  	size_t len = 0;							\
>  									\
> -	for (i = 0; i < ARRAY_SIZE(strings_array); i++) {		\
> +	for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\
>  		if (strings_array[i])					\
>  			len += sysfs_emit_at(buf, len, "%s %llu\n",	\
>  					     strings_array[i],		\

I think maybe we should populate the currently NULL entries in the
string[] arrays and simplify the code here, e.g.,

  static const char *aer_correctable_error_string[] = {
        "RxErr",                        /* Bit Position 0       */
        "dev_cor_errs_bit[1]",
	...

  if (stats[i])
    len += sysfs_emit_at(buf, len, "%s %llu\n", strings_array[i], stats[i]);

It's a little more data space, but easier to verify.

> @@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev)
>  	struct device *device = &dev->device;
>  	struct pci_dev *port = dev->port;
>  
> +	BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
> +		     AER_MAX_TYPEOF_COR_ERRS);
> +	BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
> +		     AER_MAX_TYPEOF_UNCOR_ERRS);

And make these check for "!=" instead of "<".
Mohamed Khalfella May 10, 2022, 9:17 p.m. UTC | #2
> Thanks for catching this; it definitely looks like a real issue!  I
> guess you're probably seeing junk in the sysfs files?

That is correct. The initial report was seeing junk when reading sysfs
files. As descibed, this is happening because we reading data past the
end of the stats counters array.


> I think maybe we should populate the currently NULL entries in the
> string[] arrays and simplify the code here, e.g.,
> 
> static const char *aer_correctable_error_string[] = {
>        "RxErr",                        /* Bit Position 0       */
>        "dev_cor_errs_bit[1]",
>	...
>
>  if (stats[i])
>    len += sysfs_emit_at(buf, len, "%s %llu\n", strings_array[i], stats[i]);

Doing it this way will change the output format. In this case we will show
stats only if their value is greater than zero. The current code shows all the
stats those have names (regardless of their value) plus those have non-zero
values.

>> @@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev)
>>  	struct device *device = &dev->device;
>>  	struct pci_dev *port = dev->port;
>>
>> +	BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
>> +		     AER_MAX_TYPEOF_COR_ERRS);
>> +	BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
>> +		     AER_MAX_TYPEOF_UNCOR_ERRS);
>
> And make these check for "!=" instead of "<".

This will require unnecessarily extending stats arrays to have 32 entries
in order to match names arrays. If you don't feel strogly about changing
"<" to "!=", I prefer to keep the code as it is.
Mohamed Khalfella June 3, 2022, 10:12 p.m. UTC | #3
Is there any chance for this to land in 5.19?

On 5/10/22 14:17, Mohamed Khalfella wrote:
> > Thanks for catching this; it definitely looks like a real issue!  I
> > guess you're probably seeing junk in the sysfs files?
> 
> That is correct. The initial report was seeing junk when reading sysfs
> files. As descibed, this is happening because we reading data past the
> end of the stats counters array.
> 
> 
> > I think maybe we should populate the currently NULL entries in the
> > string[] arrays and simplify the code here, e.g.,
> > 
> > static const char *aer_correctable_error_string[] = {
> >        "RxErr",                        /* Bit Position 0       */
> >        "dev_cor_errs_bit[1]",
> >	...
> >
> >  if (stats[i])
> >    len += sysfs_emit_at(buf, len, "%s %llu\n", strings_array[i], stats[i]);
> 
> Doing it this way will change the output format. In this case we will show
> stats only if their value is greater than zero. The current code shows all the
> stats those have names (regardless of their value) plus those have non-zero
> values.
> 
> >> @@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev)
> >>  	struct device *device = &dev->device;
> >>  	struct pci_dev *port = dev->port;
> >>
> >> +	BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
> >> +		     AER_MAX_TYPEOF_COR_ERRS);
> >> +	BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
> >> +		     AER_MAX_TYPEOF_UNCOR_ERRS);
> >
> > And make these check for "!=" instead of "<".

I am happy to remove these BUILD_BUG_ON() if you think it is a good
idea to do so.

> 
> This will require unnecessarily extending stats arrays to have 32 entries
> in order to match names arrays. If you don't feel strogly about changing
> "<" to "!=", I prefer to keep the code as it is.
Bjorn Helgaas June 3, 2022, 11:58 p.m. UTC | #4
On Fri, Jun 03, 2022 at 10:12:47PM +0000, Mohamed Khalfella wrote:
> Is there any chance for this to land in 5.19?

Too late for v5.19, since the merge window will end in a couple days.
Remind me again if you don't see it in -next by v5.20-rc5 or so.

> On 5/10/22 14:17, Mohamed Khalfella wrote:
> > > Thanks for catching this; it definitely looks like a real issue!  I
> > > guess you're probably seeing junk in the sysfs files?
> > 
> > That is correct. The initial report was seeing junk when reading sysfs
> > files. As descibed, this is happening because we reading data past the
> > end of the stats counters array.
> > 
> > 
> > > I think maybe we should populate the currently NULL entries in the
> > > string[] arrays and simplify the code here, e.g.,
> > > 
> > > static const char *aer_correctable_error_string[] = {
> > >        "RxErr",                        /* Bit Position 0       */
> > >        "dev_cor_errs_bit[1]",
> > >	...
> > >
> > >  if (stats[i])
> > >    len += sysfs_emit_at(buf, len, "%s %llu\n", strings_array[i], stats[i]);
> > 
> > Doing it this way will change the output format. In this case we will show
> > stats only if their value is greater than zero. The current code shows all the
> > stats those have names (regardless of their value) plus those have non-zero
> > values.
> > 
> > >> @@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev)
> > >>  	struct device *device = &dev->device;
> > >>  	struct pci_dev *port = dev->port;
> > >>
> > >> +	BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
> > >> +		     AER_MAX_TYPEOF_COR_ERRS);
> > >> +	BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
> > >> +		     AER_MAX_TYPEOF_UNCOR_ERRS);
> > >
> > > And make these check for "!=" instead of "<".
> 
> I am happy to remove these BUILD_BUG_ON() if you think it is a good
> idea to do so.

I think it's good to enforce correctness there somehow, so let's leave
them there unless somebody has a better idea.

> > This will require unnecessarily extending stats arrays to have 32 entries
> > in order to match names arrays. If you don't feel strogly about changing
> > "<" to "!=", I prefer to keep the code as it is.
Mohamed Khalfella June 4, 2022, 8:30 p.m. UTC | #5
On 6/3/22 16:58, Bjorn Helgaas wrote:
> On Fri, Jun 03, 2022 at 10:12:47PM +0000, Mohamed Khalfella wrote:
>> Is there any chance for this to land in 5.19?
>
> Too late for v5.19, since the merge window will end in a couple days.
> Remind me again if you don't see it in -next by v5.20-rc5 or so.
>

Thank you. I will keep an eye on -next.

>> On 5/10/22 14:17, Mohamed Khalfella wrote:
>>>> Thanks for catching this; it definitely looks like a real issue!  I
>>>> guess you're probably seeing junk in the sysfs files?
>>>
>>> That is correct. The initial report was seeing junk when reading sysfs
>>> files. As descibed, this is happening because we reading data past the
>>> end of the stats counters array.
>>>
>>>
>>>> I think maybe we should populate the currently NULL entries in the
>>>> string[] arrays and simplify the code here, e.g.,
>>>>
>>>> static const char *aer_correctable_error_string[] = {
>>>>        "RxErr",                        /* Bit Position 0       */
>>>>        "dev_cor_errs_bit[1]",
>>>> 	...
>>>>
>>>>  if (stats[i])
>>>>    len += sysfs_emit_at(buf, len, "%s %llu\n", strings_array[i], stats[i]);
>>>
>>> Doing it this way will change the output format. In this case we will show
>>> stats only if their value is greater than zero. The current code shows all the
>>> stats those have names (regardless of their value) plus those have non-zero
>>> values.
>>>
>>>>> @@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev)
>>>>>  	struct device *device = &dev->device;
>>>>>  	struct pci_dev *port = dev->port;
>>>>>
>>>>> +	BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
>>>>> +		     AER_MAX_TYPEOF_COR_ERRS);
>>>>> +	BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
>>>>> +		     AER_MAX_TYPEOF_UNCOR_ERRS);
>>>>
>>>> And make these check for "!=" instead of "<".
>>
>> I am happy to remove these BUILD_BUG_ON() if you think it is a good
>> idea to do so.
>
> I think it's good to enforce correctness there somehow, so let's leave
> them there unless somebody has a better idea.
>
>>> This will require unnecessarily extending stats arrays to have 32 entries
>>> in order to match names arrays. If you don't feel strogly about changing
>>> "<" to "!=", I prefer to keep the code as it is.
diff mbox series

Patch

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b27..ce99a6d44786 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -533,7 +533,7 @@  static const char *aer_agent_string[] = {
 	u64 *stats = pdev->aer_stats->stats_array;			\
 	size_t len = 0;							\
 									\
-	for (i = 0; i < ARRAY_SIZE(strings_array); i++) {		\
+	for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\
 		if (strings_array[i])					\
 			len += sysfs_emit_at(buf, len, "%s %llu\n",	\
 					     strings_array[i],		\
@@ -1342,6 +1342,11 @@  static int aer_probe(struct pcie_device *dev)
 	struct device *device = &dev->device;
 	struct pci_dev *port = dev->port;
 
+	BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
+		     AER_MAX_TYPEOF_COR_ERRS);
+	BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
+		     AER_MAX_TYPEOF_UNCOR_ERRS);
+
 	/* Limit to Root Ports or Root Complex Event Collectors */
 	if ((pci_pcie_type(port) != PCI_EXP_TYPE_RC_EC) &&
 	    (pci_pcie_type(port) != PCI_EXP_TYPE_ROOT_PORT))