mbox series

[RFC,0/4] Rate limit PCIe Correctable Errors

Message ID cover.1734005191.git.karolina.stolarek@oracle.com (mailing list archive)
Headers show
Series Rate limit PCIe Correctable Errors | expand

Message

Karolina Stolarek Dec. 12, 2024, 2:27 p.m. UTC
TL;DR
====

We are getting multiple reports about excessive logging of Correctable
Errors with no clear common root cause. As these errors are already
corrected by hardware, it makes sense to limit them. Introduce
a ratelimit state definition to pci_dev to control the number of
messages reported by a Root Port within a specified time interval.
The series adds other improvements in the area, as outlined in the
Proposal section.

Introduction
============

PCIe devices are expected to detect and report errors related to
transactions they participate in. If an error is discovered,
the device creates an Error Message and sends it to the Root Port
which, in turn, generates an interrupt upon receiving it. The
AER driver services such an interrupt by logging the error and
performing recovery work if needed. This means that every time
an error is detected, potentially every transaction, it gets
reported. But not all errors are of the same significance. We can
categorize them into errors from which the hardware can recover
(Correctable), and errors that require software intervention and
impact the functionality of the interface (Uncorrectable). While
informing about the latter is essential, reports of Correctable
Errors are purely informational. Such logs can helpful in monitoring
the stability of the link but if such errors are reported too
frequently, they pollute the logs and overshadow failures that require
our attention.

Background
==========

The excessive logging of AER Correctable Errors has been reported
many times in the past, across different devices, with no obvious
common root cause: [1], [2], [3] and more. There is no easy way to
control the rate of reported errors, so the end users would go for
workarounds, such as disabling AER feature, or creating scripts to
directly disable generating messages for Correctable Errors per
device[4]. In the majority of cases, lowering the rate at which
such errors are reported, should be enough to elevate the symptoms.

Previous Work
=============

The series submitted by Grant Grundler[5] attempted to rate limit
messages by using a ratelimited variant of pci_info wrapper. Although
this change lowers the volume of specific prinkts, the rate limiting
does not work globally, and messages could appear out of sync.
The desired solution is to introduce a manually managed ratelimit_state,
which would be used every time a Correctable Error is signaled.

Proposal
========

To overcome the limitations of per-printk-call state, this series
introduces a ratelimit state definition and uses it to decide if
a registered Correctable Error should be reported. Initially, we
allow to report one error per 2 seconds, and extend that interval
to 30 seconds after reaching the threshold of 1000 Correctable Errors
on a Root Port. The values proposed here are arbitrary and up
for discussion.

If all of this still does not help, we give an option to silence
the messages by exposing a bit in Device Control Register responsible
for Correctable Error Message generation as a sysfs attribute.

The series introduces changes as follows:

  1) Use the same log level across printks when reporting an AER Error

  2) Add a ratelimit_state definition to pci_dev struct to maintain
     the ratelimit state between interrupts. Check it to decide if
     a Correctable Error can be reported

  3) Extend the interval between reported Correctable Errors after
     reaching a threshold. Make this decision based on the value of
     aer_rootport_total_err_cor counter. Add a flag to pci_dev to signal
     this event to avoid continuous updates of the ratelimit interval

  4) Add cor_errs_reporting_enable sysfs attribute to read and write
     the bit controlling the generation of Correctable Error messages
     by the device.

Patches 1-3 depend one on another, patch 4 is a standalone change.

This solution requires modification to pci_dev struct, which can be
deemed invasive. Still, this is the only place capable of maintaining
a stateful ratelimit. The other structure used by aer_isr_one_error,
a function that processes the signaled error, is populated and reused
on each serviced interrupt. The alternative solution would be to have
one global rate limit state in the aforementioned function, but this
definition could be easily abused by one noisy device. It would hit
limit in no time, leaving Correctable Errors, even coming from other
Root Ports, unreported.

Having said that, I am more than happy to discuss other ways of
getting Correctable errors under control.

Results
=======

The patches were tested with aer-inject tool as well as with actual
hardware, Samsung PM1733 NVMe. The disk would generate high volumes of
Correctable Error messages, sometimes as soon as an IRQ was allocated
for it:

[   20.662657] pcieport 0000:00:01.3: AER: Multiple Corrected error
message received from 0000:03:00.0
[   20.677779] pci 0000:03:00.0: PCIe Bus Error: severity=Corrected,
type=Physical Layer, (Receiver ID)
[   20.693003] pci 0000:03:00.0:   device [144d:a824] error
status/mask=00000001/00000000
[   20.706861] pci 0000:03:00.0:    [ 0] RxErr
[   20.718171] pcieport 0000:00:01.3: AER: Multiple Corrected error
message received from 0000:03:00.0
[   20.733303] pci 0000:03:00.0: PCIe Bus Error: severity=Corrected,
type=Physical Layer, (Receiver ID)
[   20.748545] pci 0000:03:00.0:   device [144d:a824] error
status/mask=00000001/00000000
[   20.762421] pci 0000:03:00.0:    [ 0] RxErr
[   20.773733] pcieport 0000:00:01.3: AER: Multiple Corrected error
message received from 0000:03:00.0
[   20.788874] pci 0000:03:00.0: PCIe Bus Error: severity=Corrected,
type=Physical Layer, (Receiver ID)
[   20.804130] pci 0000:03:00.0:   device [144d:a824] error
status/mask=00000001/00000000
[   20.818026] pci 0000:03:00.0:    [ 0] RxErr
[   20.829359] pcieport 0000:00:01.3: AER: Corrected error message
received from 0000:03:00.0
[   20.843615] pci 0000:03:00.0: PCIe Bus Error: severity=Corrected,
type=Physical Layer, (Receiver ID)
[   20.858868] pci 0000:03:00.0:   device [144d:a824] error
status/mask=00000001/00000000
[   20.872752] pci 0000:03:00.0:    [ 0] RxErr

With the basic rate limiting, the dmesg is still populated at the
high rate:

[  248.871792] pcieport 0000:00:01.3: Correctable error message received
from 0000:03:00.0
[  248.871805] nvme 0000:03:00.0: PCIe Bus Error: severity=Correctable,
type=Physical Layer, (Receiver ID)
[  248.871808] nvme 0000:03:00.0:   device [144d:a824] error
status/mask=00000001/00000000
[  248.871811] nvme 0000:03:00.0:    [ 0] RxErr                  (First)
[  251.037039] report_aer_cor_err: 7 callbacks suppressed
[  251.037048] pcieport 0000:00:01.3: Correctable error message received
from 0000:03:00.0
[  251.037061] nvme 0000:03:00.0: PCIe Bus Error: severity=Correctable,
type=Physical Layer, (Receiver ID)
[  251.037064] nvme 0000:03:00.0:   device [144d:a824] error
status/mask=00000001/00000000
[  251.037067] nvme 0000:03:00.0:    [ 0] RxErr                  (First)
[  253.272609] report_aer_cor_err: 5 callbacks suppressed

After the threshold of Correctable Errors is reached, the rate limit
interval is updated:

[  342.528507] pcieport 0000:00:01.3: AER: Over 1000 Correctable Errors
reported, increasing the rate limit
[  371.958275] report_aer_cor_err: 94 callbacks suppressed
[  371.958284] pcieport 0000:00:01.3: Correctable error message received
from 0000:03:00.0
[  371.958297] nvme 0000:03:00.0: PCIe Bus Error: severity=Correctable,
type=Physical Layer, (Receiver ID)
[  371.958300] nvme 0000:03:00.0:   device [144d:a824] error
status/mask=00000001/00000000
[  371.958302] nvme 0000:03:00.0:    [ 0] RxErr                  (First)
[  402.018141] report_aer_cor_err: 89 callbacks suppressed

----------------------
[1] - https://bugzilla.kernel.org/show_bug.cgi?id=215027
[2] - https://bugzilla.kernel.org/show_bug.cgi?id=201517
[3] - https://bugzilla.kernel.org/show_bug.cgi?id=196183
[4] - https://gist.github.com/flisboac/5a0711201311b63d23b292110bb383cd
[5] -
https://lore.kernel.org/linux-pci/20230606035442.2886343-2-grundler@chromium.org/

Karolina Stolarek (4):
  PCI/AER: Use the same log level for all messages
  PCI/AER: Add Correctable Errors rate limiting
  PCI/AER: Increase the rate limit interval after threshold
  PCI: Add 'cor_err_reporting_enable' attribute

 Documentation/ABI/testing/sysfs-bus-pci |   7 ++
 drivers/pci/pci-sysfs.c                 |  42 +++++++++
 drivers/pci/pci.h                       |   2 +-
 drivers/pci/pcie/aer.c                  | 108 ++++++++++++++++--------
 drivers/pci/pcie/dpc.c                  |   2 +-
 include/linux/pci.h                     |   2 +
 6 files changed, 127 insertions(+), 36 deletions(-)

Comments

Jonathan Cameron Dec. 16, 2024, 10:44 a.m. UTC | #1
On Thu, 12 Dec 2024 14:27:28 +0000
Karolina Stolarek <karolina.stolarek@oracle.com> wrote:

> TL;DR
> ====
> 
> We are getting multiple reports about excessive logging of Correctable
> Errors with no clear common root cause. As these errors are already
> corrected by hardware, it makes sense to limit them. Introduce
> a ratelimit state definition to pci_dev to control the number of
> messages reported by a Root Port within a specified time interval.
> The series adds other improvements in the area, as outlined in the
> Proposal section.

Hi Karolina,

Just to check, this doesn't affect tracepoints?   From a quick read
of the patches they look like they will still be triggered so monitoring
tools will see the correctable errors.  That's definitely the right
option even if we limit prints to the kernel log.

Assuming I read it right, change the series title to make it clear this
is just the prints to the kernel log that you are touching.

Thanks,

Jonathan
Karolina Stolarek Dec. 17, 2024, 8:34 a.m. UTC | #2
Hi Jonathan,

Many thanks for taking a look at the patches.

On 16/12/2024 11:44, Jonathan Cameron wrote:
> 
> Hi Karolina,
> 
> Just to check, this doesn't affect tracepoints?   From a quick read
> of the patches they look like they will still be triggered so monitoring
> tools will see the correctable errors.  That's definitely the right
> option even if we limit prints to the kernel log.

The tracing seems to be working -- rasdaemon recorded every correctable
error despite that we didn't print all of them to the kernel log.

> Assuming I read it right, change the series title to make it clear this
> is just the prints to the kernel log that you are touching.

OK, will do that in the next version.

All the best,
Karolina

> 
> Thanks,
> 
> Jonathan