
[v4,5/7] PCI/AER: Introduce ratelimit for error logs

Message ID: 20250320082057.622983-6-pandoh@google.com (mailing list archive)
State: Superseded
Delegated to: Bjorn Helgaas
Series: Rate limit AER logs

Commit Message

Jon Pan-Doh March 20, 2025, 8:20 a.m. UTC
Spammy devices can flood kernel logs with AER errors and slow/stall
execution. Add per-device ratelimits for AER correctable and uncorrectable
errors that use the kernel defaults (10 per 5s).

Tested using aer-inject[1]. Sent 11 AER errors. Observed 10 errors logged
while AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) show
true count of 11.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git

Signed-off-by: Jon Pan-Doh <pandoh@google.com>
Reviewed-by: Karolina Stolarek <karolina.stolarek@oracle.com>
---
 drivers/pci/pcie/aer.c | 74 +++++++++++++++++++++++++++++++++---------
 1 file changed, 58 insertions(+), 16 deletions(-)
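
For context, the per-device limits come straight from the standard
<linux/ratelimit.h> helpers, where DEFAULT_RATELIMIT_INTERVAL is 5 * HZ and
DEFAULT_RATELIMIT_BURST is 10, i.e. at most 10 messages per 5 seconds. A
minimal sketch of that pattern (a generic illustration with made-up names,
not code from the patch):

  #include <linux/pci.h>
  #include <linux/ratelimit.h>

  static DEFINE_RATELIMIT_STATE(example_rs, DEFAULT_RATELIMIT_INTERVAL,
                                DEFAULT_RATELIMIT_BURST);

  static void example_log(struct pci_dev *dev)
  {
          /* __ratelimit() returns nonzero while the burst budget allows it */
          if (__ratelimit(&example_rs))
                  pci_info(dev, "example AER message\n");
  }

The patch keeps one such state per device and per error class rather than a
single global one, which is what makes the limit per-device.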

Comments

Karolina Stolarek March 20, 2025, 2:56 p.m. UTC | #1
On 20/03/2025 09:20, Jon Pan-Doh wrote:
> Spammy devices can flood kernel logs with AER errors and slow/stall 
> execution. Add per-device ratelimits for AER correctable and 
> uncorrectable errors that use the kernel defaults (10 per 5s).
> 
> Tested using aer-inject[1]. Sent 11 AER errors. Observed 10 errors logged
> while AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) show
> true count of 11.
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> Reviewed-by: Karolina Stolarek <karolina.stolarek@oracle.com>

For future reference -- please drop r-bs from patches that have 
functional/bigger changes. New code nullifies previous reviews.

> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 3069376b3553..081cef5fc678 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -88,6 +89,10 @@ struct aer_report {
>   	u64 rootport_total_cor_errs;
>   	u64 rootport_total_fatal_errs;
>   	u64 rootport_total_nonfatal_errs;
> +
> +	/* Ratelimits for errors */
> +	struct ratelimit_state cor_log_ratelimit;
> +	struct ratelimit_state uncor_log_ratelimit;
>   };
>   
>   #define AER_LOG_TLP_MASKS		(PCI_ERR_UNC_POISON_TLP|	\
> @@ -379,6 +384,15 @@ void pci_aer_init(struct pci_dev *dev)
>   
>   	dev->aer_report = kzalloc(sizeof(*dev->aer_report), GFP_KERNEL);
>   
> +	/*
> +	 * Ratelimits are doubled as a given error produces 2 logs (root port
> +	 * and endpoint) that should be under same ratelimit.
> +	 */

It took me a bit to understand what this comment is about.

When we handle an error message, we first use the source's ratelimit to 
decide if we want to print the port info, and then the actual error. In 
theory, there could be more errors of the same class coming from other 
devices within one message. For these devices, we would call the 
ratelimit just once. I don't have a nice and clean solution for this 
problem; I just wanted to highlight that 1) we don't use the Root Port's 
ratelimit in aer_print_port_info(), 2) we may use the bursts to either 
print port_info + error message or just the message, in different 
combinations. I think we should reword this comment to highlight the 
fact that we don't check the ratelimit once per error; we could do it twice.

Also, I wonder -- do only Endpoints generate error messages? From what I 
understand, some errors can also be detected by intermediary devices.

> +	ratelimit_state_init(&dev->aer_report->cor_log_ratelimit,
> +			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST * 2);
> +	ratelimit_state_init(&dev->aer_report->uncor_log_ratelimit,
> +			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST * 2);
> +
>   	/*
>   	 * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
>   	 * PCI_ERR_COR_MASK, and PCI_ERR_CAP.  Root and Root Complex Event
> @@ -668,6 +682,17 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
>   	}
>   }
>   
> +static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)

I really like this solution; it's nice and tidy.


>   static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
>   {

I'm also thinking -- we are ratelimiting the aer_print_port_info() and 
aer_print_error(). What about the messages in dpc_process_error()? 
Should we check early if DPC was triggered because of an uncorrectable 
error, and if so, ratelimit that?

All the best,
Karolina
Bjorn Helgaas March 20, 2025, 5:51 p.m. UTC | #2
On Thu, Mar 20, 2025 at 03:56:53PM +0100, Karolina Stolarek wrote:
> On 20/03/2025 09:20, Jon Pan-Doh wrote:
> > Spammy devices can flood kernel logs with AER errors and slow/stall
> > execution. Add per-device ratelimits for AER correctable and
> > uncorrectable errors that use the kernel defaults (10 per 5s).

> > +	/*
> > +	 * Ratelimits are doubled as a given error produces 2 logs (root port
> > +	 * and endpoint) that should be under same ratelimit.
> > +	 */
> 
> It took me a bit to understand what this comment is about.
> 
> When we handle an error message, we first use the source's ratelimit to
> decide if we want to print the port info, and then the actual error. In
> theory, there could be more errors of the same class coming from other
> devices within one message.

I think this refers to the fact that when we get an AER interrupt from
a Root Port, the RP has a single Requester ID logged in the Source
Identification, but if Multiple ERR_* is set, find_device_iter() may
collect error info from several devices?

> For these devices, we would call the ratelimit
> just once. I don't have a nice an clean solution for this problem, I just
> wanted to highlight that 1) we don't use the Root Port's ratelimit in
> aer_print_port_info(), 2) we may use the bursts to either print port_info +
> error message or just the message, in different combinations. I think we
> should reword this comment to highlight the fact that we don't check the
> ratelimit once per error, we could do it twice.

Very good point.  aer_print_rp_info() is always ratelimited based on
the ERR_* Source Identification, but if Multiple ERR_* is set,
aer_print_error() ratelimits based on whatever downstream device we
found that had any error of the same class logged.

E.g., if we had something like this topology:

  00:1c.0 Root Port to [bus 01-04]
  01:00.0 Switch Upstream Port to [bus 02-04]
  02:00.0 Switch Downstream Port to [bus 03]
  02:00.1 Switch Downstream Port to [bus 04]
  03:00.0 NIC
  04:00.0 NVMe

where 03:00.0 and 04:00.0 both logged ERR_FATAL, and the Root Port
received the 03:00.0 message first, 03:00.0 would be logged as the
Source Identification, and Multiple ERR_FATAL Received should be set.
The messages related to 00:1c.0 and 03:00.0 would be ratelimited based
on 03:00.0, but aer_print_error() messages related to 04:00.0 would be
ratelimited based on 04:00.0.

That does seem like a problem.  I would propose that we always
ratelimit using the device from Source Identification. I think that's
available in aer_print_error(); we would just use info->id instead of
"dev".

That does mean we'd have to figure out the pci_dev corresponding to
the Requester ID in info->id.  Maybe we could add an
aer_err_info.src_dev pointer, and fill it in when find_device_iter()
finds the right device?
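
A rough sketch of that idea (the src_dev member and the exact hook in
find_device_iter() are hypothetical, not code from this series):

  /* struct aer_err_info: proposed new member */
  struct pci_dev *src_dev;        /* device named by the ERR_* Source ID */

  /* in find_device_iter(), once the Requester ID matches */
  if (pci_dev_id(dev) == e_info->id)
          e_info->src_dev = dev;

  /* then every print path can ratelimit on the same device */
  if (info->src_dev && !aer_ratelimit(info->src_dev, info->severity))
          return;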

> Also, I wonder -- do only Endpoints generate error messages? From what I
> understand, that some errors can be detected by intermediary devices.

Yes, I think any device, including switches between a Root Port and
Endpoint, can detect errors and generate error messages.

I guess this means the "endpoint" variable in aer_print_port_info() is
probably too specific.  IIUC the aer_print_port_info() "dev" parameter
is always a Root Port since it came from aer_rpc.rpd.  Naming it "rp"
would make this more obvious and free up "dev" for the source device.
And "aer_print_port_info" itself could be more descriptive, e.g.,
"aer_print_rp_info()", since *every* PCIe device has a Port.

> I'm also thinking -- we are ratelimiting the aer_print_port_info() and
> aer_print_error(). What about the messages in dpc_process_error()? Should we
> check early if DPC was triggered because of an uncorrectable error, and if
> so, ratelimit that?

Good question.  It does seem like the dpc_process_error() messages
should be similarly ratelimited.  I think we currently only enable DPC
for fatal errors, and the Downstream Port takes the link down, which
resets the hierarchy below.  So (1) we probably won't see storms of
fatal error messages, and (2) it looks like we might not print any
error info from downstream devices, since they're not reachable while
the link is down.

It does seem like we *should* try to print that info after the link
comes back up, since the log registers are sticky and should survive
the reset.  Maybe we do that already and I just missed it.

This seems like something we could put off a little bit while we deal
with the AER correctable error issue.

Bjorn
Jon Pan-Doh March 20, 2025, 7:37 p.m. UTC | #3
On Thu, Mar 20, 2025 at 7:57 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> For future reference -- please drop r-bs from patches that have
> functional/bigger changes. New code nullifies previous reviews.

Ack. Will remove in v5.
Jon Pan-Doh March 20, 2025, 7:53 p.m. UTC | #4
On Thu, Mar 20, 2025 at 10:51 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Mar 20, 2025 at 03:56:53PM +0100, Karolina Stolarek wrote:
> > On 20/03/2025 09:20, Jon Pan-Doh wrote:
> > > +   /*
> > > +    * Ratelimits are doubled as a given error produces 2 logs (root port
> > > +    * and endpoint) that should be under same ratelimit.
> > > +    */
> > For these devices, we would call the ratelimit
> > just once. I don't have a nice an clean solution for this problem, I just
> > wanted to highlight that 1) we don't use the Root Port's ratelimit in
> > aer_print_port_info(), 2) we may use the bursts to either print port_info +
> > error message or just the message, in different combinations. I think we
> > should reword this comment to highlight the fact that we don't check the
> > ratelimit once per error, we could do it twice.

You're right. I was thinking of amending it to something like:

Ratelimits are doubled as a given error notification produces up to 2 logs
(1 at root port and 1 at source device) that should be under the same ratelimit.

> Very good point.  aer_print_rp_info() is always ratelimited based on
> the ERR_* Source Identification, but if Multiple ERR_* is set,
> aer_print_error() ratelimits based on whatever downstream device we
> found that had any error of the same class logged.
>
> That does seem like a problem.  I would propose that we always
> ratelimit using the device from Source Identification. I think that's
> available in aer_print_error(); we would just use info->id instead of
> "dev".

Wouldn't you be incorrectly counting the non-source ID devices then?
I think this is another reason where removing aer_print_port_info()[1] (only
printing port info when failing to get device error info) simplifies things. Of
course, we then have to weigh whether the loss of info is worth the reduction
in ratelimit complexity.

> > I'm also thinking -- we are ratelimiting the aer_print_port_info() and
> > aer_print_error(). What about the messages in dpc_process_error()? Should we
> > check early if DPC was triggered because of an uncorrectable error, and if
> > so, ratelimit that?
>
> Good question.  It does seem like the dpc_process_error() messages
> should be similarly ratelimited.  I think we currently only enable DPC
> for fatal errors, and the Downstream Port takes the link down, which
> resets the hierarchy below.  So (1) we probably won't see storms of
> fatal error messages, and (2) it looks like we might not print any
> error info from downstream devices, since they're not reachable while
> the link is down.

I did not expect error storms from DPC so I thought it best to focus on AER.

[1] https://lore.kernel.org/linux-pci/CAMC_AXWVOtKh2r4kX6c7jtJwQaEE4KEQsH=uoB1OhczJ=8K2VA@mail.gmail.com/

Thanks,
Jon
Bjorn Helgaas March 20, 2025, 8:29 p.m. UTC | #5
On Thu, Mar 20, 2025 at 12:53:53PM -0700, Jon Pan-Doh wrote:
> On Thu, Mar 20, 2025 at 10:51 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, Mar 20, 2025 at 03:56:53PM +0100, Karolina Stolarek wrote:
> > > On 20/03/2025 09:20, Jon Pan-Doh wrote:
> > > > +   /*
> > > > +    * Ratelimits are doubled as a given error produces 2 logs (root port
> > > > +    * and endpoint) that should be under same ratelimit.
> > > > +    */
> > > For these devices, we would call the ratelimit just once. I
> > > don't have a nice an clean solution for this problem, I just
> > > wanted to highlight that 1) we don't use the Root Port's
> > > ratelimit in aer_print_port_info(), 2) we may use the bursts to
> > > either print port_info + error message or just the message, in
> > > different combinations. I think we should reword this comment to
> > > highlight the fact that we don't check the ratelimit once per
> > > error, we could do it twice.
> 
> You're right. I was thinking of amending it to something like:
> 
> Ratelimits are doubled as a given error notification produces up to
> 2 logs (1 at root port and 1 at source device) that should be under
> the same ratelimit.
> 
> > Very good point.  aer_print_rp_info() is always ratelimited based
> > on the ERR_* Source Identification, but if Multiple ERR_* is set,
> > aer_print_error() ratelimits based on whatever downstream device
> > we found that had any error of the same class logged.
> >
> > That does seem like a problem.  I would propose that we always
> > ratelimit using the device from Source Identification. I think
> > that's available in aer_print_error(); we would just use info->id
> > instead of "dev".
> 
> Wouldn't you be incorrectly counting the non-source ID devices then?
> I think this is another reason where removing
> aer_print_port_info()[1] (only printing port info when failing to
> get device error info) simplifies things. Of course, we then have to
> weigh whether the loss of info is less than the ratelimit
> complexity.

Yes, I guess so.  Maybe the ratelimit should be in the source of the
interrupt (Root Port for AER, Root Port or Downstream Port for DPC) so
it's more directly related to the interrupt that got us here in the
first place.

I think the struct aer_err_info is basically a per-interrupt thing, so
maybe we could evaluate __ratelimit() once at the initial entry, save
the result in aer_err_info, and use that saved value everywhere we
print messages?

  - native AER: aer_isr_one_error() has RP pointer in rpc->rpd and
    could save it (or pointer to the RP's ratelimit struct, or just
    the result of __ratelimit()) in aer_err_info.

  - GHES AER: I'm not sure struct cper_sec_pcie contains the RP, might
    have to search upwards from the device we know about?

  - native DPC: dpc_process_error() has DP pointer and could save it
    in aer_err_info.

  - EDR DPC: passes DP pointer to dpc_process_error().
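
A sketch of what the native AER case could look like, assuming a hypothetical
ratelimit_print flag in struct aer_err_info and reusing the aer_ratelimit()
helper from this patch, keyed on the Root Port:

  /* struct aer_err_info: hypothetical new member */
  bool ratelimit_print;

  /* aer_isr_one_error(): evaluate once per interrupt, per severity block */
  e_info.ratelimit_print = aer_ratelimit(rpc->rpd, e_info.severity);

  /* aer_print_port_info()/aer_print_error(): honor the saved result */
  if (!info->ratelimit_print)
          return;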

> > > I'm also thinking -- we are ratelimiting the
> > > aer_print_port_info() and aer_print_error(). What about the
> > > messages in dpc_process_error()? Should we check early if DPC
> > > was triggered because of an uncorrectable error, and if so,
> > > ratelimit that?
> >
> > Good question.  It does seem like the dpc_process_error() messages
> > should be similarly ratelimited.  I think we currently only enable
> > DPC for fatal errors, and the Downstream Port takes the link down,
> > which resets the hierarchy below.  So (1) we probably won't see
> > storms of fatal error messages, and (2) it looks like we might not
> > print any error info from downstream devices, since they're not
> > reachable while the link is down.
> 
> I did not expect error storms from DPC so I thought it best to focus
> on AER.

Completely agreed.

> [1] https://lore.kernel.org/linux-pci/CAMC_AXWVOtKh2r4kX6c7jtJwQaEE4KEQsH=uoB1OhczJ=8K2VA@mail.gmail.com/

[BTW, email quoting style error here; shouldn't have the entire
message you're replying to repeated below.  Gmail tries hard to screw
this up for you ;)]

> On Thu, Mar 20, 2025 at 10:51 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> ... (snipped)
Sathyanarayanan Kuppuswamy March 21, 2025, 1 a.m. UTC | #6
Hi,

On 3/20/25 1:20 AM, Jon Pan-Doh wrote:
> Spammy devices can flood kernel logs with AER errors and slow/stall
> execution. Add per-device ratelimits for AER correctable and uncorrectable
> errors that use the kernel defaults (10 per 5s).

Should we exclude fatal errors from the rate limit? Fatal error logs
would be really useful for debug analysis, and they do not happen very
frequently.

Jon Pan-Doh March 21, 2025, 1:58 a.m. UTC | #7
On Thu, Mar 20, 2025 at 1:29 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Mar 20, 2025 at 12:53:53PM -0700, Jon Pan-Doh wrote:
> I think the struct aer_err_info is basically a per-interrupt thing, so
> maybe we could evaluate __ratelimit() once at the initial entry, save
> the result in aer_err_info, and use that saved value everywhere we
> print messages?

I like this approach. Another advantage is it removes the need for the 2x
ratelimit logic. Updated for v5.

>   - native AER: aer_isr_one_error() has RP pointer in rpc->rpd and
>     could save it (or pointer to the RP's ratelimit struct, or just
>     the result of __ratelimit()) in aer_err_info.

Similar to aer_err_info.dev[], I store the evaluated __ratelimit() in
aer_err_info.ratelimited[]. The main quirk is that, for multiple errors,
you won't see the root port log if the first error is ratelimited but the
subsequent errors are under the limit. I think this is fine, as the log
prints out the first error only, but I can change aer_print_port_info()
to log if any of the errors is under the limit.
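
Roughly, that bookkeeping could look like this (member name and hook placement
are approximate, assuming the existing dev[]/AER_MAX_MULTI_ERR_DEVICES
bookkeeping in struct aer_err_info):

  /* struct aer_err_info: hypothetical array parallel to dev[] */
  int ratelimited[AER_MAX_MULTI_ERR_DEVICES];

  /* where find_device_iter()/add_error_device() record a matching device */
  info->ratelimited[info->error_dev_num] = aer_ratelimit(dev, info->severity);
  info->dev[info->error_dev_num] = dev;
  info->error_dev_num++;

  /* and when the per-device errors are finally printed */
  if (!info->ratelimited[i])
          continue;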

>   - GHES AER: I'm not sure struct cper_sec_pcie contains the RP, might
>     have to search upwards from the device we know about?
>
>   - native DPC: dpc_process_error() has DP pointer and could save it
>     in aer_err_info.
>
>   - EDR DPC: passes DP pointer to dpc_process_error().

These are largely unchanged:
- GHES/CXL gated by aer_ratelimit() in pci_print_aer()
- DPC not ratelimited with the expectation that there won't be error storms

Thanks,
Jon
Jon Pan-Doh March 21, 2025, 7:24 p.m. UTC | #8
On Thu, Mar 20, 2025 at 6:00 PM Sathyanarayanan Kuppuswamy
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> Should we exclude fatal errors from the rate limit? Fatal error logs
> would be
> really useful for debug analysis, and they not happen very frequently.

The logs today only make the distinction between correctable vs.
uncorrectable so I thought it made sense to be consistent.

Maybe this is something that could be deferred? The only fixed
component is the sysfs attribute names (which can be made to refer to
uncorrectable nonfatal vs. uncorrectable in doc/underlying
implementation).

Thanks,
Jon
Sathyanarayanan Kuppuswamy March 21, 2025, 9:47 p.m. UTC | #9
Hi Jon,

On 3/21/25 12:24 PM, Jon Pan-Doh wrote:
> On Thu, Mar 20, 2025 at 6:00 PM Sathyanarayanan Kuppuswamy
> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>> Should we exclude fatal errors from the rate limit? Fatal error logs
>> would be
>> really useful for debug analysis, and they not happen very frequently.
> The logs today only make the distinction between correctable vs.
> uncorrectable so I thought it made sense to be consistent.

You're right. From a logging perspective, the current driver only
differentiates between correctable and uncorrectable errors. However,
the goal of your patch series is to reduce the spam of frequent errors.
While we are rate-limiting these frequent logs, we must ensure that we
don't miss important logs. I believe we did not rate-limit DPC logs for
this very reason.


>
> Maybe this is something that could be deferred? The only fixed

I am fine with deferring. IIUC, if needed, the user can skip the rate
limit for uncorrectable errors through sysfs, right?

But, is the required change to do this complex? Won't skipping the
rate limit check for fatal errors solve the problem?

Bjorn, any comments? Do you think Fatal errors should be
rate-limited?

> component is the sysfs attribute names (which can be made to refer to
> uncorrectable nonfatal vs. uncorrectable in doc/underlying
> implementation).
>
> Thanks,
> Jon
Bjorn Helgaas March 21, 2025, 9:59 p.m. UTC | #10
On Fri, Mar 21, 2025 at 02:47:36PM -0700, Sathyanarayanan Kuppuswamy wrote:
> On 3/21/25 12:24 PM, Jon Pan-Doh wrote:
> > On Thu, Mar 20, 2025 at 6:00 PM Sathyanarayanan Kuppuswamy
> > <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> > > Should we exclude fatal errors from the rate limit? Fatal error
> > > logs would be really useful for debug analysis, and they not
> > > happen very frequently.
>
> > The logs today only make the distinction between correctable vs.
> > uncorrectable so I thought it made sense to be consistent.
> 
> You're right. From a logging perspective, the current driver only
> differentiates between correctable and uncorrectable errors.
> However, the goal of your patch series is to reduce the spam of
> frequent errors.  While we are rate-limiting these frequent logs, we
> must ensure that we don't miss important logs. I believe we did not
> rate-limit DPC logs for this very reason.
> 
> > Maybe this is something that could be deferred? The only fixed
> 
> I am fine with deferring. IIUC, if needed, through sysfs user can
> skip rate-limit for uncorrectable errors, right?
> 
> But, is the required change to do this complex? Won't skipping the
> rate limit check for fatal errors solve the problem?
> 
> Bjorn, any comments? Do you think Fatal errors should be
> rate-limited?

I'm inclined to not ratelimit fatal errors unless we've seen issues
with a flood of them.

Bjorn
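
Exempting fatal errors would be a small change against the aer_ratelimit()
helper added by this patch; a sketch of that option (not part of the posted
series):

  static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)
  {
          struct ratelimit_state *ratelimit;

          /* fatal errors are rare and valuable for debug; never drop them */
          if (severity == AER_FATAL)
                  return 1;

          if (severity == AER_CORRECTABLE)
                  ratelimit = &dev->aer_report->cor_log_ratelimit;
          else
                  ratelimit = &dev->aer_report->uncor_log_ratelimit;
          return __ratelimit(ratelimit);
  }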
Jon Pan-Doh March 21, 2025, 10:11 p.m. UTC | #11
On Fri, Mar 21, 2025 at 2:47 PM Sathyanarayanan Kuppuswamy
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> I am fine with deferring. IIUC, if needed, through sysfs user can
> skip rate-limit for uncorrectable errors, right?

Yes.

> But, is the required change to do this complex? Won't skipping the
> rate limit check for fatal errors solve the problem?

No, it's not complex. I was trying to minimize extra changes in an
attempt to get this in for 6.15. However, we're right against the
merge window so that may not be possible.

Thanks,
Jon

Patch

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 3069376b3553..081cef5fc678 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -28,6 +28,7 @@ 
 #include <linux/interrupt.h>
 #include <linux/delay.h>
 #include <linux/kfifo.h>
+#include <linux/ratelimit.h>
 #include <linux/slab.h>
 #include <acpi/apei.h>
 #include <acpi/ghes.h>
@@ -88,6 +89,10 @@  struct aer_report {
 	u64 rootport_total_cor_errs;
 	u64 rootport_total_fatal_errs;
 	u64 rootport_total_nonfatal_errs;
+
+	/* Ratelimits for errors */
+	struct ratelimit_state cor_log_ratelimit;
+	struct ratelimit_state uncor_log_ratelimit;
 };
 
 #define AER_LOG_TLP_MASKS		(PCI_ERR_UNC_POISON_TLP|	\
@@ -379,6 +384,15 @@  void pci_aer_init(struct pci_dev *dev)
 
 	dev->aer_report = kzalloc(sizeof(*dev->aer_report), GFP_KERNEL);
 
+	/*
+	 * Ratelimits are doubled as a given error produces 2 logs (root port
+	 * and endpoint) that should be under same ratelimit.
+	 */
+	ratelimit_state_init(&dev->aer_report->cor_log_ratelimit,
+			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST * 2);
+	ratelimit_state_init(&dev->aer_report->uncor_log_ratelimit,
+			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST * 2);
+
 	/*
 	 * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
 	 * PCI_ERR_COR_MASK, and PCI_ERR_CAP.  Root and Root Complex Event
@@ -668,6 +682,17 @@  static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
 	}
 }
 
+static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)
+{
+	struct ratelimit_state *ratelimit;
+
+	if (severity == AER_CORRECTABLE)
+		ratelimit = &dev->aer_report->cor_log_ratelimit;
+	else
+		ratelimit = &dev->aer_report->uncor_log_ratelimit;
+	return __ratelimit(ratelimit);
+}
+
 static void __aer_print_error(struct pci_dev *dev,
 			      struct aer_err_info *info,
 			      const char *level)
@@ -698,6 +723,12 @@  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info,
 	int layer, agent;
 	int id = pci_dev_id(dev);
 
+	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
+			info->severity, info->tlp_header_valid, &info->tlp);
+
+	if (!aer_ratelimit(dev, info->severity))
+		return;
+
 	if (!info->status) {
 		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
 			aer_error_severity_string[info->severity]);
@@ -722,21 +753,28 @@  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info,
 out:
 	if (info->id && info->error_dev_num > 1 && info->id == id)
 		pci_err(dev, "  Error of this Agent is reported first\n");
-
-	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
-			info->severity, info->tlp_header_valid, &info->tlp);
 }
 
 static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
 {
 	u8 bus = info->id >> 8;
 	u8 devfn = info->id & 0xff;
+	struct pci_dev *endpoint;
+	int i;
+
+	/* extract endpoint device ratelimit */
+	for (i = 0; i < info->error_dev_num; i++) {
+		endpoint = info->dev[i];
+		if (info->id == pci_dev_id(endpoint))
+			break;
+	}
 
-	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d\n",
-		 info->multi_error_valid ? "Multiple " : "",
-		 aer_error_severity_string[info->severity],
-		 pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
-		 PCI_FUNC(devfn));
+	if (aer_ratelimit(endpoint, info->severity))
+		pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d\n",
+			 info->multi_error_valid ? "Multiple " : "",
+			 aer_error_severity_string[info->severity],
+			 pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
+			 PCI_FUNC(devfn));
 }
 
 #ifdef CONFIG_ACPI_APEI_PCIEAER
@@ -784,6 +822,12 @@  void pci_print_aer(struct pci_dev *dev, int aer_severity,
 
 	pci_dev_aer_stats_incr(dev, &info);
 
+	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
+			aer_severity, tlp_header_valid, &aer->header_log);
+
+	if (!aer_ratelimit(dev, aer_severity))
+		return;
+
 	aer_printk(level, dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
 	__aer_print_error(dev, &info, level);
 	aer_printk(level, dev, "aer_layer=%s, aer_agent=%s\n",
@@ -795,9 +839,6 @@  void pci_print_aer(struct pci_dev *dev, int aer_severity,
 
 	if (tlp_header_valid)
 		pcie_print_tlp_log(dev, &aer->header_log, dev_fmt("  "));
-
-	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
-			aer_severity, tlp_header_valid, &aer->header_log);
 }
 EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
 
@@ -1299,10 +1340,11 @@  static void aer_isr_one_error(struct aer_rpc *rpc,
 			e_info.multi_error_valid = 1;
 		else
 			e_info.multi_error_valid = 0;
-		aer_print_port_info(pdev, &e_info);
 
-		if (find_source_device(pdev, &e_info))
+		if (find_source_device(pdev, &e_info)) {
+			aer_print_port_info(pdev, &e_info);
 			aer_process_err_devices(&e_info, KERN_WARNING);
+		}
 	}
 
 	if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
@@ -1318,10 +1360,10 @@  static void aer_isr_one_error(struct aer_rpc *rpc,
 		else
 			e_info.multi_error_valid = 0;
 
-		aer_print_port_info(pdev, &e_info);
-
-		if (find_source_device(pdev, &e_info))
+		if (find_source_device(pdev, &e_info)) {
+			aer_print_port_info(pdev, &e_info);
 			aer_process_err_devices(&e_info, KERN_ERR);
+		}
 	}
 }