[5/8] PCI/AER: Introduce ratelimit for AER IRQs

Message ID	20250115074301.3514927-6-pandoh@google.com (mailing list archive)
State	Superseded
Delegated to:	Bjorn Helgaas
Headers	show Received: from mail-pl1-f201.google.com (mail-pl1-f201.google.com [209.85.214.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1210222F84F for <linux-pci@vger.kernel.org>; Wed, 15 Jan 2025 07:43:28 +0000 (UTC) Date: Tue, 14 Jan 2025 23:42:57 -0800 In-Reply-To: <20250115074301.3514927-1-pandoh@google.com> Precedence: bulk Mime-Version: 1.0 References: <20250115074301.3514927-1-pandoh@google.com> Message-ID: <20250115074301.3514927-6-pandoh@google.com> Subject: [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs From: Jon Pan-Doh <pandoh@google.com> To: Bjorn Helgaas <bhelgaas@google.com>, Karolina Stolarek <karolina.stolarek@oracle.com> Cc: linux-pci@vger.kernel.org, Martin Petersen <martin.petersen@oracle.com>, Ben Fuller <ben.fuller@oracle.com>, Drew Walton <drewwalton@microsoft.com>, Anil Agrawal <anilagrawal@meta.com>, Tony Luck <tony.luck@intel.com>, Jon Pan-Doh <pandoh@google.com> Content-Type: text/plain; charset="UTF-8"
Series	Rate limit AER logs/IRQs \| expand [0/8] Rate limit AER logs/IRQs [1/8] PCI/AER: Remove aer_print_port_info [2/8] PCI/AER: Move AER stat collection out of __aer_print_error [3/8] PCI/AER: Rename struct aer_stats to aer_info [4/8] PCI/AER: Introduce ratelimit for error logs [5/8] PCI/AER: Introduce ratelimit for AER IRQs [6/8] PCI/AER: Add AER sysfs attributes for ratelimits [7/8] PCI/AER: Update AER sysfs ABI filename [8/8] PCI/AER: Move AER sysfs attributes into separate directory

Jon Pan-Doh Jan. 15, 2025, 7:42 a.m. UTC

After ratelimiting logs, spammy devices can still slow execution by
continued AER IRQ servicing.

Add higher per-device ratelimits for AER errors to mask out those IRQs.
Set the default rate to 3x default AER ratelimit (30 per 5s).

Tested using aer-inject[1] tool. Injected 32 AER errors. Observed IRQ
masked via lspci and sysfs counters record 31 errors (1 masked).

Before: CEMsk: BadTLP-
After:  CEMsk: BadTLP+

[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git

Signed-off-by: Jon Pan-Doh <pandoh@google.com>
---
 Documentation/PCI/pcieaer-howto.rst |  4 +-
 drivers/pci/pcie/aer.c              | 64 +++++++++++++++++++++++++----
 2 files changed, 57 insertions(+), 11 deletions(-)

Karolina Stolarek Jan. 16, 2025, 12:02 p.m. UTC | #1

On 15/01/2025 08:42, Jon Pan-Doh wrote:
> (...)
 > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
 > index 025c50b0f293..1db70ae87f52 100644
 > --- a/drivers/pci/pcie/aer.c
 > +++ b/drivers/pci/pcie/aer.c
> @@ -676,6 +682,39 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
>   	}
>   }
>   
> +static void mask_reported_error(struct pci_dev *dev, struct aer_err_info *info)
> +{
> +	const char **strings;
> +	const char *errmsg;
> +	u16 aer_offset = dev->aer_cap;
> +	u16 mask_reg_offset;
> +	u32 mask;
> +	unsigned long status = info->status;
> +	int i;
> +
> +	if (info->severity == AER_CORRECTABLE) {
> +		strings = aer_correctable_error_string;
> +		mask_reg_offset = PCI_ERR_COR_MASK;
> +	} else {
> +		strings = aer_uncorrectable_error_string;
> +		mask_reg_offset = PCI_ERR_UNCOR_MASK;
> +	}
> +
> +	pci_read_config_dword(dev, aer_offset + mask_reg_offset, &mask);
> +	mask |= status;
> +	pci_write_config_dword(dev, aer_offset + mask_reg_offset, mask);
> +
> +	pci_warn(dev, "%s error(s) masked due to rate-limiting:",
> +		 aer_error_severity_string[info->severity]);
> +	for_each_set_bit(i, &status, 32) {
> +		errmsg = strings[i];
> +		if (!errmsg)
> +			errmsg = "Unknown Error Bit";
> +
> +		pci_warn(dev, "   [%2d] %-22s\n", i, errmsg);
> +	}
> +}
> +

My approach was more heavy-handed :)

To confirm that I understand the flow -- when we're processing
aer_err_info, that potentially carries a couple of errors, and we hit a
ratelimit, we mask the error bits in Error Status Register and print
a warning. After doing so, we won't see these types of errors reported
again. What if some time passes (let's say, 2 mins), and we hit a
condition that would normally generate an error but it's now masked? Are
we fine with missing it? I think we should be informed about
Uncorrectable errors as much as possible, as they indicate Link
integrity issues.

> -	if (!__ratelimit(ratelimit))
> +	if (!__ratelimit(irq_ratelimit)) {
> +		mask_reported_error(dev, info);
> +		return;
> +	}
> +	if (!__ratelimit(log_ratelimit))
>   		return;

Nit: add a line between these two ifs

All the best,
Karolina

Jon Pan-Doh Jan. 18, 2025, 1:58 a.m. UTC | #2

On Thu, Jan 16, 2025 at 4:02 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> To confirm that I understand the flow -- when we're processing
> aer_err_info, that potentially carries a couple of errors, and we hit a
> ratelimit, we mask the error bits in Error Status Register and print
> a warning. After doing so, we won't see these types of errors reported
> again. What if some time passes (let's say, 2 mins), and we hit a
> condition that would normally generate an error but it's now masked? Are
> we fine with missing it? I think we should be informed about
> Uncorrectable errors as much as possible, as they indicate Link
> integrity issues.

Your understanding is correct. There's definitely more nuance/tradeoff
with uncorrectable errors (likelihood of uncorrectable spam vs.
missing critical errors). At the minimum, I think the uncorrectable
IRQ default should be higher (semi-arbitrarily chose defaults for
IRQs).

I think a dynamic (un)masking in the kernel is a bit too much and
punted the decision to userspace (e.g. rasdaemon et al.) to manage
(part of OCP Fault Management groups roadmap).

Other options include:
- only focus on correctable errors
    - seen uncorrectable spam e.g. new HW bringup but it is rarer
- some type of system-wide toggle (sysfs, kernel config/cmdline) for
uncorrectable spam handling (may be clunky)

Thanks,
Jon

Karolina Stolarek Jan. 20, 2025, 10:38 a.m. UTC | #3

On 18/01/2025 02:58, Jon Pan-Doh wrote:
> On Thu, Jan 16, 2025 at 4:02 AM Karolina Stolarek
> <karolina.stolarek@oracle.com> wrote:
>> To confirm that I understand the flow -- when we're processing
>> aer_err_info, that potentially carries a couple of errors, and we hit a
>> ratelimit, we mask the error bits in Error Status Register and print
>> a warning. After doing so, we won't see these types of errors reported
>> again. What if some time passes (let's say, 2 mins), and we hit a
>> condition that would normally generate an error but it's now masked? Are
>> we fine with missing it? I think we should be informed about
>> Uncorrectable errors as much as possible, as they indicate Link
>> integrity issues.
> 
> Your understanding is correct. There's definitely more nuance/tradeoff
> with uncorrectable errors (likelihood of uncorrectable spam vs.
> missing critical errors). At the minimum, I think the uncorrectable
> IRQ default should be higher (semi-arbitrarily chose defaults for
> IRQs).

My comment was mostly me worrying about Uncorrectable errors. Pulling 
the plug on Correctable errors after we see too many is reasonable, I think.

> I think a dynamic (un)masking in the kernel is a bit too much and
> punted the decision to userspace (e.g. rasdaemon et al.) to manage
> (part of OCP Fault Management groups roadmap).

I agree. Still, if we decide to go with IRQ masking for (Un)correctable 
errors, this should be communicated to the user in the documentation.

All the best,
Karolina

> 
> Other options include:
> - only focus on correctable errors
>      - seen uncorrectable spam e.g. new HW bringup but it is rarer
> - some type of system-wide toggle (sysfs, kernel config/cmdline) for
> uncorrectable spam handling (may be clunky)
> 
> Thanks,
> Jon

Lukas Wunner Jan. 25, 2025, 7:39 a.m. UTC | #4

On Tue, Jan 14, 2025 at 11:42:57PM -0800, Jon Pan-Doh wrote:
> After ratelimiting logs, spammy devices can still slow execution by
> continued AER IRQ servicing.
> 
> Add higher per-device ratelimits for AER errors to mask out those IRQs.
> Set the default rate to 3x default AER ratelimit (30 per 5s).

Masking errors at the register level feels overzealous,
in particular because it also disables logging via tracepoints.

Is there a concrete device that necessitates this change?
If there is, consider adding a quirk for this particular device
which masks specific errors, but doesn't affect other devices.
If there isn't, consider dropping this change until a buggy device
appears that actually needs it.

Thanks,

Lukas

Jonathan Cameron Jan. 31, 2025, 2:43 p.m. UTC | #5

On Sat, 25 Jan 2025 08:39:35 +0100
Lukas Wunner <lukas@wunner.de> wrote:

> On Tue, Jan 14, 2025 at 11:42:57PM -0800, Jon Pan-Doh wrote:
> > After ratelimiting logs, spammy devices can still slow execution by
> > continued AER IRQ servicing.
> > 
> > Add higher per-device ratelimits for AER errors to mask out those IRQs.
> > Set the default rate to 3x default AER ratelimit (30 per 5s).  
> 
> Masking errors at the register level feels overzealous,
> in particular because it also disables logging via tracepoints.
> 
> Is there a concrete device that necessitates this change?
> If there is, consider adding a quirk for this particular device
> which masks specific errors, but doesn't affect other devices.
> If there isn't, consider dropping this change until a buggy device
> appears that actually needs it.

Fully agree with this comment.  At very least this should default
to not ratelimiting on the tracepoints unless a specific opt in has
occurred (probably a platform or device driver quirk).

In particular I'd worry that you are masking whatever errors are
finally trigger masking.  That might be the only one of that
particular type that was seen and I think the only report we
see is the 'I masked it message'.  So rasdaemon for example
never sees the error at all.   So another tweak would be report
one last time so we definitely see any given error type at least
once.

For CXL errors we trigger off one AER error type (internal error),
but then that is multiplexed onto finer grained errors. Even if
we fix the above we would want the masking in the CXL RAS controls,
not AER.

Jonathan

> 
> Thanks,
> 
> Lukas
>

Jonathan Cameron Jan. 31, 2025, 2:55 p.m. UTC | #6

On Tue, 14 Jan 2025 23:42:57 -0800
Jon Pan-Doh <pandoh@google.com> wrote:

> After ratelimiting logs, spammy devices can still slow execution by
> continued AER IRQ servicing.
> 
> Add higher per-device ratelimits for AER errors to mask out those IRQs.
> Set the default rate to 3x default AER ratelimit (30 per 5s).
> 
> Tested using aer-inject[1] tool. Injected 32 AER errors. Observed IRQ
> masked via lspci and sysfs counters record 31 errors (1 masked).
> 
> Before: CEMsk: BadTLP-
> After:  CEMsk: BadTLP+
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
> 
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
Hi Jon,

Comment inline.

> ---
>  Documentation/PCI/pcieaer-howto.rst |  4 +-
>  drivers/pci/pcie/aer.c              | 64 +++++++++++++++++++++++++----
>  2 files changed, 57 insertions(+), 11 deletions(-)
> 
> diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
> index 5546de60f184..d41272504b18 100644
> --- a/Documentation/PCI/pcieaer-howto.rst
> +++ b/Documentation/PCI/pcieaer-howto.rst
> @@ -88,8 +88,8 @@ fields.
>  AER Ratelimits
>  -------------------------
>  
> -Error messages are ratelimited per device and error type. This prevents spammy
> -devices from flooding the console.
> +Errors, both at log and IRQ level, are ratelimited per device and error type.
> +This prevents spammy devices from stalling execution.
>  
>  AER Statistics / Counters
>  -------------------------
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 025c50b0f293..1db70ae87f52 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -87,6 +87,8 @@ struct aer_info {
>  	u64 rootport_total_nonfatal_errs;
>  
>  	/* Ratelimits for errors */
> +	struct ratelimit_state cor_irq_ratelimit;
> +	struct ratelimit_state uncor_irq_ratelimit;
>  	struct ratelimit_state cor_log_ratelimit;
>  	struct ratelimit_state uncor_log_ratelimit;
>  };
> @@ -379,6 +381,10 @@ void pci_aer_init(struct pci_dev *dev)
>  		return;
>  
>  	dev->aer_info = kzalloc(sizeof(struct aer_info), GFP_KERNEL);
> +	ratelimit_state_init(&dev->aer_info->cor_irq_ratelimit,
> +			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST*3);
> +	ratelimit_state_init(&dev->aer_info->uncor_irq_ratelimit,
> +			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST*3);
>  	ratelimit_state_init(&dev->aer_info->cor_log_ratelimit,
>  			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
>  	ratelimit_state_init(&dev->aer_info->uncor_log_ratelimit,
> @@ -676,6 +682,39 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
>  	}
>  }
>  
> +static void mask_reported_error(struct pci_dev *dev, struct aer_err_info *info)
> +{
> +	const char **strings;
> +	const char *errmsg;
> +	u16 aer_offset = dev->aer_cap;
> +	u16 mask_reg_offset;
> +	u32 mask;
> +	unsigned long status = info->status;
> +	int i;
> +
> +	if (info->severity == AER_CORRECTABLE) {
> +		strings = aer_correctable_error_string;
> +		mask_reg_offset = PCI_ERR_COR_MASK;
> +	} else {
> +		strings = aer_uncorrectable_error_string;
> +		mask_reg_offset = PCI_ERR_UNCOR_MASK;
> +	}
> +
> +	pci_read_config_dword(dev, aer_offset + mask_reg_offset, &mask);
> +	mask |= status;
> +	pci_write_config_dword(dev, aer_offset + mask_reg_offset, mask);
> +
> +	pci_warn(dev, "%s error(s) masked due to rate-limiting:",
> +		 aer_error_severity_string[info->severity]);
> +	for_each_set_bit(i, &status, 32) {
> +		errmsg = strings[i];
> +		if (!errmsg)
> +			errmsg = "Unknown Error Bit";
> +
> +		pci_warn(dev, "   [%2d] %-22s\n", i, errmsg);
> +	}
> +}
> +
>  static void __print_tlp_header(struct pci_dev *dev, struct pcie_tlp_log *t)
>  {
>  	pci_err(dev, "  TLP Header: %08x %08x %08x %08x\n",
> @@ -713,7 +752,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  	int layer, agent;
>  	int id = pci_dev_id(dev);
>  	const char *level;
> -	struct ratelimit_state *ratelimit;
> +	struct ratelimit_state *irq_ratelimit;
> +	struct ratelimit_state *log_ratelimit;
>  
>  	if (!info->status) {
>  		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> @@ -722,14 +762,20 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  	}
>  
>  	if (info->severity == AER_CORRECTABLE) {
> -		ratelimit = &dev->aer_info->cor_log_ratelimit;
> +		irq_ratelimit = &dev->aer_info->cor_irq_ratelimit;
> +		log_ratelimit = &dev->aer_info->cor_log_ratelimit;
>  		level = KERN_WARNING;
>  	} else {
> -		ratelimit = &dev->aer_info->uncor_log_ratelimit;
> +		irq_ratelimit = &dev->aer_info->uncor_irq_ratelimit;
> +		log_ratelimit = &dev->aer_info->uncor_log_ratelimit;
>  		level = KERN_ERR;
>  	}
>  
> -	if (!__ratelimit(ratelimit))
> +	if (!__ratelimit(irq_ratelimit)) {
> +		mask_reported_error(dev, info);
> +		return;
So if I follow correctly.  We count irqs for any error type and
then mask whatever was set on one that triggered this rate_limit check?
That last one isn't reported other than via a log message.

Imagine that is a totally unrelated error to the earlier ones,
now RASDaemon has no info on it at all as the tracepoint never
fired.  To me that's a very different situation to it knowing there
were 10 errors of the type vs more.

I'd like to see that final trace point and also to see a tracepoint
that lets rasdaemon etc know you cut off errors after this point
+ rasdaemon support for using that.

Terry can address if we need to do anything different for CXL given
he is updating this handling for that.  Superficially I think we
need to driver the masking down into the CXL RAS capability registers.
Internal error in the AER capabilities is far to big a hammer to apply.

> +	}
> +	if (!__ratelimit(log_ratelimit))
>  		return;
>  
>  	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
> @@ -776,14 +822,14 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  	int layer, agent, tlp_header_valid = 0;
>  	u32 status, mask;
>  	struct aer_err_info info;
> -	struct ratelimit_state *ratelimit;
> +	struct ratelimit_state *log_ratelimit;
>  
>  	if (aer_severity == AER_CORRECTABLE) {
> -		ratelimit = &dev->aer_info->cor_log_ratelimit;
> +		log_ratelimit = &dev->aer_info->cor_log_ratelimit;
>  		status = aer->cor_status;
>  		mask = aer->cor_mask;
>  	} else {
> -		ratelimit = &dev->aer_info->uncor_log_ratelimit;
> +		log_ratelimit = &dev->aer_info->uncor_log_ratelimit;
>  		status = aer->uncor_status;
>  		mask = aer->uncor_mask;
>  		tlp_header_valid = status & AER_LOG_TLP_MASKS;
> @@ -799,8 +845,8 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  	info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
>  
>  	pci_dev_aer_stats_incr(dev, &info);
> -
> -	if (!__ratelimit(ratelimit))
> +	/* Only ratelimit logs (no IRQ) as AERs reported via GHES/CXL (caller). */
> +	if (!__ratelimit(log_ratelimit))
>  		return;
>  
>  	pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);

Karolina Stolarek Feb. 6, 2025, 1:56 p.m. UTC | #7

On 25/01/2025 08:39, Lukas Wunner wrote:
> 
> Masking errors at the register level feels overzealous,
> in particular because it also disables logging via tracepoints.
> 
> Is there a concrete device that necessitates this change?

I faced issues with excessive Correctable Errors reporting with Samsung 
PM1733 NVMe (a couple of thousand errors per hour), which were still 
polluting the logs even after introducing a ratelimit (first every 2s, 
second ever 30s, as proposed in [1]). Also, instead of masking the 
errors automatically, we could give a user a sysfs knob to turn error 
generation off and on.

> If there is, consider adding a quirk for this particular device
> which masks specific errors, but doesn't affect other devices.

There were many other reports of Correctable Error floods, as signaled 
in the cover letter, so it's hard to pinpoint the specific driver that 
should mask these errors.

All the best,
Karolina

-------------------------------------
[1] - 
https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stolarek@oracle.com/

> If there isn't, consider dropping this change until a buggy device
> appears that actually needs it.
> 
> Thanks,
> 
> Lukas

Lukas Wunner Feb. 6, 2025, 8:25 p.m. UTC | #8

On Thu, Feb 06, 2025 at 02:56:20PM +0100, Karolina Stolarek wrote:
> On 25/01/2025 08:39, Lukas Wunner wrote:
> > Masking errors at the register level feels overzealous,
> > in particular because it also disables logging via tracepoints.
> > 
> > Is there a concrete device that necessitates this change?
> 
> I faced issues with excessive Correctable Errors reporting with Samsung
> PM1733 NVMe (a couple of thousand errors per hour), which were still
> polluting the logs even after introducing a ratelimit

I'd suggest to add a "u32 aer_cor_mask" to "struct pci_dev" in the
"#ifdef CONFIG_PCIEAER" section.

Then add a "DECLARE_PCI_FIXUP_HEADER()" macro in drivers/pci/quirks.c
for the Samsung PM1733 which calls a new function which sets exactly the
error bits you're seeing to aer_cor_mask.  This should be #ifdef'ed to
CONFIG_PCIEAER as well.

Finally, amend aer.c to set the bits in aer_cor_mask in the
PCI_ERR_COR_MASK register on probe.

> > If there is, consider adding a quirk for this particular device
> > which masks specific errors, but doesn't affect other devices.
> 
> There were many other reports of Correctable Error floods, as signaled in
> the cover letter, so it's hard to pinpoint the specific driver that should
> mask these errors.

If a specific device frequently signals the same errors,
I think that's a bug of that device and if the vendor doesn't
provide a firmware update, quiescing the errors through a quirk
is a plausible solution.

Of course if this is widespread, it becomes a maintenance nightmare
and then the quirk approach is not a viable option.  I cannot say
whether that's the case.  So far there's a report for one specific
product (the Samsung drive) and hinting that the problem may be
widespread.  It's difficult to make a recommendation without
precise data.

Thanks,

Lukas

Jon Pan-Doh March 4, 2025, 11:42 p.m. UTC | #9

On Fri, Jan 31, 2025 at 6:44 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Sat, 25 Jan 2025 08:39:35 +0100
> Lukas Wunner <lukas@wunner.de> wrote:
> > Masking errors at the register level feels overzealous,
> > in particular because it also disables logging via tracepoints.
> >
> > Is there a concrete device that necessitates this change?
> > If there is, consider adding a quirk for this particular device
> > which masks specific errors, but doesn't affect other devices.
> > If there isn't, consider dropping this change until a buggy device
> > appears that actually needs it.
>
> Fully agree with this comment.  At very least this should default
> to not ratelimiting on the tracepoints unless a specific opt in has
> occurred (probably a platform or device driver quirk).

Hi Lukas and Jonathan,

Thanks for the comments. Since IRQ ratelimiting/masking is more
drastic, it requires more nuance/thought (split the series in v2[1] as
a result).

I am not targeting specific devices per say. The intent is to allow
userspace daemons/agents to dynamically collect telemetry/handle spam.
In the context of the datacenter (i.e. several OCP members), there are
many deployments of new HW/configurations where we may see/have seen
error spam when trying to enable native AER. Kernel quirks work in the
medium term (until the underlying device is fixed), but require a
kernel rollout. There is a desire to address this faster (i.e. without
rollout/reinstall) and I think IRQ ratelimiting fits the requirements.

I like the idea of having IRQ ratelimiting off as default though as it
is a big change.

> In particular I'd worry that you are masking whatever errors are
> finally trigger masking.  That might be the only one of that
> particular type that was seen and I think the only report we
> see is the 'I masked it message'.  So rasdaemon for example
> never sees the error at all.   So another tweak would be report
> one last time so we definitely see any given error type at least
> once.

Ack.

[1] https://lore.kernel.org/linux-pci/20250214023543.992372-1-pandoh@google.com/

Thanks,
Jon

Jon Pan-Doh March 4, 2025, 11:42 p.m. UTC | #10

On Fri, Jan 31, 2025 at 6:55 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
> So if I follow correctly.  We count irqs for any error type and
> then mask whatever was set on one that triggered this rate_limit check?
> That last one isn't reported other than via a log message.
>
> Imagine that is a totally unrelated error to the earlier ones,
> now RASDaemon has no info on it at all as the tracepoint never
> fired.  To me that's a very different situation to it knowing there
> were 10 errors of the type vs more.
>
> I'd like to see that final trace point and also to see a tracepoint
> that lets rasdaemon etc know you cut off errors after this point
> + rasdaemon support for using that.

Your understanding is correct. I get your point and will try to
address this when I send out the IRQ ratelimiting series.

Thanks,
Jon

[5/8] PCI/AER: Introduce ratelimit for AER IRQs

Commit Message

Comments

Patch