Message ID | 20250115074301.3514927-6-pandoh@google.com (mailing list archive) |
---|---|
State | New |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | Rate limit AER logs/IRQs | expand |
On 15/01/2025 08:42, Jon Pan-Doh wrote: > (...) > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index 025c50b0f293..1db70ae87f52 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -676,6 +682,39 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev, > } > } > > +static void mask_reported_error(struct pci_dev *dev, struct aer_err_info *info) > +{ > + const char **strings; > + const char *errmsg; > + u16 aer_offset = dev->aer_cap; > + u16 mask_reg_offset; > + u32 mask; > + unsigned long status = info->status; > + int i; > + > + if (info->severity == AER_CORRECTABLE) { > + strings = aer_correctable_error_string; > + mask_reg_offset = PCI_ERR_COR_MASK; > + } else { > + strings = aer_uncorrectable_error_string; > + mask_reg_offset = PCI_ERR_UNCOR_MASK; > + } > + > + pci_read_config_dword(dev, aer_offset + mask_reg_offset, &mask); > + mask |= status; > + pci_write_config_dword(dev, aer_offset + mask_reg_offset, mask); > + > + pci_warn(dev, "%s error(s) masked due to rate-limiting:", > + aer_error_severity_string[info->severity]); > + for_each_set_bit(i, &status, 32) { > + errmsg = strings[i]; > + if (!errmsg) > + errmsg = "Unknown Error Bit"; > + > + pci_warn(dev, " [%2d] %-22s\n", i, errmsg); > + } > +} > + My approach was more heavy-handed :) To confirm that I understand the flow -- when we're processing aer_err_info, that potentially carries a couple of errors, and we hit a ratelimit, we mask the error bits in Error Status Register and print a warning. After doing so, we won't see these types of errors reported again. What if some time passes (let's say, 2 mins), and we hit a condition that would normally generate an error but it's now masked? Are we fine with missing it? I think we should be informed about Uncorrectable errors as much as possible, as they indicate Link integrity issues. > - if (!__ratelimit(ratelimit)) > + if (!__ratelimit(irq_ratelimit)) { > + mask_reported_error(dev, info); > + return; > + } > + if (!__ratelimit(log_ratelimit)) > return; Nit: add a line between these two ifs All the best, Karolina
On Thu, Jan 16, 2025 at 4:02 AM Karolina Stolarek <karolina.stolarek@oracle.com> wrote: > To confirm that I understand the flow -- when we're processing > aer_err_info, that potentially carries a couple of errors, and we hit a > ratelimit, we mask the error bits in Error Status Register and print > a warning. After doing so, we won't see these types of errors reported > again. What if some time passes (let's say, 2 mins), and we hit a > condition that would normally generate an error but it's now masked? Are > we fine with missing it? I think we should be informed about > Uncorrectable errors as much as possible, as they indicate Link > integrity issues. Your understanding is correct. There's definitely more nuance/tradeoff with uncorrectable errors (likelihood of uncorrectable spam vs. missing critical errors). At the minimum, I think the uncorrectable IRQ default should be higher (semi-arbitrarily chose defaults for IRQs). I think a dynamic (un)masking in the kernel is a bit too much and punted the decision to userspace (e.g. rasdaemon et al.) to manage (part of OCP Fault Management groups roadmap). Other options include: - only focus on correctable errors - seen uncorrectable spam e.g. new HW bringup but it is rarer - some type of system-wide toggle (sysfs, kernel config/cmdline) for uncorrectable spam handling (may be clunky) Thanks, Jon
On 18/01/2025 02:58, Jon Pan-Doh wrote: > On Thu, Jan 16, 2025 at 4:02 AM Karolina Stolarek > <karolina.stolarek@oracle.com> wrote: >> To confirm that I understand the flow -- when we're processing >> aer_err_info, that potentially carries a couple of errors, and we hit a >> ratelimit, we mask the error bits in Error Status Register and print >> a warning. After doing so, we won't see these types of errors reported >> again. What if some time passes (let's say, 2 mins), and we hit a >> condition that would normally generate an error but it's now masked? Are >> we fine with missing it? I think we should be informed about >> Uncorrectable errors as much as possible, as they indicate Link >> integrity issues. > > Your understanding is correct. There's definitely more nuance/tradeoff > with uncorrectable errors (likelihood of uncorrectable spam vs. > missing critical errors). At the minimum, I think the uncorrectable > IRQ default should be higher (semi-arbitrarily chose defaults for > IRQs). My comment was mostly me worrying about Uncorrectable errors. Pulling the plug on Correctable errors after we see too many is reasonable, I think. > I think a dynamic (un)masking in the kernel is a bit too much and > punted the decision to userspace (e.g. rasdaemon et al.) to manage > (part of OCP Fault Management groups roadmap). I agree. Still, if we decide to go with IRQ masking for (Un)correctable errors, this should be communicated to the user in the documentation. All the best, Karolina > > Other options include: > - only focus on correctable errors > - seen uncorrectable spam e.g. new HW bringup but it is rarer > - some type of system-wide toggle (sysfs, kernel config/cmdline) for > uncorrectable spam handling (may be clunky) > > Thanks, > Jon
On Tue, Jan 14, 2025 at 11:42:57PM -0800, Jon Pan-Doh wrote: > After ratelimiting logs, spammy devices can still slow execution by > continued AER IRQ servicing. > > Add higher per-device ratelimits for AER errors to mask out those IRQs. > Set the default rate to 3x default AER ratelimit (30 per 5s). Masking errors at the register level feels overzealous, in particular because it also disables logging via tracepoints. Is there a concrete device that necessitates this change? If there is, consider adding a quirk for this particular device which masks specific errors, but doesn't affect other devices. If there isn't, consider dropping this change until a buggy device appears that actually needs it. Thanks, Lukas
On Sat, 25 Jan 2025 08:39:35 +0100 Lukas Wunner <lukas@wunner.de> wrote: > On Tue, Jan 14, 2025 at 11:42:57PM -0800, Jon Pan-Doh wrote: > > After ratelimiting logs, spammy devices can still slow execution by > > continued AER IRQ servicing. > > > > Add higher per-device ratelimits for AER errors to mask out those IRQs. > > Set the default rate to 3x default AER ratelimit (30 per 5s). > > Masking errors at the register level feels overzealous, > in particular because it also disables logging via tracepoints. > > Is there a concrete device that necessitates this change? > If there is, consider adding a quirk for this particular device > which masks specific errors, but doesn't affect other devices. > If there isn't, consider dropping this change until a buggy device > appears that actually needs it. Fully agree with this comment. At very least this should default to not ratelimiting on the tracepoints unless a specific opt in has occurred (probably a platform or device driver quirk). In particular I'd worry that you are masking whatever errors are finally trigger masking. That might be the only one of that particular type that was seen and I think the only report we see is the 'I masked it message'. So rasdaemon for example never sees the error at all. So another tweak would be report one last time so we definitely see any given error type at least once. For CXL errors we trigger off one AER error type (internal error), but then that is multiplexed onto finer grained errors. Even if we fix the above we would want the masking in the CXL RAS controls, not AER. Jonathan > > Thanks, > > Lukas >
On Tue, 14 Jan 2025 23:42:57 -0800 Jon Pan-Doh <pandoh@google.com> wrote: > After ratelimiting logs, spammy devices can still slow execution by > continued AER IRQ servicing. > > Add higher per-device ratelimits for AER errors to mask out those IRQs. > Set the default rate to 3x default AER ratelimit (30 per 5s). > > Tested using aer-inject[1] tool. Injected 32 AER errors. Observed IRQ > masked via lspci and sysfs counters record 31 errors (1 masked). > > Before: CEMsk: BadTLP- > After: CEMsk: BadTLP+ > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git > > Signed-off-by: Jon Pan-Doh <pandoh@google.com> Hi Jon, Comment inline. > --- > Documentation/PCI/pcieaer-howto.rst | 4 +- > drivers/pci/pcie/aer.c | 64 +++++++++++++++++++++++++---- > 2 files changed, 57 insertions(+), 11 deletions(-) > > diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst > index 5546de60f184..d41272504b18 100644 > --- a/Documentation/PCI/pcieaer-howto.rst > +++ b/Documentation/PCI/pcieaer-howto.rst > @@ -88,8 +88,8 @@ fields. > AER Ratelimits > ------------------------- > > -Error messages are ratelimited per device and error type. This prevents spammy > -devices from flooding the console. > +Errors, both at log and IRQ level, are ratelimited per device and error type. > +This prevents spammy devices from stalling execution. > > AER Statistics / Counters > ------------------------- > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index 025c50b0f293..1db70ae87f52 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -87,6 +87,8 @@ struct aer_info { > u64 rootport_total_nonfatal_errs; > > /* Ratelimits for errors */ > + struct ratelimit_state cor_irq_ratelimit; > + struct ratelimit_state uncor_irq_ratelimit; > struct ratelimit_state cor_log_ratelimit; > struct ratelimit_state uncor_log_ratelimit; > }; > @@ -379,6 +381,10 @@ void pci_aer_init(struct pci_dev *dev) > return; > > dev->aer_info = kzalloc(sizeof(struct aer_info), GFP_KERNEL); > + ratelimit_state_init(&dev->aer_info->cor_irq_ratelimit, > + DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST*3); > + ratelimit_state_init(&dev->aer_info->uncor_irq_ratelimit, > + DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST*3); > ratelimit_state_init(&dev->aer_info->cor_log_ratelimit, > DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); > ratelimit_state_init(&dev->aer_info->uncor_log_ratelimit, > @@ -676,6 +682,39 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev, > } > } > > +static void mask_reported_error(struct pci_dev *dev, struct aer_err_info *info) > +{ > + const char **strings; > + const char *errmsg; > + u16 aer_offset = dev->aer_cap; > + u16 mask_reg_offset; > + u32 mask; > + unsigned long status = info->status; > + int i; > + > + if (info->severity == AER_CORRECTABLE) { > + strings = aer_correctable_error_string; > + mask_reg_offset = PCI_ERR_COR_MASK; > + } else { > + strings = aer_uncorrectable_error_string; > + mask_reg_offset = PCI_ERR_UNCOR_MASK; > + } > + > + pci_read_config_dword(dev, aer_offset + mask_reg_offset, &mask); > + mask |= status; > + pci_write_config_dword(dev, aer_offset + mask_reg_offset, mask); > + > + pci_warn(dev, "%s error(s) masked due to rate-limiting:", > + aer_error_severity_string[info->severity]); > + for_each_set_bit(i, &status, 32) { > + errmsg = strings[i]; > + if (!errmsg) > + errmsg = "Unknown Error Bit"; > + > + pci_warn(dev, " [%2d] %-22s\n", i, errmsg); > + } > +} > + > static void __print_tlp_header(struct pci_dev *dev, struct pcie_tlp_log *t) > { > pci_err(dev, " TLP Header: %08x %08x %08x %08x\n", > @@ -713,7 +752,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) > int layer, agent; > int id = pci_dev_id(dev); > const char *level; > - struct ratelimit_state *ratelimit; > + struct ratelimit_state *irq_ratelimit; > + struct ratelimit_state *log_ratelimit; > > if (!info->status) { > pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n", > @@ -722,14 +762,20 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) > } > > if (info->severity == AER_CORRECTABLE) { > - ratelimit = &dev->aer_info->cor_log_ratelimit; > + irq_ratelimit = &dev->aer_info->cor_irq_ratelimit; > + log_ratelimit = &dev->aer_info->cor_log_ratelimit; > level = KERN_WARNING; > } else { > - ratelimit = &dev->aer_info->uncor_log_ratelimit; > + irq_ratelimit = &dev->aer_info->uncor_irq_ratelimit; > + log_ratelimit = &dev->aer_info->uncor_log_ratelimit; > level = KERN_ERR; > } > > - if (!__ratelimit(ratelimit)) > + if (!__ratelimit(irq_ratelimit)) { > + mask_reported_error(dev, info); > + return; So if I follow correctly. We count irqs for any error type and then mask whatever was set on one that triggered this rate_limit check? That last one isn't reported other than via a log message. Imagine that is a totally unrelated error to the earlier ones, now RASDaemon has no info on it at all as the tracepoint never fired. To me that's a very different situation to it knowing there were 10 errors of the type vs more. I'd like to see that final trace point and also to see a tracepoint that lets rasdaemon etc know you cut off errors after this point + rasdaemon support for using that. Terry can address if we need to do anything different for CXL given he is updating this handling for that. Superficially I think we need to driver the masking down into the CXL RAS capability registers. Internal error in the AER capabilities is far to big a hammer to apply. > + } > + if (!__ratelimit(log_ratelimit)) > return; > > layer = AER_GET_LAYER_ERROR(info->severity, info->status); > @@ -776,14 +822,14 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity, > int layer, agent, tlp_header_valid = 0; > u32 status, mask; > struct aer_err_info info; > - struct ratelimit_state *ratelimit; > + struct ratelimit_state *log_ratelimit; > > if (aer_severity == AER_CORRECTABLE) { > - ratelimit = &dev->aer_info->cor_log_ratelimit; > + log_ratelimit = &dev->aer_info->cor_log_ratelimit; > status = aer->cor_status; > mask = aer->cor_mask; > } else { > - ratelimit = &dev->aer_info->uncor_log_ratelimit; > + log_ratelimit = &dev->aer_info->uncor_log_ratelimit; > status = aer->uncor_status; > mask = aer->uncor_mask; > tlp_header_valid = status & AER_LOG_TLP_MASKS; > @@ -799,8 +845,8 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity, > info.first_error = PCI_ERR_CAP_FEP(aer->cap_control); > > pci_dev_aer_stats_incr(dev, &info); > - > - if (!__ratelimit(ratelimit)) > + /* Only ratelimit logs (no IRQ) as AERs reported via GHES/CXL (caller). */ > + if (!__ratelimit(log_ratelimit)) > return; > > pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
On 25/01/2025 08:39, Lukas Wunner wrote: > > Masking errors at the register level feels overzealous, > in particular because it also disables logging via tracepoints. > > Is there a concrete device that necessitates this change? I faced issues with excessive Correctable Errors reporting with Samsung PM1733 NVMe (a couple of thousand errors per hour), which were still polluting the logs even after introducing a ratelimit (first every 2s, second ever 30s, as proposed in [1]). Also, instead of masking the errors automatically, we could give a user a sysfs knob to turn error generation off and on. > If there is, consider adding a quirk for this particular device > which masks specific errors, but doesn't affect other devices. There were many other reports of Correctable Error floods, as signaled in the cover letter, so it's hard to pinpoint the specific driver that should mask these errors. All the best, Karolina ------------------------------------- [1] - https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stolarek@oracle.com/ > If there isn't, consider dropping this change until a buggy device > appears that actually needs it. > > Thanks, > > Lukas
On Thu, Feb 06, 2025 at 02:56:20PM +0100, Karolina Stolarek wrote: > On 25/01/2025 08:39, Lukas Wunner wrote: > > Masking errors at the register level feels overzealous, > > in particular because it also disables logging via tracepoints. > > > > Is there a concrete device that necessitates this change? > > I faced issues with excessive Correctable Errors reporting with Samsung > PM1733 NVMe (a couple of thousand errors per hour), which were still > polluting the logs even after introducing a ratelimit I'd suggest to add a "u32 aer_cor_mask" to "struct pci_dev" in the "#ifdef CONFIG_PCIEAER" section. Then add a "DECLARE_PCI_FIXUP_HEADER()" macro in drivers/pci/quirks.c for the Samsung PM1733 which calls a new function which sets exactly the error bits you're seeing to aer_cor_mask. This should be #ifdef'ed to CONFIG_PCIEAER as well. Finally, amend aer.c to set the bits in aer_cor_mask in the PCI_ERR_COR_MASK register on probe. > > If there is, consider adding a quirk for this particular device > > which masks specific errors, but doesn't affect other devices. > > There were many other reports of Correctable Error floods, as signaled in > the cover letter, so it's hard to pinpoint the specific driver that should > mask these errors. If a specific device frequently signals the same errors, I think that's a bug of that device and if the vendor doesn't provide a firmware update, quiescing the errors through a quirk is a plausible solution. Of course if this is widespread, it becomes a maintenance nightmare and then the quirk approach is not a viable option. I cannot say whether that's the case. So far there's a report for one specific product (the Samsung drive) and hinting that the problem may be widespread. It's difficult to make a recommendation without precise data. Thanks, Lukas
diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst index 5546de60f184..d41272504b18 100644 --- a/Documentation/PCI/pcieaer-howto.rst +++ b/Documentation/PCI/pcieaer-howto.rst @@ -88,8 +88,8 @@ fields. AER Ratelimits ------------------------- -Error messages are ratelimited per device and error type. This prevents spammy -devices from flooding the console. +Errors, both at log and IRQ level, are ratelimited per device and error type. +This prevents spammy devices from stalling execution. AER Statistics / Counters ------------------------- diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 025c50b0f293..1db70ae87f52 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -87,6 +87,8 @@ struct aer_info { u64 rootport_total_nonfatal_errs; /* Ratelimits for errors */ + struct ratelimit_state cor_irq_ratelimit; + struct ratelimit_state uncor_irq_ratelimit; struct ratelimit_state cor_log_ratelimit; struct ratelimit_state uncor_log_ratelimit; }; @@ -379,6 +381,10 @@ void pci_aer_init(struct pci_dev *dev) return; dev->aer_info = kzalloc(sizeof(struct aer_info), GFP_KERNEL); + ratelimit_state_init(&dev->aer_info->cor_irq_ratelimit, + DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST*3); + ratelimit_state_init(&dev->aer_info->uncor_irq_ratelimit, + DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST*3); ratelimit_state_init(&dev->aer_info->cor_log_ratelimit, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); ratelimit_state_init(&dev->aer_info->uncor_log_ratelimit, @@ -676,6 +682,39 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev, } } +static void mask_reported_error(struct pci_dev *dev, struct aer_err_info *info) +{ + const char **strings; + const char *errmsg; + u16 aer_offset = dev->aer_cap; + u16 mask_reg_offset; + u32 mask; + unsigned long status = info->status; + int i; + + if (info->severity == AER_CORRECTABLE) { + strings = aer_correctable_error_string; + mask_reg_offset = PCI_ERR_COR_MASK; + } else { + strings = aer_uncorrectable_error_string; + mask_reg_offset = PCI_ERR_UNCOR_MASK; + } + + pci_read_config_dword(dev, aer_offset + mask_reg_offset, &mask); + mask |= status; + pci_write_config_dword(dev, aer_offset + mask_reg_offset, mask); + + pci_warn(dev, "%s error(s) masked due to rate-limiting:", + aer_error_severity_string[info->severity]); + for_each_set_bit(i, &status, 32) { + errmsg = strings[i]; + if (!errmsg) + errmsg = "Unknown Error Bit"; + + pci_warn(dev, " [%2d] %-22s\n", i, errmsg); + } +} + static void __print_tlp_header(struct pci_dev *dev, struct pcie_tlp_log *t) { pci_err(dev, " TLP Header: %08x %08x %08x %08x\n", @@ -713,7 +752,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) int layer, agent; int id = pci_dev_id(dev); const char *level; - struct ratelimit_state *ratelimit; + struct ratelimit_state *irq_ratelimit; + struct ratelimit_state *log_ratelimit; if (!info->status) { pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n", @@ -722,14 +762,20 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) } if (info->severity == AER_CORRECTABLE) { - ratelimit = &dev->aer_info->cor_log_ratelimit; + irq_ratelimit = &dev->aer_info->cor_irq_ratelimit; + log_ratelimit = &dev->aer_info->cor_log_ratelimit; level = KERN_WARNING; } else { - ratelimit = &dev->aer_info->uncor_log_ratelimit; + irq_ratelimit = &dev->aer_info->uncor_irq_ratelimit; + log_ratelimit = &dev->aer_info->uncor_log_ratelimit; level = KERN_ERR; } - if (!__ratelimit(ratelimit)) + if (!__ratelimit(irq_ratelimit)) { + mask_reported_error(dev, info); + return; + } + if (!__ratelimit(log_ratelimit)) return; layer = AER_GET_LAYER_ERROR(info->severity, info->status); @@ -776,14 +822,14 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity, int layer, agent, tlp_header_valid = 0; u32 status, mask; struct aer_err_info info; - struct ratelimit_state *ratelimit; + struct ratelimit_state *log_ratelimit; if (aer_severity == AER_CORRECTABLE) { - ratelimit = &dev->aer_info->cor_log_ratelimit; + log_ratelimit = &dev->aer_info->cor_log_ratelimit; status = aer->cor_status; mask = aer->cor_mask; } else { - ratelimit = &dev->aer_info->uncor_log_ratelimit; + log_ratelimit = &dev->aer_info->uncor_log_ratelimit; status = aer->uncor_status; mask = aer->uncor_mask; tlp_header_valid = status & AER_LOG_TLP_MASKS; @@ -799,8 +845,8 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity, info.first_error = PCI_ERR_CAP_FEP(aer->cap_control); pci_dev_aer_stats_incr(dev, &info); - - if (!__ratelimit(ratelimit)) + /* Only ratelimit logs (no IRQ) as AERs reported via GHES/CXL (caller). */ + if (!__ratelimit(log_ratelimit)) return; pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
After ratelimiting logs, spammy devices can still slow execution by continued AER IRQ servicing. Add higher per-device ratelimits for AER errors to mask out those IRQs. Set the default rate to 3x default AER ratelimit (30 per 5s). Tested using aer-inject[1] tool. Injected 32 AER errors. Observed IRQ masked via lspci and sysfs counters record 31 errors (1 masked). Before: CEMsk: BadTLP- After: CEMsk: BadTLP+ [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git Signed-off-by: Jon Pan-Doh <pandoh@google.com> --- Documentation/PCI/pcieaer-howto.rst | 4 +- drivers/pci/pcie/aer.c | 64 +++++++++++++++++++++++++---- 2 files changed, 57 insertions(+), 11 deletions(-)