Message ID | 20250321015806.954866-6-pandoh@google.com (mailing list archive) |
---|---|
State | New |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | Rate limit AER logs | expand |
On Thu, Mar 20, 2025 at 06:58:03PM -0700, Jon Pan-Doh wrote: > Update name to reflect the broader definition of structs/variables that > are stored (e.g. ratelimits). This is a preparatory patch for adding rate > limit support. > > Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com> > Signed-off-by: Jon Pan-Doh <pandoh@google.com> > Reported-by: Sargun Dhillon <sargun@meta.com> What did Sargun report? Is there a bug fix in here? Can we include a URL to whatever Sargun reported?
On Fri, Mar 21, 2025 at 3:01 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > What did Sargun report? Is there a bug fix in here? Can we include a > URL to whatever Sargun reported? Paul added Reported-By and Acked-By tags[1] for the series which I applied to each patch. Checkpatch mildly complained about not having a Closes tag for Reported-By. Sargun/Paul, do you have a Closes/URL tag? Thanks, Jon
On Fri, Mar 21, 2025 at 05:01:15PM -0500, Bjorn Helgaas wrote: > On Thu, Mar 20, 2025 at 06:58:03PM -0700, Jon Pan-Doh wrote: > > Update name to reflect the broader definition of structs/variables that > > are stored (e.g. ratelimits). This is a preparatory patch for adding rate > > limit support. > > > > Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com> > > Signed-off-by: Jon Pan-Doh <pandoh@google.com> > > Reported-by: Sargun Dhillon <sargun@meta.com> > > What did Sargun report? Is there a bug fix in here? Can we include a > URL to whatever Sargun reported? He reported RCU CPU stall warnings and CSD-lock warnings internally within Meta, so sorry, no useful URL. I did put together a series of hacks that fix the problem: (1) Disabling __aer_print_error() entirely, (2) Disabling __aer_print_error() printk() and sysfs, (3) Disabling only __aer_print_error() printk(), and finally (4) Throttling __aer_print_error() printk(). Jon's patch looks to cover my #4 plus it looks to allow run-time control of the throttling. So my patch is strictly a stop-the-bleeding measure for Meta's fleet while this patch series makes its way upstream. I do plan to look at Jon's patch in more detail when he posts the next version. Fair enough? Thanx, Paul
On Fri, Mar 21, 2025 at 03:15:26PM -0700, Jon Pan-Doh wrote: > On Fri, Mar 21, 2025 at 3:01 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > What did Sargun report? Is there a bug fix in here? Can we include a > > URL to whatever Sargun reported? > > Paul added Reported-By and Acked-By tags[1] for the series which I > applied to each patch. Checkpatch mildly complained about not having a > Closes tag for Reported-By. > > Sargun/Paul, do you have a Closes/URL tag? The URL of this email? Perhaps more productively, I hope that we can run tests on this series, but it will take some time. Thanx, Paul
On Fri, Mar 21, 2025 at 03:16:48PM -0700, Paul E. McKenney wrote: > On Fri, Mar 21, 2025 at 05:01:15PM -0500, Bjorn Helgaas wrote: > > On Thu, Mar 20, 2025 at 06:58:03PM -0700, Jon Pan-Doh wrote: > > > Update name to reflect the broader definition of structs/variables that > > > are stored (e.g. ratelimits). This is a preparatory patch for adding rate > > > limit support. > > > > > > Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com> > > > Signed-off-by: Jon Pan-Doh <pandoh@google.com> > > > Reported-by: Sargun Dhillon <sargun@meta.com> > > > > What did Sargun report? Is there a bug fix in here? Can we include a > > URL to whatever Sargun reported? > > He reported RCU CPU stall warnings and CSD-lock warnings internally > within Meta, so sorry, no useful URL. Oh, I see now how this happened via your ack email, Paul. So I think it would make sense for Jon to add Sargun's reported-by to "[PATCH v5 6/8] PCI/AER: Introduce ratelimit for error logs", along with a line in the commit log connecting Sargun with the RCU CPU stall warnings, etc., because that's the patch that actually addresses those warnings. I wouldn't add it to the other patches because it's just confusing. Bjorn
On Fri, Mar 21, 2025 at 05:39:30PM -0500, Bjorn Helgaas wrote: > On Fri, Mar 21, 2025 at 03:16:48PM -0700, Paul E. McKenney wrote: > > On Fri, Mar 21, 2025 at 05:01:15PM -0500, Bjorn Helgaas wrote: > > > On Thu, Mar 20, 2025 at 06:58:03PM -0700, Jon Pan-Doh wrote: > > > > Update name to reflect the broader definition of structs/variables that > > > > are stored (e.g. ratelimits). This is a preparatory patch for adding rate > > > > limit support. > > > > > > > > Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com> > > > > Signed-off-by: Jon Pan-Doh <pandoh@google.com> > > > > Reported-by: Sargun Dhillon <sargun@meta.com> > > > > > > What did Sargun report? Is there a bug fix in here? Can we include a > > > URL to whatever Sargun reported? > > > > He reported RCU CPU stall warnings and CSD-lock warnings internally > > within Meta, so sorry, no useful URL. > > Oh, I see now how this happened via your ack email, Paul. So I think > it would make sense for Jon to add Sargun's reported-by to "[PATCH v5 > 6/8] PCI/AER: Introduce ratelimit for error logs", along with a line > in the commit log connecting Sargun with the RCU CPU stall warnings, > etc., because that's the patch that actually addresses those warnings. > > I wouldn't add it to the other patches because it's just confusing. Works for me! Thanx, Paul
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 3c63a6963608..f657edca8769 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -54,11 +54,11 @@ struct aer_rpc { DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX); }; -/* AER stats for the device */ -struct aer_stats { +/* AER report for the device */ +struct aer_report { /* - * Fields for all AER capable devices. They indicate the errors + * Stats for all AER capable devices. They indicate the errors * "as seen by this device". Note that this may mean that if an * end point is causing problems, the AER counters may increment * at its link partner (e.g. root port) because the errors will be @@ -80,7 +80,7 @@ struct aer_stats { u64 dev_total_nonfatal_errs; /* - * Fields for Root ports & root complex event collectors only, these + * Stats for Root ports & root complex event collectors only, these * indicate the total number of ERR_COR, ERR_FATAL, and ERR_NONFATAL * messages received by the root port / event collector, INCLUDING the * ones that are generated internally (by the rootport itself) @@ -377,7 +377,7 @@ void pci_aer_init(struct pci_dev *dev) if (!dev->aer_cap) return; - dev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL); + dev->aer_report = kzalloc(sizeof(*dev->aer_report), GFP_KERNEL); /* * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER, @@ -398,8 +398,8 @@ void pci_aer_init(struct pci_dev *dev) void pci_aer_exit(struct pci_dev *dev) { - kfree(dev->aer_stats); - dev->aer_stats = NULL; + kfree(dev->aer_report); + dev->aer_report = NULL; } #define AER_AGENT_RECEIVER 0 @@ -537,10 +537,10 @@ static const char *aer_agent_string[] = { { \ unsigned int i; \ struct pci_dev *pdev = to_pci_dev(dev); \ - u64 *stats = pdev->aer_stats->stats_array; \ + u64 *stats = pdev->aer_report->stats_array; \ size_t len = 0; \ \ - for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\ + for (i = 0; i < ARRAY_SIZE(pdev->aer_report->stats_array); i++) {\ if (strings_array[i]) \ len += sysfs_emit_at(buf, len, "%s %llu\n", \ strings_array[i], \ @@ -551,7 +551,7 @@ static const char *aer_agent_string[] = { i, stats[i]); \ } \ len += sysfs_emit_at(buf, len, "TOTAL_%s %llu\n", total_string, \ - pdev->aer_stats->total_field); \ + pdev->aer_report->total_field); \ return len; \ } \ static DEVICE_ATTR_RO(name) @@ -572,7 +572,7 @@ aer_stats_dev_attr(aer_dev_nonfatal, dev_nonfatal_errs, char *buf) \ { \ struct pci_dev *pdev = to_pci_dev(dev); \ - return sysfs_emit(buf, "%llu\n", pdev->aer_stats->field); \ + return sysfs_emit(buf, "%llu\n", pdev->aer_report->field); \ } \ static DEVICE_ATTR_RO(name) @@ -599,7 +599,7 @@ static umode_t aer_stats_attrs_are_visible(struct kobject *kobj, struct device *dev = kobj_to_dev(kobj); struct pci_dev *pdev = to_pci_dev(dev); - if (!pdev->aer_stats) + if (!pdev->aer_report) return 0; if ((a == &dev_attr_aer_rootport_total_err_cor.attr || @@ -622,25 +622,25 @@ void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info) unsigned long status = info->status & ~info->mask; int i, max = -1; u64 *counter = NULL; - struct aer_stats *aer_stats = pdev->aer_stats; + struct aer_report *aer_report = pdev->aer_report; - if (!aer_stats) + if (!aer_report) return; switch (info->severity) { case AER_CORRECTABLE: - aer_stats->dev_total_cor_errs++; - counter = &aer_stats->dev_cor_errs[0]; + aer_report->dev_total_cor_errs++; + counter = &aer_report->dev_cor_errs[0]; max = AER_MAX_TYPEOF_COR_ERRS; break; case AER_NONFATAL: - aer_stats->dev_total_nonfatal_errs++; - counter = &aer_stats->dev_nonfatal_errs[0]; + aer_report->dev_total_nonfatal_errs++; + counter = &aer_report->dev_nonfatal_errs[0]; max = AER_MAX_TYPEOF_UNCOR_ERRS; break; case AER_FATAL: - aer_stats->dev_total_fatal_errs++; - counter = &aer_stats->dev_fatal_errs[0]; + aer_report->dev_total_fatal_errs++; + counter = &aer_report->dev_fatal_errs[0]; max = AER_MAX_TYPEOF_UNCOR_ERRS; break; } @@ -652,19 +652,19 @@ void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info) static void pci_rootport_aer_stats_incr(struct pci_dev *pdev, struct aer_err_source *e_src) { - struct aer_stats *aer_stats = pdev->aer_stats; + struct aer_report *aer_report = pdev->aer_report; - if (!aer_stats) + if (!aer_report) return; if (e_src->status & PCI_ERR_ROOT_COR_RCV) - aer_stats->rootport_total_cor_errs++; + aer_report->rootport_total_cor_errs++; if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) { if (e_src->status & PCI_ERR_ROOT_FATAL_RCV) - aer_stats->rootport_total_fatal_errs++; + aer_report->rootport_total_fatal_errs++; else - aer_stats->rootport_total_nonfatal_errs++; + aer_report->rootport_total_nonfatal_errs++; } } diff --git a/include/linux/pci.h b/include/linux/pci.h index e4bf67bf8172..900edb6f8f62 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -346,7 +346,7 @@ struct pci_dev { u8 hdr_type; /* PCI header type (`multi' flag masked out) */ #ifdef CONFIG_PCIEAER u16 aer_cap; /* AER capability offset */ - struct aer_stats *aer_stats; /* AER stats for this device */ + struct aer_report *aer_report; /* AER report for this device */ #endif #ifdef CONFIG_PCIEPORTBUS struct rcec_ea *rcec_ea; /* RCEC cached endpoint association */